GSA / data.gov

Main repository for the data.gov service
https://data.gov

Block harvest_source_list API endpoint on catalog #4725

Open FuhuXia opened 2 months ago

FuhuXia commented 2 months ago

The harvest_source_list endpoint from ckanext-harvest includes deleted harvest sources in its results, but anonymous users are not supposed to see deleted packages. The API also does not support pagination. To show all of the catalog's harvest sources, we have to set a very high limit (2000?) so that all current (active) and deleted (inactive) sources are returned in one API call, which is very slow.

I think we should block this API endpoint and guide users to these alternative APIs:

  1. Call this API to get all harvest sources in paginated results: https://catalog.data.gov/api/action/package_search?fq=(dataset_type:harvest)&fl=id,name,url,organization&rows=1000

  2. Get details on a specific source with this API. You can use either id or name: https://catalog.data.gov/api/action/harvest_source_show?id=energy-json
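The first alternative can be paged through with a small client. This is only a rough sketch: it assumes package_search accepts the standard CKAN start and rows parameters and returns result.count and result.results, and the helper names (page_url, fetch_json, iter_harvest_sources) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://catalog.data.gov/api/action/package_search"


def page_url(start, rows=1000):
    """Build the package_search URL for one page of harvest sources."""
    params = {
        "fq": "(dataset_type:harvest)",
        "fl": "id,name,url,organization",
        "rows": rows,
        "start": start,  # assumption: standard CKAN offset parameter
    }
    return f"{BASE}?{urlencode(params)}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_harvest_sources(fetch=fetch_json, rows=1000):
    """Yield every harvest source, requesting one page at a time."""
    start = 0
    while True:
        result = fetch(page_url(start, rows))["result"]
        page = result["results"]
        yield from page
        start += len(page)
        # Stop on an empty page or once the reported total is reached.
        if not page or start >= result["count"]:
            break
```

Calling list(iter_harvest_sources()) would collect all sources; the fetch argument is injectable so the paging logic can be checked without hitting the live catalog.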

How to reproduce

https://catalog.data.gov/api/action/harvest_source_list

Search for active: false in the result.
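To check this programmatically rather than by eye, something like the following could work. It is a sketch that assumes the response is the usual CKAN envelope with a result list whose entries carry an active flag; split_by_active is a hypothetical helper, and the sample names in the usage line are invented.

```python
def split_by_active(response):
    """Split harvest_source_list entries into (active, inactive) lists.

    Assumes response["result"] is a list of dicts with an "active" key;
    entries without the key are treated as active.
    """
    sources = response["result"]
    active = [s for s in sources if s.get("active", True)]
    inactive = [s for s in sources if not s.get("active", True)]
    return active, inactive


# Hypothetical sample data, not real catalog output.
sample = {"result": [{"name": "a", "active": True}, {"name": "b", "active": False}]}
active, inactive = split_by_active(sample)
```

Any entry landing in the inactive list is a deleted source that is being exposed to anonymous users.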

Sketch

We have a list of blocked API endpoints in the nginx config:

https://github.com/GSA/catalog.data.gov/blob/8dda50797980f40d6921aa3e299087ddfe31d8c9/proxy/nginx-common.conf#L27-L44

gujral-rei commented 2 months ago

Redirect the call to the package_search API.