ckan / ckanext-harvest

Remote harvesting extension for CKAN

Allow filtering of remote datasets to be harvested #155

Open rossjones opened 9 years ago

rossjones commented 9 years ago

It would be very useful if there was a way of telling the CKAN harvester how to limit the datasets it harvests. For instance, by a specific extra, or the presence of the dataset in a specific organisation.

amercader commented 9 years ago

Totally agree. The CKAN harvester needs to be refactored to use package_search on the remote CKAN instead of the old REST API anyway. Once this is done, it would just be a matter of passing extra filters in the source config.
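As a rough illustration of the idea, here is a minimal sketch of how filters could be pushed to the remote CKAN through the v3 package_search action. The function name build_package_search_url and the shape of fq_terms are hypothetical and not part of ckanext-harvest; the sketch only assumes that package_search accepts the fq, rows and start parameters, and that the remote index exposes a field such as organization.

```python
from urllib.parse import urlencode


def build_package_search_url(remote_url, fq_terms, rows=100, start=0):
    """Build a v3 package_search URL that filters on the remote side.

    fq_terms is a hypothetical list of (field, value) pairs, e.g. taken
    from the harvest source config.
    """
    # Each pair becomes a filter term such as organization:"my-org";
    # all terms are combined with AND into one fq value.
    fq = " AND ".join('{}:"{}"'.format(field, value) for field, value in fq_terms)
    params = {"rows": rows, "start": start}
    if fq:
        params["fq"] = fq
    base = remote_url.rstrip("/")
    return "{}/api/3/action/package_search?{}".format(base, urlencode(params))
```

Called as build_package_search_url("https://demo.ckan.org/", [("organization", "my-org")]), this yields a URL whose fq parameter restricts results to that organisation, so the harvester would only page through matching datasets.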

filipefigcorreia commented 9 years ago

I've started working on something with the same goal but using a different approach. I'm not using package_search, I've just added a new extension point.

Anyway, using package_search seems like the way to go to me too. And I'm sure that using the v3 API (which package_search belongs to) would also help solve other issues.

rossjones commented 9 years ago

@filipefigcorreia we did the same sort of thing (but this time just using config for organization_filter_include/organization_filter_exclude) because of time constraints - https://github.com/datagovuk/ckanext-harvest/commit/01fdbbf682c007a38e98c06065a48b5b8addbe65

It works, but I think it's a little inelegant just because it is doing more work than it really needs to at runtime (we can only include/exclude after we've fetched) - I think a move to v3 of the API and a way to pass filters to the search would be much more efficient longer term.
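To make the trade-off concrete, here is a minimal sketch of filtering after fetching. It is not the code from the linked commit: the function keep_dataset is hypothetical, and it assumes each fetched dataset is a dict with an "organization" entry carrying a "name" key. Only the config key names organization_filter_include and organization_filter_exclude come from the description above.

```python
def keep_dataset(dataset, include=None, exclude=None):
    """Decide, after fetching, whether a remote dataset should be harvested.

    include / exclude stand in for the organization_filter_include and
    organization_filter_exclude config values (assumed to be lists of
    organisation names).
    """
    org = (dataset.get("organization") or {}).get("name")
    # With an include list, only datasets from the listed organisations pass.
    if include and org not in include:
        return False
    # With an exclude list, datasets from the listed organisations are dropped.
    if exclude and org in exclude:
        return False
    return True


# Every dataset has already been downloaded by the time the filter runs.
fetched = [
    {"name": "roads", "organization": {"name": "transport"}},
    {"name": "schools", "organization": {"name": "education"}},
    {"name": "orphan", "organization": None},
]
kept = [d["name"] for d in fetched if keep_dataset(d, include=["transport"])]
```

The filter itself is cheap; the cost is that each excluded dataset was still transferred from the remote site before being discarded, which is what passing the filter to the search would avoid.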

davidread commented 9 years ago

PR for this: https://github.com/ckan/ckanext-harvest/pull/168