ckan / ckanext-harvest

Remote harvesting extension for CKAN

Filtering orgs & groups don't work while using multiple values #518

Open wanam opened 1 year ago

wanam commented 1 year ago

I'm trying to harvest a CKAN data source, https://dati.comune.roma.it/catalog, using the filter configuration below:

{
    "organizations_filter_include": ["atac-s-p-a-azienda-per-la-mobilita", "roma-servizi-per-la-mobilita"]
}

The CKAN harvester converts the above configuration into a URL that it uses to gather the datasets' metadata, but the filter syntax in the resulting URL does not seem to work properly. Here is the generated URL: https://dati.comune.roma.it/catalog/api/3/action/package_search?rows=100&start=0&sort=id+asc&fq=organization%3Aatac-s-p-a-azienda-per-la-mobilita+OR+organization%3Aroma-servizi-per-la-mobilita

Sending this request returns all the datasets in the remote data catalog (~340), while it should return only the ~11 datasets that belong to the two organizations.
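For reference, the filter that the remote catalog receives can be inspected by decoding the query string of the generated URL. This is a minimal sketch using only the Python standard library; the variable names are illustrative and are not part of the harvester.

```python
from urllib.parse import urlsplit, parse_qs

# The URL generated by the harvester, copied from the report above.
generated_url = (
    "https://dati.comune.roma.it/catalog/api/3/action/package_search"
    "?rows=100&start=0&sort=id+asc"
    "&fq=organization%3Aatac-s-p-a-azienda-per-la-mobilita"
    "+OR+organization%3Aroma-servizi-per-la-mobilita"
)

# parse_qs decodes '+' to a space and '%3A' to ':'.
params = parse_qs(urlsplit(generated_url).query)
fq = params["fq"][0]
print(fq)
```

The decoded value is two separate `organization:` terms joined by a top-level `OR`, rather than one grouped term.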

I'm not sure whether this is a CKAN query compatibility issue. It is reproducible on CKAN 2.9.

The correct URL format should be: https://dati.comune.roma.it/catalog/api/3/action/package_search?rows=100&start=0&sort=id+asc&fq=organization%3A(atac-s-p-a-azienda-per-la-mobilita OR roma-servizi-per-la-mobilita)
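The grouped form above contains raw spaces and parentheses. As a sketch, assuming the request parameters are built as a plain dictionary (the `params` name is illustrative), `urllib.parse.urlencode` produces a fully percent-encoded equivalent:

```python
from urllib.parse import urlencode

base_url = "https://dati.comune.roma.it/catalog/api/3/action/package_search"
org_names = [
    "atac-s-p-a-azienda-per-la-mobilita",
    "roma-servizi-per-la-mobilita",
]

# One field name, with all values grouped inside a single pair of parentheses.
fq = "organization:(%s)" % " OR ".join(org_names)

params = {"rows": 100, "start": 0, "sort": "id asc", "fq": fq}

# urlencode turns spaces into '+', ':' into '%3A', and '(' / ')' into '%28' / '%29'.
url = base_url + "?" + urlencode(params)
print(url)
```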

Here is a quick fix I'm using on my ckan instance:

        # Group all values for the field inside one pair of parentheses,
        # e.g. organization:(org-a OR org-b)
        org_filter_include = self.config.get('organizations_filter_include', [])
        org_filter_exclude = self.config.get('organizations_filter_exclude', [])
        if org_filter_include:
            fq_terms.append('organization:(%s)' % ' OR '.join(org_filter_include))
        elif org_filter_exclude:
            fq_terms.append('-organization:(%s)' % ' OR '.join(org_filter_exclude))

The same applies to groups and tags.
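One way to apply the same grouping to every filter is a small helper. This is only a sketch: `build_fq_term` is a hypothetical function, not part of ckanext-harvest, and the `groups`/`tags` field names, the `*_filter_include` / `*_filter_exclude` key pattern for them, and the `example-group` value are assumptions made for illustration.

```python
def build_fq_term(field, include=(), exclude=()):
    """Return one grouped fq term for `field`, or None if no filter is set.

    Hypothetical helper: include takes precedence over exclude, mirroring
    the if/elif in the quick fix above.
    """
    if include:
        return "%s:(%s)" % (field, " OR ".join(include))
    if exclude:
        return "-%s:(%s)" % (field, " OR ".join(exclude))
    return None


# Example configuration; the groups entry is made up for illustration.
config = {
    "organizations_filter_include": [
        "atac-s-p-a-azienda-per-la-mobilita",
        "roma-servizi-per-la-mobilita",
    ],
    "groups_filter_exclude": ["example-group"],
}

fq_terms = []
# (search field, config key prefix) pairs; assumed naming, see the note above.
for field, prefix in (("organization", "organizations"),
                      ("groups", "groups"),
                      ("tags", "tags")):
    term = build_fq_term(
        field,
        config.get(prefix + "_filter_include", []),
        config.get(prefix + "_filter_exclude", []),
    )
    if term is not None:
        fq_terms.append(term)

print(" ".join(fq_terms))
```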