ckan / ckanext-spatial

Geospatial extension for CKAN
http://docs.ckan.org/projects/ckanext-spatial

ckan-pycsw load fails with 3.5M datasets #214

Open adborden opened 5 years ago

adborden commented 5 years ago

The ckan-pycsw load job isn't built to handle a large number of datasets. It pulls all the datasets into memory, then all the existing pycsw records, and then uses set operations to work out which datasets are new, changed, or deleted. We started seeing the job run out of memory on the machine when working with 3.5 million datasets in CKAN. Additionally, the job expects to be the sole worker, running as a cron job once per day, which scales poorly as the number of datasets grows. It would be nice if this work could be split up over time and across machines.
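For reference, the set-based comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual ckan-pycsw code: the function name `diff_records` and the assumption that each side is a dict mapping a dataset ID to a last-modified timestamp are hypothetical.

```python
from typing import Dict, Set, Tuple


def diff_records(
    ckan_datasets: Dict[str, str],
    pycsw_records: Dict[str, str],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare two id -> timestamp mappings (hypothetical shape).

    Both mappings must be fully loaded before the comparison can run,
    which is why memory use grows with the total number of datasets.
    """
    ckan_ids = set(ckan_datasets)
    pycsw_ids = set(pycsw_records)

    new = ckan_ids - pycsw_ids          # in CKAN, missing from pycsw
    deleted = pycsw_ids - ckan_ids      # in pycsw, no longer in CKAN
    changed = {
        dataset_id
        for dataset_id in ckan_ids & pycsw_ids
        if ckan_datasets[dataset_id] != pycsw_records[dataset_id]
    }
    return new, changed, deleted
```

With 3.5 million entries on each side, the two ID sets plus the full records behind them all have to fit in memory at once.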

As a hack, I did some work to fetch datasets in batches of 1000 and process them. But ultimately, I think you would want the pycsw update to happen in "real time" as part of harvesting. If a dataset is updated, its pycsw record should be updated. If a dataset is deleted, its record should be removed. If a dataset has no record in pycsw yet, it should be added.
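A rough sketch of the batched approach, under stated assumptions: `fetch_page(start, rows)` is a hypothetical callable that returns one page of datasets, each dataset is a dict with `id` and `metadata_modified` keys, and `index` is a plain dict standing in for the pycsw repository. None of these names come from ckanext-spatial itself.

```python
from typing import Callable, Dict, Iterator, List, Set

Dataset = Dict[str, str]  # assumed shape: {"id": ..., "metadata_modified": ...}
FetchPage = Callable[[int, int], List[Dataset]]


def iter_batches(fetch_page: FetchPage, batch_size: int = 1000) -> Iterator[List[Dataset]]:
    """Yield datasets one page at a time instead of loading them all."""
    start = 0
    while True:
        page = fetch_page(start, batch_size)
        if not page:
            return
        yield page
        start += len(page)


def sync(fetch_page: FetchPage, index: Dict[str, str], batch_size: int = 1000) -> Dict[str, int]:
    """Add or update records batch by batch, then delete records not seen."""
    seen: Set[str] = set()
    counts = {"added": 0, "updated": 0, "deleted": 0}

    for batch in iter_batches(fetch_page, batch_size):
        for dataset in batch:
            dataset_id = dataset["id"]
            seen.add(dataset_id)
            if dataset_id not in index:
                counts["added"] += 1
            elif index[dataset_id] != dataset["metadata_modified"]:
                counts["updated"] += 1
            else:
                continue  # record is already up to date
            index[dataset_id] = dataset["metadata_modified"]

    # Anything left in the index that CKAN no longer returned is stale.
    for stale_id in set(index) - seen:
        del index[stale_id]
        counts["deleted"] += 1
    return counts
```

Only one page of full datasets is held at a time, but detecting deletions this way still means tracking every seen ID, so it reduces memory pressure rather than removing it. Updating pycsw at harvest time would avoid the full scan altogether.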

kalxas commented 5 years ago

This work has already been done as part of the PublicaMundi EU project (https://github.com/PublicaMundi), in https://github.com/PublicaMundi/ckanext-publicamundi

We are going to port this to the latest CKAN in the next 6 months.

adborden commented 5 years ago

@kalxas awesome, thank you! Let me know if I can help with this effort. Are you planning on adding this to ckanext-spatial?

kalxas commented 5 years ago

No, this is specific to the ckanext-publicamundi work and depends on a custom metadata schema plugin we implemented in order to fully support ISO 19115 in CKAN. Our plan is to release this work as several extensions within 2019.