francescomangiacrapa opened this issue 7 years ago
Hi Francesco,
we have the same problem when harvesting a remote CKAN with about 50K datasets. Our workaround to get the harvesting jobs working again is to clear the whole job history with the paster command "clearsource_history" (without deleting the imported datasets). We start harvesting every two days and the problem occurs after about a week, i.e. after roughly five harvesting runs.
Maybe a solution would be to use the Redis database to store the datasets temporarily, as it was used between the gather and fetch stages before the dataset content started being retrieved already in the gather stage.
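Something along these lines is what I mean, purely as a sketch (the Redis key name, connection settings and helper names below are made up, not the actual ckanext-harvest code):

```python
# Sketch only: push each gathered dataset id to a Redis list instead of
# accumulating everything in a Python list; the fetch side then reads the
# ids back in small batches. Key name and connection settings are made up.
import redis

GATHER_IDS_KEY = 'harvest:gathered_ids:source-a'  # hypothetical key name

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def push_gathered_ids(dataset_ids):
    """Gather stage: queue ids in Redis instead of keeping them in memory."""
    for dataset_id in dataset_ids:
        r.rpush(GATHER_IDS_KEY, dataset_id)

def pop_gathered_ids(batch_size=500):
    """Fetch stage: read the ids back in bounded batches."""
    while True:
        batch = []
        for _ in range(batch_size):
            item = r.lpop(GATHER_IDS_KEY)
            if item is None:
                break
            batch.append(item)
        if not batch:
            return
        yield batch
```

That way the gather stage never has to hold the complete id list for 50K+ datasets in memory at once.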
Cheers, Ralph
After performing "clearsource_history" and starting the harvesting job again, the memory usage of the first run is 2.7 GB on average, which is much smaller than the 12 GB seen before, so the workaround seems to work :+1:
I have set the update frequency to 'Daily' to check when the memory consumption issue occurs next.
However, I think that the best solution would be to use the Redis database to store the dataset ids retrieved by gather_consumer instead of keeping them in memory, and then perform the harvesting stage in phases (i.e. chunking the items with start, limit, offset, etc. read from the DB).
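Just to sketch the chunked approach I have in mind (the remote URL, chunk size and Redis key below are placeholders, not a patch against ckanext-harvest):

```python
# Sketch only: page through the remote catalogue with limit/offset via the
# CKAN action API and queue each chunk of dataset names in Redis, so the
# full ~63K id list is never held in memory at once.
import requests
import redis

REMOTE_CKAN = 'https://remote-ckan.example.org'  # placeholder URL
CHUNK_SIZE = 1000                                # placeholder chunk size

r = redis.StrictRedis(host='localhost', port=6379, db=0)

def gather_in_chunks():
    offset = 0
    while True:
        resp = requests.get(
            '%s/api/3/action/package_list' % REMOTE_CKAN,
            params={'limit': CHUNK_SIZE, 'offset': offset},
            timeout=60,
        )
        resp.raise_for_status()
        names = resp.json()['result']
        if not names:
            break
        # Queue this chunk instead of appending to one big Python list.
        r.rpush('harvest:gathered_ids:source-a', *names)
        offset += CHUNK_SIZE
```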
Do you have any news of improvements in this regard, @seitenbau-govdata?
Thanks again for your response.
Cheers, Francesco
Hi Francesco,
good to hear that the workaround is working for you, too. I think there are two options to improve the code:
Cheers, Ralph
Hi all, I have two CKAN instances (A and B). The first one (A) harvests from the second one (B), which contains 63.4K datasets, so A is filled with the data/datasets harvested from B. On A, if both gather_consumer and fetch_consumer are active (via supervisorctl), the gather_consumer starts every 3 hours and causes heavy memory consumption; in particular I have observed the following:
gather_consumer starts (every 3 hours) and is killed by the system after a few minutes, once it reaches 12 GB of memory usage, and then supervisorctl restarts it. Could this be a misconfiguration in my system? However, checking the code, it seems that the harvest object ids are created and kept in memory, is that correct? If so, when there are very many items/datasets, "out of memory" errors are possible.. :-(
Is there a way/configuration in order to create such harvest object ids on disk instead of in memory? Or another way to fix my issue..
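Just to illustrate what I mean by "on disk instead of in memory" (the file handling and function names below are only an example, not existing harvester code):

```python
# Sketch only: spool the gathered ids to a temporary file during the gather
# stage and read them back lazily, one per line, instead of keeping the
# whole list in memory.
import tempfile

def spool_ids_to_disk(dataset_ids):
    """Write ids to a temp file as they are gathered; return its path."""
    tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.ids', delete=False)
    with tmp:
        for dataset_id in dataset_ids:
            tmp.write(dataset_id + '\n')
    return tmp.name

def iter_ids_from_disk(path):
    """Stream the ids back without loading the whole list into memory."""
    with open(path) as f:
        for line in f:
            yield line.strip()
```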
Thanks a lot for your feedback, Francesco