ckan / ckanext-harvest

Remote harvesting extension for CKAN

gather_consumer creates/reads harvest object ids in memory, causing excessive memory consumption #291

Open francescomangiacrapa opened 7 years ago

francescomangiacrapa commented 7 years ago

Hi all, I have two CKAN instances (call them A and B). A harvests from B, and B contains 63.4K items, so A is filled with data/datasets harvested from B. On A, with both gather_consumer and fetch_consumer active (via supervisorctl), the gather_consumer starts every 3 hours and consumes more and more memory; in particular I've observed the following:

[Thu Mar  9 00:03:27 2017] [26161]  1000 26161   294449    22539   3       0             0 paster
[Thu Mar  9 00:03:27 2017] [24669]  1000 24669  3713516  2918978   3       0             0 paster
[Thu Mar  9 00:03:27 2017] Out of memory: Kill process 24669 (paster) score 885 or sacrifice child
[Thu Mar  9 00:03:27 2017] Killed process 24669 (paster) total-vm:14854064kB, anon-rss:11675912kB, file-rss:0kB
[Thu Mar  9 03:08:24 2017] [26161]  1000 26161   305904    19692   3       0             0 paster
[Thu Mar  9 03:08:24 2017] [ 1856]  1000  1856  3626745  2914998   0       0             0 paster
[Thu Mar  9 03:08:24 2017] [16142]  1000 16142   164180     6791   2       0             0 paster
[Thu Mar  9 03:08:24 2017] Out of memory: Kill process 1856 (paster) score 863 or sacrifice child
[Thu Mar  9 03:08:24 2017] Killed process 1856 (paster) total-vm:14506980kB, anon-rss:11659988kB, file-rss:4kB
[Thu Mar  9 03:08:27 2017] paster: page allocation failure: order:0, mode:0x280da
[Thu Mar  9 03:08:27 2017] Pid: 1856, comm: paster Not tainted 3.2.0-4-amd64 #1 Debian 3.2.51-1
[Thu Mar  9 06:07:25 2017] [26161]  1000 26161   315278    24475   2       0             0 paster
[Thu Mar  9 06:07:25 2017] [16142]  1000 16142   178859    15891   0       0             0 paster
[Thu Mar  9 06:07:25 2017] [16564]  1000 16564  3565745  2840783   2       0             0 paster
[Thu Mar  9 06:07:25 2017] Out of memory: Kill process 16564 (paster) score 848 or sacrifice child
[Thu Mar  9 06:07:25 2017] Killed process 16564 (paster) total-vm:14262980kB, anon-rss:11363112kB, file-rss:20kB
[Thu Mar  9 09:06:48 2017] [26161]  1000 26161   326024    25602   0       0             0 paster
[Thu Mar  9 09:06:48 2017] [30777]  1000 30777  3555701  2828690   0       0             0 paster
[Thu Mar  9 09:06:48 2017] Out of memory: Kill process 30777 (paster) score 845 or sacrifice child
[Thu Mar  9 09:06:48 2017] Killed process 30777 (paster) total-vm:14222804kB, anon-rss:11314760kB, file-rss:0kB

gather_consumer starts (every 3 hours) and is killed by the system within a few minutes, after reaching 12 GB of memory usage, and then supervisorctl restarts it. Could this be a misconfiguration on my system? However, checking the code, it seems that the harvest object ids are created in memory; is that true? If so, "Out of memory" errors are to be expected when there are this many items/datasets. :-(

Is there a way/configuration to create such harvest object ids on disk instead of in memory? Or another way to fix my issue?

Thanks a lot for your feedback, Francesco

seitenbau-govdata commented 7 years ago

Hi Francesco,

we have the same problem when harvesting a remote CKAN with about 50K datasets. Our workaround to get the harvest jobs working again is to clear the whole job history with the paster command "clearsource_history" (without deleting the imported datasets). We start harvesting every two days and the problem occurs after about a week, i.e. after roughly five harvest runs.
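
For reference, this is roughly how the command is invoked; the --plugin flag is the usual way to run an extension's paster commands, but the config file path is an assumption from our setup and may differ in yours:

    paster --plugin=ckanext-harvest harvester clearsource_history -c /etc/ckan/default/production.ini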

Maybe a solution would be to use the Redis database to store the datasets temporarily, as was done between the gather and fetch stages before the change that fetches the dataset content already in the gather stage.

Cheers, Ralph

francescomangiacrapa commented 7 years ago

After performing "clearsource_history" and starting the harvest job again, the memory usage for the first run is 2.7 GB on average, much less than the 12 GB seen before, so the workaround seems to work :+1:
I set the update frequency to 'Daily' to check when the memory problem occurs next. However, I think the best solution would be to use the Redis database to store the dataset ids retrieved by gather_consumer instead of keeping them in memory, and then perform the harvesting stage in phases (i.e. chunking the items with start, limit, offset, etc., read from the DB). Do you have news of any improvement in this regard, @seitenbau-govdata?
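
To make the idea concrete, here is a rough sketch using a plain redis-py client; the key name, chunk size, and the job_id/discovered_ids/process_object names are all hypothetical, not part of ckanext-harvest:

    import redis

    r = redis.StrictRedis(host='localhost', port=6379, db=0)
    # Hypothetical key holding the dataset ids gathered for one harvest job
    key = 'harvest:job:%s:dataset_ids' % job_id

    # Gather stage: push each id as it is discovered instead of
    # accumulating them all in a Python list
    for dataset_id in discovered_ids:
        r.rpush(key, dataset_id)

    # Fetch/import stage: consume the ids in fixed-size chunks
    chunk_size = 500
    while True:
        chunk = r.lrange(key, 0, chunk_size - 1)
        if not chunk:
            break
        r.ltrim(key, chunk_size, -1)  # drop the chunk we just read
        for dataset_id in chunk:
            process_object(dataset_id)  # hypothetical per-dataset work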

Thanks again for your response.

Cheers, Francesco

seitenbau-govdata commented 7 years ago

Hi Francesco,

good to hear that the workaround works for you, too. I think there are two options to improve the code:

  1. Instead of paging through the search results and keeping all dataset dictionaries in memory, the dataset dictionaries could be saved to the database directly in every loop iteration (see the sketch after this list). I know there are some corrections afterwards, e.g. removing duplicates, but I think that could be handled in another way.
  2. The dataset dictionaries from every loop iteration could be stored in the Redis database instead of in memory. But I don't know whether memory consumption would actually improve with Redis, because Redis works in memory as well.
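
A minimal sketch of option 1, assuming a gather stage that pages through the remote search results; HarvestObject is the real model from ckanext.harvest.model, but the _search_page helper and the page size are invented for the example:

    import json

    from ckanext.harvest.model import HarvestObject

    def gather_stage(self, harvest_job):
        object_ids = []
        start = 0
        rows = 100  # assumed page size
        while True:
            # _search_page is a hypothetical helper wrapping the remote
            # package_search call with start/rows paging parameters
            pkg_dicts = self._search_page(start=start, rows=rows)
            if not pkg_dicts:
                break
            for pkg_dict in pkg_dicts:
                # Persist each dataset dict immediately; only the small
                # object ids remain in memory
                obj = HarvestObject(guid=pkg_dict['id'], job=harvest_job,
                                    content=json.dumps(pkg_dict))
                obj.save()
                object_ids.append(obj.id)
            start += rows
        return object_ids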

Cheers, Ralph