bio-guoda / idigbio-spark

processing engine for biodiversity archives

update monitors takes longer than a week #5

Open jhpoelen opened 7 years ago

jhpoelen commented 7 years ago

With 2 copies of GBIF (~600M * 2), 4 copies of iDigBio (~60M * 4), and 6 copies of (~2M * 6), updating the monitors takes over a week. Adding more capacity (see #4) would definitely help with this issue. Time could also be spent on optimizing the algorithms used to calculate differences across ~1.5G occurrence records.
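As a quick sanity check on the ~1.5G figure, the per-source counts above can be added up. This is only a back-of-the-envelope sketch: the counts are the approximate numbers from this comment, and the label for the third source is a placeholder because the comment does not name it.

```python
# Approximate records per copy and number of copies, taken from the comment above.
# The third source is not named in the comment, so its label is a placeholder.
datasets = {
    "GBIF": (600_000_000, 2),
    "iDigBio": (60_000_000, 4),
    "unnamed third source": (2_000_000, 6),
}

# Total occurrence records the monitors have to compare across all copies.
total = sum(records * copies for records, copies in datasets.values())

print(f"{total / 1e9:.2f}G occurrence records")  # prints "1.45G occurrence records"
```

The two GBIF copies account for about 1.2G of the total, so they dominate the workload.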

from http://archive.guoda.bio - screenshot from 2017-05-17 17-34-59

jhpoelen commented 7 years ago

And some more from spark job monitor - screenshot from 2017-05-17 17-37-00

jhpoelen commented 7 years ago

It turns out that /mnt/data was running out of space (97% of 1.8T). I've cleaned out some Kafka logs and made a tiny change to the idigbio-spark job to reduce memory and disk pressure, to keep big jobs (like update monitors) from stalling. @jhammock @mjcollin @godfoder I suggest moving forward on #4 to avoid duplicate maintenance efforts.
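A check along these lines could flag a nearly full volume before a long job stalls. This is a minimal sketch using Python's standard `shutil.disk_usage`, not something taken from idigbio-spark; the function names and the 95% threshold are assumptions chosen for illustration.

```python
import shutil


def used_fraction(path):
    """Return the fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_nearly_full(path, threshold=0.95):
    """True when usage is at or above `threshold` (hypothetical default of 95%)."""
    return used_fraction(path) >= threshold
```

For example, `is_nearly_full("/mnt/data")` could be called before submitting a large job. The value may differ slightly from the percentage that `df` reports, since `df` typically excludes blocks reserved for root from its calculation.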