DataONEorg / d1_synchronization

The CN synchronization service

Sync on production can crash ungracefully #1

Open amoeba opened 3 years ago

amoeba commented 3 years ago

We noticed an out-of-sync state between the production CN and urn:node:ARCTIC the other day and found that the CN thought it was completely in sync when it wasn't. In this particular case, the CN had failed to pick up tens of System Metadata updates from urn:node:ARCTIC that we were expecting to see, and it may have missed many more. I messaged @taojing2002 for help and we found that sync had crashed with an out-of-memory (OOM) error. Our fix was to set the last harvest timestamp back a day and let processing run. My immediate thoughts are below.

We talked about possible next steps on our dev call this week and came up with:

  1. Bump the max heap (-Xmx) on the process. This might not be possible due to limited resources on cn-ucsb-1.
  2. Move sync (and processing?) over to another host with more resources.
  3. We might consider making MNs responsible for auditing (note: Bryce thinks this is not quite the route to go, but it's an idea that came up nonetheless).
  4. In the meantime, before a fix lands, we could consider auditing sync on some of our more active member nodes (ARCTIC, ESS-DIVE, RW).
  5. Set up monitoring on our logs to detect crashes like this.
  6. Work on figuring out the bugs at the top of this post.
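Item 5 could start as something very simple, sketched below. The crash signatures and the idea of scanning the sync log line-by-line are assumptions; the real log location and format on cn-ucsb-1 would need to be confirmed.

```python
import re

# Hypothetical crash signatures to watch for in the sync service log.
CRASH_PATTERNS = [
    re.compile(r"java\.lang\.OutOfMemoryError"),
    re.compile(r"GC overhead limit exceeded"),
]

def find_crash_lines(log_lines):
    """Return (line_number, line) pairs matching a known crash signature."""
    hits = []
    for i, line in enumerate(log_lines, start=1):
        if any(p.search(line) for p in CRASH_PATTERNS):
            hits.append((i, line))
    return hits
```

A cron job could run this over the latest log and alert if it returns anything, which would have caught this crash well before the out-of-sync state was noticed by hand.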

For now, @taojing2002 is going to look into this and coordinate with @datadavev and we can go from there.
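For reference, the manual fix above (rolling the last harvest timestamp back a day so sync re-harvests the missed window) is just date arithmetic on an ISO 8601 timestamp. This sketch shows only that arithmetic; the actual property name and where the CN stores the last-harvest value are not shown here.

```python
from datetime import datetime, timedelta

def roll_back_harvest_timestamp(last_harvest: str, days: int = 1) -> str:
    """Shift an ISO 8601 'Z'-suffixed timestamp back by `days` days.

    Intended for re-running a harvest window after a crash, e.g. the
    one-day rollback applied as the manual fix in this issue.
    """
    dt = datetime.strptime(last_harvest, "%Y-%m-%dT%H:%M:%SZ")
    return (dt - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
```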

[Note: This might be on the wrong repo since I can't see our logs on cn-ucsb-1 to check what actually crashed. Feel free to move.]

nickatnceas commented 3 years ago

Regarding 1, the memory usage graphs show ~40% average memory use on cn-ucsb-1 over the last month with the current 64 GB allocated. We can easily go to 128 GB; higher is also an option, with the max being ~700 GB.

amoeba commented 3 years ago

Thanks @nickatnceas, that's great to know.