prjemian closed this issue 2 years ago
Two days ago (a Monday shutdown day), there were two simultaneous instances running, both using >100% CPU.
Likely, this is due to a large number of scans in one SPEC data file. The current parser will reprocess the entire file if the file has changed. One of the most recent files (2019-07/07_26_Strumendo.dat) has 3690 scans.
This is an additional consideration for issue #23. There is a duplicate issue in spec2nexus.
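Since SPEC data files are append-only, one way to avoid re-parsing all 3690 scans on every change is to remember how far into the file we already read and only parse the appended text. A minimal sketch of that idea (the function and state handling here are illustrative, not how spec2nexus actually works):

```python
# Sketch: track a per-file byte offset so only newly appended scans
# are parsed on each pass (illustrative, not the spec2nexus parser).
import os

_offsets = {}  # filename -> byte offset already processed

def new_scan_headers(filename):
    """Return the '#S' scan header lines appended since the last call."""
    start = _offsets.get(filename, 0)
    size = os.path.getsize(filename)
    if size < start:  # file was truncated or rewritten; start over
        start = 0
    with open(filename) as f:
        f.seek(start)
        text = f.read()
        _offsets[filename] = f.tell()
    return [line for line in text.splitlines() if line.startswith("#S ")]
```

With this, a file that grows by one scan costs one small read instead of a full 3690-scan reparse.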
Observed high CPU load again on 8/6/2020. Checked top + cat /proc/PID/cmdline and found the process was scanning files in 2020-06. I wonder for how long, since I cleaned up the XML file before we started the August run. Maybe it was hanging and running forever?
If it is using the current scan log, it should have been limited to whatever is there - the 2020-08 folder only. Note, we are 6k data sets into the scan log in this August run, slightly more than a week into our 2-week run. More and more data produced...
We should do more to limit the scope now. Too much data is generated. We should look at the scan log AND limit to the last 50-100 entries, maybe even 20. The reason to redo the graphs is to make sure we reliably generate the latest version of all data. The worry was that if processing runs before data collection is finished, we could end up with a partial graph. But in reality we never need to reprocess more than the last few data sets, since all data sets are collected sequentially, and the scan log has time stamps in it... That would reduce the scope significantly and make everything work fine. The only question is how to make sure we do not miss any: if we generated 100 data sets in 5 minutes and processed only the last 50, we could miss half of them, assuming the cron job fires every 5 minutes.
In reality, we rarely collect data faster than 20 seconds per data set. Take the cron interval, multiply by 5, and we have a safe number of scans to reprocess... If we ever miss any, we may need to increase this number a bit. Missing a few of many data sets is NOT critical here.
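The sizing rule above can be written down directly. A sketch, assuming a 5-minute cron interval and the 20 s/scan floor mentioned above (names are illustrative, not from the livedata code):

```python
# Sketch of the lookback sizing described above: a 5x safety window
# over the cron interval, divided by the fastest realistic scan rate.
CRON_INTERVAL_S = 5 * 60      # cron fires every 5 minutes
MIN_SECONDS_PER_SCAN = 20     # we rarely collect faster than this
SAFETY_FACTOR = 5             # the "multiply by 5" margin

def safe_scan_count(cron_interval_s=CRON_INTERVAL_S,
                    min_seconds_per_scan=MIN_SECONDS_PER_SCAN,
                    safety_factor=SAFETY_FACTOR):
    """Number of most-recent scan log entries to reprocess each run."""
    window_s = safety_factor * cron_interval_s
    return window_s // min_seconds_per_scan

print(safe_scan_count())  # 75 entries with the defaults above
```

So with these numbers, reprocessing the last ~75 scan log entries per run would comfortably cover anything collected since the previous pass.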
Present code looks back 1.5 weeks for scans:
https://github.com/APS-USAXS/livedata/blob/deb7480796121bcb6a3b65b794d913c88393bb3e/recent_spec_data_files.py#L19
https://github.com/APS-USAXS/livedata/blob/deb7480796121bcb6a3b65b794d913c88393bb3e/recent_spec_data_files.py#L38
https://github.com/APS-USAXS/livedata/blob/deb7480796121bcb6a3b65b794d913c88393bb3e/recent_spec_data_files.py#L66-L73
This allows the process to fall behind and eventually catch up. When the CPU load is high due to this process, it is a sign that something in the data file is, well, er, wonky.
By your measure above, we could modify RECENT = 30*MINUTE (6 times the cron interval). Still, we need a better diagnostic of why processing is taking longer than expected, and an indication when it does.
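As a first step toward that diagnostic, the processing pass could be timed and a warning logged whenever it overruns the cron interval. A minimal sketch, assuming a 5-minute interval; none of these names are from the livedata code:

```python
# Hypothetical timing wrapper: log how long each processing pass takes
# and warn when it exceeds the cron interval.
import logging
import time

logger = logging.getLogger("livedata")
CRON_INTERVAL_S = 5 * 60  # assumed cron interval

def timed_pass(process_func, *args, **kwargs):
    """Run one processing pass and report if it overruns the cron interval."""
    t0 = time.monotonic()
    result = process_func(*args, **kwargs)
    elapsed = time.monotonic() - t0
    if elapsed > CRON_INTERVAL_S:
        logger.warning("processing pass took %.1f s (> %d s cron interval)",
                       elapsed, CRON_INTERVAL_S)
    else:
        logger.info("processing pass took %.1f s", elapsed)
    return result
```

The warnings would accumulate in the log, so a sustained overrun (like the 2020-06 rescan above) would be visible without watching top.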
Issue #42 might help here.
Note that when this runs and there is a lot of backlog to process, the utilization can reach 170%. We could make this less aggressive by adding time.sleep(interval). But we need to make sure additional specplotsAllScans.py processes do not start.
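One common way to keep cron from starting overlapping instances is a non-blocking advisory lock: the second instance sees the lock is held and exits immediately. A sketch on Linux, with an assumed lock file path (this is not from the livedata code):

```python
# Sketch: prevent overlapping specplotsAllScans.py instances with a
# non-blocking flock on an agreed lock file (path is hypothetical).
import fcntl
import sys

LOCK_FILE = "/tmp/specplotsAllScans.lock"  # assumed location

def acquire_single_instance_lock(path=LOCK_FILE):
    """Return the lock file handle, or None if another instance holds it."""
    fh = open(path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh

if __name__ == "__main__":
    lock = acquire_single_instance_lock()
    if lock is None:
        sys.exit(0)  # another instance is already running; let it finish
    # ... do the processing pass here, with time.sleep() throttling ...
```

The kernel releases the lock automatically if the process dies, so a crashed run cannot leave a stale lock behind.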
resolved in #50
Despite recent efforts (#28, #29), specplotsAllScans.py is consuming more than 100% CPU (as viewed using top). Why? Can this be reduced?