APS-USAXS / livedata

live data from the APS USAXS instrument
https://usaxslive.xray.aps.anl.gov/

specplotsAllScans.py using >100% CPU #34

Closed prjemian closed 2 years ago

prjemian commented 4 years ago

Despite recent efforts (#28, #29), specplotsAllScans.py is consuming more than 100% CPU (as viewed using top). Why? Can this be reduced?

usaxs@usaxscontrol /share1/USAXS_data $ top

top - 07:32:24 up 74 days, 14:46, 26 users,  load average: 3.58, 3.35, 3.01
Tasks: 505 total,   2 running, 502 sleeping,   0 stopped,   1 zombie
%Cpu(s): 36.6 us,  8.6 sy,  0.0 ni, 54.5 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 15967812 total,   791260 free,  5128652 used, 10047900 buff/cache
KiB Swap: 33554428 total, 32772092 free,   782336 used. 10264664 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                
 90948 usaxs     20   0  776320 128260  15244 R 154.8  0.8   4:03.67 python                                                                 
103294 usaxs     20   0 1902500 929920  11836 S 103.7  5.8   4049:12 python                                                                 
 15280 usaxs     20   0 1635668 352556  19664 S  20.9  2.2   4965:13 python                                                                 
 85946 usaxs     20   0  712276  19896   5484 S  18.9  0.1  11730:03 oscillate.py                                                           
 11184 usaxs     20   0   83900  37460   6312 S   8.0  0.2 540:26.02 python                                                                 
 72887 usaxs     20   0 1610176 141532  23860 S   3.7  0.9  38:40.78 caQtDM                                                                 
 41525 usaxs     20   0 1393052 115928  11492 S   3.3  0.7   1731:16 caQtDM                                                                 
...
usaxs@usaxscontrol /share1/USAXS_data $ ps 90948 103294 15280
   PID TTY      STAT   TIME COMMAND
 15280 ?        Sl   4965:22 /APSshare/anaconda/x86_64/bin/python /home/beams/USAXS/Documents/eclipse/USAXS/livedata/pvwatch.py
 90948 ?        Rl     4:27 /APSshare/anaconda/x86_64/bin/python /home/beams/USAXS/Documents/eclipse/USAXS/livedata/specplotsAllScans.py /sh
103294 ?        Sl   4049:17 /APSshare/anaconda/x86_64/bin/python /home/beams/USAXS/Documents/eclipse/USAXS/scanLog/scanLog.py
prjemian commented 4 years ago

Two days ago (a Monday shutdown day), there were two simultaneous instances running, both using >100% CPU.

This is likely due to a large number of scans in one SPEC data file. The current parser reprocesses the entire file whenever the file has changed. One of the most recent files (2019-07/07_26_Strumendo.dat) contains 3690 scans.
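
A minimal sketch of the behaviour described above, with hypothetical names (the real logic lives in specplotsAllScans.py): the file's modification time is the only trigger, so any change causes the whole file, all of its scans, to be parsed again.

```python
import os

# Hypothetical illustration of the reprocessing trigger described above:
# if a SPEC file's modification time changes, every scan in it is
# re-parsed, even if only the last scan is new.
_mtime_cache = {}  # path -> last-seen modification time

def needs_reprocessing(specfile_path):
    """Return True if the file changed since we last looked at it."""
    mtime = os.path.getmtime(specfile_path)
    if _mtime_cache.get(specfile_path) == mtime:
        return False
    _mtime_cache[specfile_path] = mtime
    return True  # whole file (all 3690 scans, say) gets re-parsed
```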

prjemian commented 4 years ago

This is an additional consideration for issue #23. There is a duplicate issue in spec2nexus.

jilavsky commented 3 years ago

Observed high CPU load again on 8/6/2020. Checked top plus cat /proc/PID/cmdline and found the process was scanning files in 2020-06. I wonder for how long, since I cleaned up the xml file before we started the August run. Maybe it was hung and running forever?

If it were using the current scan log, it should have been limited to whatever is there: the 2020-08 folder only. Note, we already have about 6k data sets in the scan log for this August run, slightly more than a week into our 2-week run. More and more data is being produced...

We should do more to limit the scope now; too much data is being generated. We should look at the scan log AND limit processing to the last 50-100 entries, maybe even 20. The reason to redo the graphs is to make sure we reliably generate the latest version of all data. The worry was that if processing runs before data collection has finished, we could end up with a partial graph. But in reality we never need to reprocess more than the last few data sets, since all data sets are collected sequentially, and the scan log has time stamps in it. That would reduce the scope significantly and make everything work fine. The only question is how to make sure we do not miss any: if we generated 100 data sets in 5 minutes and processed only the last 50, we could miss half of them, assuming the cron job fires every 5 minutes.

In reality, we rarely collect data faster than 20 seconds per data set. Take the cron interval, multiply by 5, and we have a safe number of scans to reprocess (see the sketch below). If we still miss any, we may need to increase this number a bit. Missing a few of many data sets here is NOT critical.
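
A sketch of the sizing rule suggested above; the cron interval, minimum time per data set, and function name are assumptions, not values from the project.

```python
# Hypothetical sizing rule: "take the cron interval, multiply by 5",
# then divide by the slowest plausible data-collection rate.
CRON_INTERVAL_S = 5 * 60      # assume the cron job fires every 5 minutes
MIN_SECONDS_PER_SCAN = 20     # data is rarely collected faster than this
SAFETY_FACTOR = 5             # generous margin so nothing is missed

def safe_scan_backlog():
    """Upper bound on the number of recent scans worth reprocessing."""
    return (CRON_INTERVAL_S * SAFETY_FACTOR) // MIN_SECONDS_PER_SCAN

print(safe_scan_backlog())  # 75 scans with these assumed numbers
```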

prjemian commented 3 years ago

Present code looks back 1.5 weeks for scans: https://github.com/APS-USAXS/livedata/blob/deb7480796121bcb6a3b65b794d913c88393bb3e/recent_spec_data_files.py#L19 https://github.com/APS-USAXS/livedata/blob/deb7480796121bcb6a3b65b794d913c88393bb3e/recent_spec_data_files.py#L38 https://github.com/APS-USAXS/livedata/blob/deb7480796121bcb6a3b65b794d913c88393bb3e/recent_spec_data_files.py#L66-L73

This allows the process to fall behind and eventually catch up. When the CPU load is high due to this process, it is a sign that something in the data file is, well, er, wonky.
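
A minimal sketch of that "look back 1.5 weeks" selection, assuming a simple mtime cutoff; the actual implementation is in recent_spec_data_files.py and the names here are illustrative only.

```python
import os
import time

WEEK = 7 * 24 * 60 * 60
RECENT = 1.5 * WEEK  # look-back window, per the links above

def recent_spec_files(data_dir):
    """Yield SPEC .dat files modified within the RECENT window."""
    cutoff = time.time() - RECENT
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if not name.endswith(".dat"):
                continue
            path = os.path.join(root, name)
            if os.path.getmtime(path) >= cutoff:
                yield path
```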

prjemian commented 3 years ago

By your measure above, we could modify RECENT = 30*MINUTE (6 times the cron interval). Still, we need a better diagnostic of why processing is taking longer than expected, and an indication when it does (sketched below).
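
A sketch of such an indication, with hypothetical names and an assumed 5-minute cron schedule: time each processing pass and warn loudly when it overruns the cron interval, so a backlog shows up in the logs rather than only in top.

```python
import logging
import time

MINUTE = 60
RECENT = 30 * MINUTE          # proposed window: 6 times the cron interval
CRON_INTERVAL = 5 * MINUTE    # assumed cron schedule

logger = logging.getLogger("specplotsAllScans")

def timed_processing(process_all_scans):
    """Run one processing pass and warn if it overruns the cron interval."""
    t0 = time.time()
    process_all_scans()
    elapsed = time.time() - t0
    if elapsed > CRON_INTERVAL:
        logger.warning(
            "processing took %.1f s, longer than the %d s cron interval",
            elapsed, CRON_INTERVAL,
        )
```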

prjemian commented 3 years ago

Issue #42 might help here.

prjemian commented 3 years ago

Note that when this runs and there is a lot of backlog to process, the utilization can reach 170%. It could be made less aggressive by adding time.sleep(interval) between scans, but we would also need to make sure additional specplotsAllScans.py processes do not start (sketched below).
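
A sketch (not the project's actual code) of both ideas: a lock file so a second specplotsAllScans.py exits instead of piling on, and a short sleep between scans so a large backlog is worked through less aggressively. The lock-file path and sleep interval are assumptions.

```python
import fcntl
import sys
import time

LOCK_FILE = "/tmp/specplotsAllScans.lock"   # hypothetical path
SLEEP_INTERVAL = 0.1                        # seconds between scans, assumed

def main(scans, plot_one_scan):
    # Hold an exclusive, non-blocking lock for the life of the process;
    # a second instance fails to acquire it and exits immediately.
    lock = open(LOCK_FILE, "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("another specplotsAllScans.py instance is already running")
    for scan in scans:
        plot_one_scan(scan)
        time.sleep(SLEEP_INTERVAL)  # yield the CPU between scans
```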

prjemian commented 2 years ago

resolved in #50