chandra-mta / MTA

Dumps gzip and otg summary #29

Closed william-aaron-CFA closed 7 months ago

william-aaron-CFA commented 10 months ago

This PR addresses two groups of issues found in the Dumps set of processing scripts. The complete details of how these issues were discovered, the proof and testing of what caused them, and the application of solutions can be found in the following two Google documents located in the MTA drive.

To summarize the contents of these Google documents, this PR addresses issues #20, #27, and #28. In essence, the Dumps_mon set of scripts failed to send a warning email to the ACIS team upon failure of the 1CRBT thermal resistor on Sept 28 2023. This failure to send the email was caused by a single preventable issue:

  1. A safemode occurring on Feb 13 2023 kept ACIS offline for 10 days, generating a buildup of Dump files which were left unprocessed. Once regular script running began again by Feb 24 with an influx of new Dump files, the gzip command was called on all 10 days' worth of back data at once and failed to unzip the Dumps files, causing the script to fail to generate the necessary .tl files used by Dumps_mon to send alert emails. When the script did not complete correctly, it failed to remove the Dump files added by later script runs, adding to the large backlog of gzip dump files. Removing these gzip files allowed regular script running to proceed, and a portion of this PR changes the gzip command so that it will no longer be called on too many files at once.
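The batching idea described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names, the default batch size of 20, and the choice to collect failures rather than abort are all assumptions made for the example. Only the standard `gzip -d` invocation is taken as given.

```python
import subprocess


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_gunzip_commands(paths, batch_size=20):
    """Return one `gzip -d` argument list per batch of files (hypothetical helper)."""
    return [["gzip", "-d", *batch] for batch in chunked(sorted(paths), batch_size)]


def gunzip_in_batches(paths, batch_size=20):
    """Run gzip on bounded batches and report failed batches instead of stopping."""
    failed = []
    for cmd in build_gunzip_commands(paths, batch_size):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # cmd[2:] is the list of files in this batch
            failed.append((cmd[2:], result.stderr.strip()))
    return failed
```

With this shape, a backlog of any size is still handled a bounded number of files at a time, and one bad batch does not block the batches after it.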

In addressing this issue, it was found that the OTG move processing section of Dumps was not completing either, because of the following three preventable issues.

  1. As mentioned in the FDB agenda of September 22 2022 (https://occweb.cfa.harvard.edu/occweb/web/fdb_web/Agenda/2022_Agenda/22_Sep_22_Items.html), the method of recording OTG grating motion was switched from telemetry format 4 to format 6. The tl file parsing algorithm deliberately removes any data line not matching FMT4, leading to no moves ever being found. The algorithm has been changed to search for FMT6 data lines instead.
  2. The acorn environment variables set by ascds are overridden in the run_filters_script.py script. When the IPCL_DIR variable specifically is overridden, the tl telemetry files do not contain FMT6-related data lines, because they are not formatted to the DS team's template for acorn, so the OTG moves are again missed. The override of these variables has been removed so that the acorn variables follow the DS team specifications from the .ascrc file.
  3. The regular script run of run_filters_script.py expects to run acorn to extract SIM data. This processing appears to have been moved to the script set located in /data/mta/Script/SIM; however, it was not entirely removed from this Dumps set of scripts. The algorithm has been changed to process a variable list of MSID categories covering both the OTG moves and the CCDM category of files fed to the Dumps_mon set of scripts.
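To illustrate item 2, here is a small sketch of how a wrapper script can pass the inherited environment to a child process while refusing to override specific variables. The helper name `build_acorn_env` and the `protected` parameter are hypothetical and are not part of run_filters_script.py. The only detail taken from the description above is that IPCL_DIR should keep the value set by ascds.

```python
import os


def build_acorn_env(base_env=None, overrides=None, protected=("IPCL_DIR",)):
    """Copy the inherited environment, applying overrides except for protected names."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in (overrides or {}).items():
        if name in protected:
            # Keep the value inherited from the login (ascds) environment.
            continue
        env[name] = value
    return env
```

The resulting dictionary can be handed to `subprocess.run(..., env=build_acorn_env())`, so the child process sees the same IPCL_DIR as an interactive ascds session.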

This change has been tested in the following locations. A demonstration of the differences caused by running acorn with the IPCL_DIR variable overwritten can be found in /data/mta/TEST/waaron/acorn_FMT6_failure. A demonstration of the full script changes can be found in /data/mta/TEST/waaron/test_Dumps.

Running the test_Dumps test yourself requires the following setup.

  1. ssh to mta@c3po-v in order to have viewing access to
  2. Copy the processed_list from /data/mta/Script/Dumps/Scripts/house_keeping to /data/mta/TEST/waaron/test_Dumps/Scripts/house_keeping, editing the last line to remove the most recently processed Dumps file.
  3. Run the run_sed shell script located in the Script directory to change the relevant paths in the running scripts to the test directory paths.
  4. Run the following command, which mimics the cronjob call: (cd /data/mta/TEST/waaron/test_Dumps/Exc/; /data/mta/TEST/waaron/test_Dumps/Scripts/run_otg_wrap_script)
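Step 2 of the setup can also be scripted rather than edited by hand. The sketch below assumes that processed_list holds one entry per line with the most recent entry last. That layout is an assumption, and the function name is hypothetical.

```python
from pathlib import Path


def copy_without_last_entry(src, dst):
    """Copy a line-based list file to dst, dropping its last non-blank line.

    Returns the dropped entry, or None if the source had no entries.
    """
    lines = Path(src).read_text().splitlines()
    # Ignore trailing blank lines so the real last entry is the one removed.
    while lines and not lines[-1].strip():
        lines.pop()
    removed = lines.pop() if lines else None
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text("".join(line + "\n" for line in lines))
    return removed
```

Called with the two house_keeping paths from step 2, this leaves the production list untouched and returns the entry that the test run should then reprocess.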