cmsdaq / DAQExpert

New expert system processing data model produced by DAQAggregator

Temperature checks #28

Open gladky opened 7 years ago

gladky commented 7 years ago

From the DAQDottoressa temperature monitor: tail -3 /var/log/danger_shutdown.sh.log | grep -i Temperature

  1. Is this the preferred way of obtaining the temperature? Is there an API?
  2. Are the thresholds up to date? 21 - normal, 25 - warning, 27 - warning 2, 30 - shutdown (based on the code and messages from DAQDottoressa).
  3. What temperature is it exactly? What exactly is "everything" that shuts down above 30 degrees (quote 2)?
  4. Is shutting down the action of DAQDottoressa? ("...At 30 degree I shut down..." from quote 1)
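To make question 2 concrete, here is a minimal sketch of how the thresholds listed above could be mapped to levels. The function name `classify_temperature` and the level names are hypothetical, and treating each threshold as an inclusive lower bound is an assumption; the actual DAQDottoressa logic may differ.

```shell
# Hypothetical helper: map a temperature reading (in degrees) to a level name.
# Thresholds are the ones quoted in question 2 (25 / 27 / 30); whether they are
# still current is exactly what this issue asks.
# Assumption: each threshold is an inclusive lower bound.
classify_temperature() {
    awk -v t="$1" 'BEGIN {
        t = t + 0                      # force numeric comparison
        if (t >= 30)      print "shutdown"
        else if (t >= 27) print "warning2"
        else if (t >= 25) print "warning"
        else              print "normal"
    }'
}
```

For example, `classify_temperature 26` would print `warning` under these assumptions.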

Quote 1:

===> BIIIG TROUBLE AHEAD <===. It's getting hot here. Temperature is now $temp. At 30 degree I shut down...and I take the entire computer cluster with me:-(\n(For your info: a normal temperature is 21 degree.)

Quote 2:

===> This is the end, my friend... The computer cluster is being cooked ($temp degree). \nVery soon everything shuts down. Lean back and relax... there is nothing you can do... sad sad world...\nIt is up to you to decide if you want to inform your control room friends now, or if you leave them some more happy minutes \nuntil they realize. Once the cluster really goes down please call the DAQ expert (76600) also during night time.\n(...I hope that at least your phone is heat resistant...)\n

andreh12 commented 7 years ago

4. Shutting down is not performed by the DAQDottoressa (if I am not mistaken) but rather by our danger_shutdown service, which is installed on every machine.

3a. Looking at the shell function GetTemperature() in /usr/local/bin/temperature_lib.sh on a dvbu machine, one can see that the temperature sensor names (e.g. FCB Ambient1 or BB Inlet Temp) depend on the type of server, so a generic expression like 'server temperature' probably describes it best.

3b. 'everything' in this case is defined as 'everything with the danger_shutdown installed and properly configured'. After some more reading, however, I noticed that there is no ssh involved, so the message of the DAQDottoressa refers only to the host it is running on (i.e. it only parses its own log). It warns the shifter that the DAQDottoressa may be stopped soon if the temperature-watching service triggers a shutdown.

andreh12 commented 7 years ago

Looking at the code of the original DAQDottoressa, it was only checking the temperature of the machine where the DAQDottoressa was running, not of others.

Our virtual machines (such as the one the DAQExpert is running on) do not have the file /var/log/danger_shutdown.sh.log (the daemon that shuts down the machine when the temperature is too high should run on the host system, not on the virtual machine), so currently there is no obvious way to implement this check in the DAQExpert.
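For illustration, a small sketch of why the original one-liner cannot simply be reused on a virtual machine: it wraps the `tail | grep` command from the first comment in a check for the log file. The function name `read_temperature_lines` and the error message are hypothetical; only the log path and the pipeline come from this thread.

```shell
# Hypothetical wrapper around the DAQDottoressa one-liner.
# Prints the matching log lines, or fails if the log is not readable
# (as is the case on the virtual machines, where the file does not exist).
read_temperature_lines() {
    log="${1:-/var/log/danger_shutdown.sh.log}"
    if [ ! -r "$log" ]; then
        echo "temperature log not available: $log" >&2
        return 1
    fi
    # Same pipeline as in the first comment; exit status is grep's.
    tail -3 "$log" | grep -i Temperature
}
```

On a host without the file, the function returns a non-zero status instead of silently producing no output.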

I would vote to close this issue.