LazeMSS / OctoPrint-TopTemp

Topbar temperature plugin for OctoPrint
21 stars 2 forks source link

BUG: Sensor error/offline requires restart of octoprint #93

Open puterboy opened 1 year ago

puterboy commented 1 year ago

I had a sensor offline for a while so that the custom command returned nothing (null string) When the sensor came back online, it still had the display frozen at the last good value and the graph showing no data with the x-axis labeled as all zeros and the avg/min/max showing Null.

Refreshing the web browser didn't help. Going into the settings for TopTemp and running 'test' showed that the sensor command was working again... just somehow the lack of good data had frozen the display and borked the graph history.

The only way I could fix it was by restarting octoprint (which is obviously non-ideal for many reasons)

The lack of sensor data (or even the presentation of 'bad' sensor data) should be handled gracefully... :)

puterboy commented 1 year ago

Actually, it seems like even one bad sample can trigger this...

puterboy commented 1 year ago

Note my ideal behavior in case of invalid or absent sensor data would be the following:

  1. Sensor displays NA in the ribbon
  2. Histogram skips the data point
LazeMSS commented 1 year ago

This I can not recreate - tried using a file for input and deleting it - basically it just skips the data Also tried creating a file returning junk - no problems Also tried creating a command that would cause a timeout.

Could you explain in details how to recreate - what data does the sensor return when failing?

puterboy commented 1 year ago

It seems to break if the command returns nothing.

Until that is fixed, try this to replicate:

bash -c "if [ -e /tmp/crap ]; then /bin/true; else date +%M; fi"

It should give a graph cycling from 0 to 60 (corresponding to the time in minutes)

Then after a few samples, type "touch /tmp/crap" in a shell (which simulates the sensor returning nothing) The displayed data is frozen at the last valid minute (as expected perhaps, though I would argue, the display should be NA). And then the histogram goes to all zeros labels on the X-axis and avg/min/max all equal to Null which.

Then after a few minutes type "rm /tmp/crap" in a shell (which will simulate sensor returning to normal). However, the above error state still persists! Even worse, it seems to require a restart of Octoprint to fix.

I believe this is related to my latest comment re-opening https://github.com/LazeMSS/OctoPrint-TopTemp/issues/88 i.e., having a command that returns nothing (e.g., /bin/true) causes the sampling to hang forever.

puterboy commented 1 year ago

Were you able to replicate with the above explanation?

LazeMSS commented 1 year ago

Have not tried yet.

LazeMSS commented 1 year ago

Fixed in: https://github.com/LazeMSS/OctoPrint-TopTemp/releases/tag/0.0.2.3

puterboy commented 1 year ago

I am still getting the problem whereby if a sensor is offline (and thereby returns an error) that the graph fails even when the sensor resumes... And that the problem can only be reset by restarting Octoprint.

i.e., still get a blank graph with NA for avg/min/max..

The correct behavior would be just to "skip" the time series points without valid data -- i.e. don't display the value (just display NA) or the graph (blank or gap in the graph) for those points and don't include them in the min/max/average.

puterboy commented 1 year ago

@LazeMSS - any ideas why this is still occuring?

LazeMSS commented 1 year ago

I'm on holiday will look later

puterboy commented 1 year ago

Enjoy your vacation!!!! I'm sure it's well-deserved. LazeMSS wrote at about 21:06:14 -0800 on Sunday, January 29, 2023:

I'm on holiday will look later

-- Reply to this email directly or view it on GitHub: https://github.com/LazeMSS/OctoPrint-TopTemp/issues/93#issuecomment-1408004528 You are receiving this because you authored the thread.

Message ID: @.***>

LazeMSS commented 1 year ago

I still can't recreate this.

puterboy commented 1 year ago

OK here is a command you can use to recreate the problem:

bash -c '(DATE=$(date +%s); if [ $DATE -lt TSTART-o $DATE -gt TEND]; then expr \( $DATE / 10 \) % 100; else sleep 100000 && echo 10; fi)'

where for TSTART substitute a value say a 100 seconds after the current epoch time (date +%s) and for TEND substitute a time say 200 seconds after the current epoch time. Set the sampling frequency to say 10 seconds.

puterboy commented 1 year ago

Note that fixing the problem in #88 doesn't fix this problem, Specifically, after fixing #88, the above command expression works under 'test' (whether before TSTART, between TSTART and TEND or after TEND) -- i.e. it returns the appropriate response (or timesout) when run manually as a 'test'.

However, when running as a sensor it displays updated sample data until TSTART, then it presumably timesout between TSTART and TEND with no new data -- so it is left displaying the last valid data before TSTART. However, after TEND, it doesn't resume displaying data. i.e. it is "stuck" with the last value before TSTART.

I wonder whether the problem occurs in the function 'runCustomMon' where a timeout would mean that self.handleCustomData is not called leaving potentially a 'hole' in the data time series which causes the display to fail/stop? Or perhaps the timeout occurs after the next time slice is called causing it to get confused?

Either way, this should allow you to reproduce the problem... and it is important to fix it since it's a very real case where the sensor may temporarily time out (or give invalid data).

Let me know if the above makes sense or you need any other info to reproduce.

Thanks

puterboy commented 1 year ago

Anything I can help do to fix the problem?