fgx / crossfeed-dailies

Daily summaries of FlightGear Multiplayer flights
https://fgx.github.io/crossfeed-dailies/
1 stars 0 forks source link

Seems supervisor did not restart the cfjsonlog.pl script! #10

Closed geoffmcl closed 7 years ago

geoffmcl commented 7 years ago

Hi Pete,

The cflog script was running good, under supervisor, for a few days, so I stopped monitoring it for a few days... and BANG ;=))

It stopped on 2016-08-21, and for some reason supervisor did not restart the script ????????

When I entered $ sudo supervisorctl it showed an error line -

cflog          FATAL    Exited too quickly (process log may have details)

Anyway, I restarted cflog, and it is now outputting flights-2016-08-24.csv ... over the coming days I will fill in the missing logs from my RPI2 collection, so no problem...

But just a question! Why did supervisor not restart the cflog job?

The super config $ /etc/supervisor/conf.d/cflog.conf already contains -

autostart=true
autorestart=true

Is there anything else we need to do to make supervisor restart the script if it fails, for any reason...

I note the startretries=3. If super restarts the script too quickly, maybe the json fetch problem, the only known reason it stops, has not yet recovered...

Is there are way to config super to wait say 20 to 25 seconds before the restart... then 3 retries over the next minute or so might make sense...

But am concerned that if super restarts the script virtually immediately, then it may be easy to get 3 similar fetch errors within the first few seconds...

Will keep monitoring the situation...

Regards, Geoff.

pedromorgan commented 7 years ago

Have a look at the stopwaitsecs and startsecs... here http://supervisord.org/configuration.html#program-x-section-settings

Not sure what to do as am not familar with the stop signals for perl..

Also shall I create the supervisor log in /home/geoff/log/* ?? or alike

geoffmcl commented 7 years ago

Hi Pete,

Thanks for the hints, but...

stopwaitsecs, default 10, seems to be more about when supervisor, aka SV, wants to stop a process. How long to wait for exit. That does not seem a problem here...

startsecs, default 10, seems to be a measure of how long the process should run before it is considered running ok... so to me the default of 10 secs is ok...

No, this is about when SV detects the process has stopped, and when to restart it...

As stated, if there is a tiny glitch in the json http fetch, it is very likely that if SV restarts the script immediately, that the problem will still be there...

BUT, if SV waits 2-5 secs before restarting the script, then there is a good chance the very momentary json fetch problem will have cleared itself... that's all...

Is there a way to say to SV - take it easy kid! The process stopped. Leave it dead for 2-5 secs before restarting it... Seems like a reasonable option... but can not find it...

If there is a temporary problem, do not try to start the process immediately... wait some time for the problem to resolve itself... then restart the process...

I do not know how quickly SV detects a process down!

But is SV is very attentive... maybe too attentive... and restarts the process more that 3 times within the next second, then I can see this can be a problem!

As you know, the json string in crossfeed is only updated each seconds... If a json fetch fails, then trying again, several times, before at least 1, or more, seconds have passed, will probably mean more failures... calm down, and wait a little, and all may, will be well...

At some point, I will get around to moving this logic into the script... If a json fetch fails... do not panic, and exit... sleep/wait 5 seconds... and calmly read again... And let this happen say 3 to 10 times before actually bailing...

Yes, I have created a /home/geoff/log folder you can write a log to, but not sure this is going to solve the above maybe SV is too fast, too anxious problem...

And I too am not famiar with the stop signals for perl, except I would believe they are very similar across all script languages - perl, python, etc, etc... But really not sure how this is relevant...

As stated, I will get around to adding this calmness to the script itself... some day soon... actually waiting the the next RPI2 acquisition to break, to exactly narrow this down... and work on it...

All it seems I want at the moment is for SV to take a little time off ;=)) calm down... take a sip of red wine... a short nap... and then autorestart the process...

Regards, Geoff.

pedromorgan commented 7 years ago

Supervisor probably does not do what we want.. It designed to keep processes running, ie zero downtime

pedromorgan commented 7 years ago

Why is the script crashing anyway.. ?

pedromorgan commented 7 years ago

Another quick hack would be to launch from a bash.script and make is sleep for X seconds before kicking off the pl script

geoffmcl commented 7 years ago

Hi Pete,

Ok, understand supervisor is a tool to keep maximum up-time, and it seems to do a good job of that... so no problem there...

Yes, I too thought of the quick hack to run the pl from a bash script that first slept for a few seconds...

But as stated I am waiting for this to happen again on RPI2... I think I know which pl line to fix, but just want to be sure... and if I am right, should be an easy fix...

See lines - https://github.com/geoffmcl/scripts/blob/master/cfjsonlog.pl#L346-L348

That service fetch_url() can return an undefined, or some truncated, perhaps not valid json string and it is in the json decode that bombs...

I could add some checks here, and if not a flights.json string, just sleep the script for a few seconds, and quietly try again... for say at least the next minute or so...

But, as usual, it has been running several days/weeks without problem... and I am sure when it breaks in RPI2, it will probably break in m2 at the same time...

I guess it may go back to a fault in crossfeed (cf_client) code, on the rare occasion, there are no flights to report, due to the cf packet filtering...

Maybe at that moment cf forms some bad json, or 'nothing', or something like that... need also to look deeper into this...

See service Write_JSON(), from line 532 - https://gitlab.com/fgtools/crossfeed/blob/master/src/cf_pilot.cxx#L532

This service establishes the next json string to return... note it uses pvList->size(), but skips expired pilots... if this is zero, at that time what exactly does the http json GET request return?

As stated it seems quite a rare event... but am on its tail ;=))

And as I now feel Mr SuperVisor is doing his job, will close this issue... ;=))

Thanks, Geoff.

geoffmcl commented 7 years ago

Ok, it broke again in RPI2. exactly where predicted... BUT, not in m2!!!!

Maybe the auto restart 1, 2, ... were already in the next crossfeed 1 second cycle...

Made a kludge fix in the pl script, and that is to wait 5 seconds before doing the first fetch... there really is no big rush here... on start up...

Added some more precise timing... just debug stuff...

This is a quick fix, while I look deeper into avoiding it totally ;=))

Also decided to cycle the log to .1, .2, ... on restart, to be able to look back... but this presents a log cleanup to be done later...

TODO: Get the CSV files synced... TODO: Add a log cycling system

On to the next steps...

pedromorgan commented 7 years ago

had a quick look at script... and seems strange to be that

pedromorgan commented 7 years ago

@geoffmcl adding issues on crossfeed-logger, re analysis..

Issue #1 is your TODO ie CSV files synced https://github.com/freeflightsim/crossfeed-logger/issues/1

pedromorgan commented 7 years ago

WHAW... I knew I been here before..

https://github.com/freeflightsim/fgxgobot

geoffmcl commented 7 years ago

@pedromorgan yes, I certainly could add some error return to fetch_url, and maybe also to the JSON perl package decode, but haven't yet...

This did start out just as a WIP test to get an updated MODEL use, but has grown since then...

And as mentioned still a TODO to sync the raw CSV files... presently missing 22 and 23, and 24 is truncated... will get to this, hopefully soon ;=))

And I had not forgotten your experimental FG Multi Player Network Bot written in Go WIP of a few years back... But personally, have no interest in replacing FGMS and/or crossfeed C++ programs with Go ;=))