MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

sr_watch fails with high sample volumes in flow test #182

Open benlapETS opened 5 years ago

benlapETS commented 5 years ago

SUMMARY

3 cases identified so far:

With the changes made so far (1. fixed rm cron jobs, 2. reject enabled in no_trouble, 3. walk, 4. stop_instances, 5. fs.inotify max raised):

Reverting changes of this issue (but keeping fs.inotify max up):

Too many things go wrong. I will most likely abandon the fixes I provided for this issue and restart from scratch, focusing on the 3 failures I've identified so far, then provide test code to reproduce them reliably and a fix with the least impact on the code. The flow test is a good integration test for checking that everything is OK overall, but it is not a good reference for testing particular issues, because it does not always reproduce the context of a given issue. This is why I need to redo the whole thing: understand the problem and the expectations, write tests with expected outcomes, and write code or config directives to meet those expectations. I identified a flaw of the flow test in issue #206.

benlapETS commented 5 years ago

Also caught this warning on a normal run, which was also mentioned in #179:

 1 sr_watch_f40_01.log [WARNING] sr_watch priming watch: /local/ben/sarra_devdocroot/downloaded_by_sub_t failed, deferred.

I'm not sure if it may be related.

Note 2: I don't think it is related to the bug, but it was easy to fix: we were declaring self.inl as a list while trying to use it as a dict. I don't know what it will change.
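
As a minimal sketch of that mix-up (only the attribute name self.inl comes from the comment above; the surrounding class and key names are hypothetical):

    class WatchState:
        def __init__(self):
            # before: self.inl = []   (declared as a list, so dict-style use fails at runtime)
            self.inl = {}              # after: declared as the dict it is actually used as

        def remember(self, key, path):
            self.inl[key] = path       # dict-style assignment, as the existing code expects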

benlapETS commented 5 years ago

Another try, with a sample of 10 000, gave me a lot of errors, and I draw a parallel with this case because of this:

2019-05-01 11:25:21,578 [ERROR] Traceback (most recent call last):
2019-05-01 11:25:21,578 [ERROR]   File "/local/ben/anaconda3/envs/base_sarra/bin/sr_watch", line 11, in <module>
2019-05-01 11:25:21,578 [ERROR] load_entry_point('metpx-sarracenia', 'console_scripts', 'sr_watch')()
2019-05-01 11:25:21,578 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_watch.py", line 65, in main
2019-05-01 11:25:21,947 [ERROR] watch.exec_action(action,old)
2019-05-01 11:25:21,947 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_instances.py", line 301, in exec_action
2019-05-01 11:25:23,547 [ERROR] elif action == 'start'      : self.start_parent()
2019-05-01 11:25:23,547 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_instances.py", line 785, in start_parent
2019-05-01 11:25:23,547 [ERROR] self.start()
2019-05-01 11:25:23,547 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 1459, in start
2019-05-01 11:25:23,548 [ERROR] self.run()
2019-05-01 11:25:23,548 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 1446, in run
2019-05-01 11:25:23,548 [ERROR] if self.sleep > 0: self.watch_loop()
2019-05-01 11:25:23,548 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 1264, in watch_loop
2019-05-01 11:25:23,549 [ERROR] self.wakeup()
2019-05-01 11:25:23,549 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 1141, in wakeup
2019-05-01 11:25:23,550 [ERROR] done = self.process_event( event, src, dst )
2019-05-01 11:25:23,550 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 1030, in process_event
2019-05-01 11:25:23,550 [ERROR] lstat = os.stat(src)
2019-05-01 11:25:23,550 [ERROR] FileNotFoundError
2019-05-01 11:25:23,550 [ERROR] :
2019-05-01 11:25:23,550 [ERROR] [Errno 2] No such file or directory: '/local/ben/sarra_devdocroot/downloaded_by_sub_t/20190501/WXO-DD/hydrometric/csv/NL/hourly/NL_03OE013_hourly_hydrometric.csv.tmp'

I think I begin to sense the problem here: we first check if the file exists (which it does, so the check passes), and then the file gets deleted before we get to os.stat. There is a race condition here (and surely elsewhere in sr_watch); it might be the same problem as the base issue. I believe it would be safer to try/except in those cases, log the error temporarily, and see if we can find a broader solution that would prevent this from happening.
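
As a sketch of the try/except approach suggested here (the function and logger names are illustrative, not the actual sr_post code):

    import os

    def stat_if_present(src, logger):
        # tolerate the file disappearing between the inotify event and os.stat()
        try:
            return os.stat(src)
        except FileNotFoundError:
            # another process (e.g. a cleanup job) removed the file first
            logger.warning("file vanished before stat, skipping: %s" % src)
            return None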

It would be smart to draw the timeline of those many concurrent pipes in detail, to identify the risks of those race condition situations, before we try to fix this for real, because:

petersilva commented 5 years ago

Go look in sarra/examples/*.inc and re-enable the rejects of moving targets. There is a whole series of them. These are files which keep the same name and are constantly overwritten; the clean jobs are broken in that they delete old versions of the files even though new ones are still flowing. I thought we didn't need them anymore, but apparently we do. See if the runs are now clean.
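
For illustration, an entry of that kind might look like the following; the reject directive is the one used in sarracenia configs, but the pattern itself is made up, the real entries are in sarra/examples/*.inc:

    # hypothetical reject for a "moving target": a file that keeps the same
    # name and is constantly overwritten by newer versions
    reject .*/observations/latest_.*\.csv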

Another thing to try later is to just exclude the moving targets from the pclean processes, but first let's make sure we have caught the problem with the above change.

petersilva commented 5 years ago
2019-05-02 15:38:38,535 [INFO] sr_watch f40 01 starting
Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007faa9536d740 (most recent call first):
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/handlers.py", line 75 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  File "/usr/lib/python3.6/logging/__init__.py", line 917 in handleError
  File "/usr/lib/python3.6/logging/__init__.py", line 1000 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 1072 in emit
  File "/usr/lib/python3.6/logging/handlers.py", line 73 in emit
  File "/usr/lib/python3.6/logging/__init__.py", line 865 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1516 in callHandlers
  File "/usr/lib/python3.6/logging/__init__.py", line 1454 in handle
  File "/usr/lib/python3.6/logging/__init__.py", line 1444 in _log
  File "/usr/lib/python3.6/logging/__init__.py", line 1374 in log
  File "/usr/lib/python3/dist-packages/sarra/sr_config.py", line 85 in write
  ...
sanity check: log file /local/home/peter/.cache/sarra/log/sr_cpost_veille_f34_01.log untouched in 598 seconds (too long.)
parent exiting normally, rest is upto the child pid: 29250
config pelle_dd2_f05 is stopped.
petersilva commented 5 years ago

The previous entry was printed when I started flow_check, then killed sr_watch in order to run sr sanity and confirm it got restarted. Some time later the crash message above showed up. Not sure of its relevance.

benlapETS commented 5 years ago

I'm not sure either that it is relevant to this issue: did you wait for high-volume samples (10 000? 100 000?) before that happened?

Is it a normal thing to kill sr_watch in the middle of a run?

The error pointed out looks like a bad recursion caused by the StreamToLogging redirection implemented for the Windows issue (#112?).

Update: I opened a new bug report #202
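
To illustrate the recursion risk (this is a hypothetical stand-in for the StreamToLogging class, not the sarracenia code; the re-entrancy flag is only one assumed way to break the cycle):

    import logging
    import sys

    class StreamToLogger:
        # Anything written to this "stream" is forwarded to a logger. If a
        # handler error path (logging's handleError) then writes back to
        # sys.stderr, and sys.stderr is this object, write() is re-entered
        # and the recursion only ends with a stack overflow, as in the
        # traceback above. The _emitting flag breaks that cycle.
        def __init__(self, logger, level=logging.ERROR):
            self.logger = logger
            self.level = level
            self._emitting = False

        def write(self, message):
            if self._emitting:            # already inside a logging call: drop it
                return
            self._emitting = True
            try:
                if message.strip():
                    self.logger.log(self.level, message.strip())
            finally:
                self._emitting = False

        def flush(self):
            pass

    logger = logging.getLogger("sr_watch_demo")
    logger.addHandler(logging.StreamHandler())    # handler keeps the original stderr
    sys.stderr = StreamToLogger(logger)           # later writes to stderr go to the logger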

petersilva commented 5 years ago

The thing was running for a few hours (because I forgot about it) after it had been restarted; that's why I copy/pasted it, thinking it is related to the long run. The weird thing is that it printed out to the tty where I ran sr sanity, so the logging not being quite redirected is still dogging us.

petersilva commented 5 years ago

I still think the first thing to do is to put the rejects of the moving targets back into the include files, and see if things clean up a lot.

benlapETS commented 5 years ago

For the KeyError case, the root cause seems to be the double stop signal that is sent to sr_watch:

sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,221 [INFO] signal stop (SIGTERM)
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,722 [INFO] signal stop (SIGTERM)

The first stop goes to the lock I purposely put in the close sequence for testing:

sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,730 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_instances.py", line 1011, in stop_signal
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,731 [ERROR] self.stop()
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,731 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 1468, in stop
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,732 [ERROR] self.close()
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,732 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 159, in close
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,732 [ERROR] self.observer.join()
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,732 [ERROR]   File "/local/ben/anaconda3/envs/base_sarra/lib/python3.7/threading.py", line 1032, in join
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,734 [ERROR] self._wait_for_tstate_lock()
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,734 [ERROR]   File "/local/ben/anaconda3/envs/base_sarra/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,734 [ERROR] elif lock.acquire(block, timeout):

The second stop signal failed because the watch had already been handled by the first, but the list obs_watched still contains the keys:

sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,734 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_instances.py", line 1011, in stop_signal
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,735 [ERROR] self.stop()
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,735 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 1468, in stop
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,736 [ERROR] self.close()
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,736 [ERROR]   File "/local/ben/git/sarracenia/sarra/sr_post.py", line 157, in close
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,736 [ERROR] self.observer.unschedule(ow)
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,736 [ERROR]   File "/local/ben/anaconda3/envs/base_sarra/lib/python3.7/site-packages/watchdog/observers/api.py", line 343, in unschedule
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,737 [ERROR] emitter = self._emitter_for_watch[watch]
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,737 [ERROR] KeyError
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,737 [ERROR] :
sr_watch_f40_01.log.2019-05-03:2019-05-03 18:52:20,737 [ERROR] <ObservedWatch: path=/local/ben/sarra_devdocroot/downloaded_by_sub_t, is_recursive=True>
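
As a sketch of a close() made tolerant to a second stop signal (the names observer, obs_watched and logger follow the traceback above, but the guard itself is only an illustration, not the fix that was merged):

    def close(self):
        for ow in list(self.obs_watched):
            try:
                self.observer.unschedule(ow)
            except KeyError:
                # a previous SIGTERM already unscheduled this watch
                self.logger.debug("watch already unscheduled: %s" % ow)
            self.obs_watched.remove(ow)       # so a second pass finds nothing left to do
        if self.observer.is_alive():
            self.observer.stop()
            self.observer.join()
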
benlapETS commented 5 years ago

Finally, I think I was off track (and negligent): the real root cause was sarra_devdocroot being deleted every 40 minutes by a forgotten cron job. Should we prevent a crash if a user accidentally deletes (not as dumbly as me) their watched folder?

petersilva commented 5 years ago

Google chaos monkey. You're ahead of the game!

benlapETS commented 5 years ago

I got rid of my chaos monkey noise and it still failed. Here is a view of what is going on:

Michel's intuition is that the stop occurred in the child, which doesn't share the observer of the parent, in which the watch exists.

Then I need to make a new run (a 2-hour run) to verify that theory and try to explain those fishy behaviours. I will add the pid to the logging and run sr_watch in debug.

petersilva commented 5 years ago

f40 is explicitly set with heartbeat 60 and hb_memory_multiplier 1.02 to make it very likely to have an automatically triggered restart. Ensuring the restart works would be a good thing. I think if you look at the stop logic, it might be just that it looks, sees it hasn't gone away, and then loops trying to kill it... (sr_instances/stop_instance)...
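
For reference, a config excerpt of the kind described might look like this (the two option values come from the comment above; the file itself is hypothetical):

    # hypothetical excerpt of an f40 configuration
    # run heartbeat processing every 60 seconds
    heartbeat 60
    # restart once memory use exceeds 1.02 x the baseline measured at startup
    hb_memory_multiplier 1.02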

benlapETS commented 5 years ago

Here is a summary of what I went through last week, following the questions that I was asking myself:

I think I am close to providing a fix, but still without a full understanding of the real problem, so I am still being very careful.

Here are 2 useful commands to filter the logs related to this issue:

Command to count the different errors and warnings:
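
The commands themselves were not captured in this thread; as a stand-in, one way to get a similar count in Python (the log directory is assumed to be the default ~/.cache/sarra/log):

    import collections
    import glob
    import os
    import re

    counts = collections.Counter()
    for logfile in glob.glob(os.path.expanduser("~/.cache/sarra/log/*.log")):
        with open(logfile, errors="replace") as f:
            for line in f:
                m = re.search(r"\[(ERROR|WARNING)\] (.*)", line)
                if m:
                    # keep only the first few words so similar messages group together
                    counts[(m.group(1), " ".join(m.group(2).split()[:4]))] += 1

    for (level, message), n in counts.most_common(20):
        print("%6d %-7s %s" % (n, level, message))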

benlapETS commented 5 years ago

I will investigate this tomorrow: why are there always 3 children of sr_watch like that? (screenshot attached)

Note (May 14): I've seen that the 3 children under sr_watch are managed by watchdog, so for now it seems a bit out of my scope, as I have effectively corrected all the crashes that were occurring. If we ever see race condition problems occurring in watch because of that, we will know where to point...

benlapETS commented 5 years ago

This diagram is a view of the hb_memory restart trigger sequence in sr_watch (which would be similar for the other sr components, and the same for sr_post). It is INCOMPLETE for now. (diagram attached)

benlapETS commented 5 years ago

Here is a view of how SIGTERM may be managed (still an INCOMPLETE diagram). (diagram attached)

benlapETS commented 5 years ago

Note 1: New case identified: hb_memory restarts watch because the memory limit is reached, but at restart it reads the post_total_save .vars file, which is empty, so it cannot recover anything. I added a check in the plugin that prevents that and logs the error for now; this will not really fix it.

Note 2: The cases in this issue are now handled properly. I will try to merge the fixes from this branch. I need to wait for the result of the pull request tests, while also running some tests on my local machine.
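
A minimal sketch of the kind of check described in Note 1 (the helper and file handling are hypothetical; the actual plugin code may differ):

    import os

    def load_saved_totals(path, logger):
        # skip recovery when the saved .vars file is missing or empty
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            logger.error("cannot recover totals, %s is missing or empty" % path)
            return None
        with open(path) as f:
            return f.read().splitlines()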