OceanDataTools / openrvdas

An open source data acquisition system designed for use on research vessels and other scientific installations
http://openrvdas.org

CORIOLIX calling sequence causes logger_manager deadlock #404

Closed by davidpablocohn 1 week ago

davidpablocohn commented 1 week ago

From Jasmine Nahorniak:

---------- Forwarded message ---------
From: Nahorniak, Jasmine
Date: Fri, Nov 22, 2024 at 3:20 PM
Subject: Possible logger issue solution
To: David Pablo Cohn; King, Tristan

Pablo,

Regarding the issue we encountered where the loggers were failing to start (sometimes none would start, sometimes just a few would start) ... we think we found the cause and a possible solution. I have cc’d Tristan King (our newest team member) who helped diagnose the issue with me.

So far it looks like the solution has fixed some, if not all, of the other recent OpenRVDAS issues we were having too (database lock errors, import errors, etc.). We wanted to run the theory and solution by you just in case there are unexpected consequences to the fix we came up with.

Background:

We run logger_manager as a supervisor service using the following call (notice the use of the --config argument):

/opt/openrvdas/venv/bin/python server/logger_manager.py --database django --data_server_websocket :8766 -v -V --no-console --config ./local/coriolix/cruise.yaml

After a recent update to OpenRVDAS (we updated from a Dec 2023 version), we were unable to get most of the loggers to start. Sometimes none would start, sometimes just one or two. There weren't any obvious errors in the logs; the logger manager wasn't even trying to start the next logger.

We added a slew of debug statements to the logger_manager.py and django_server_api.py files to determine where exactly the code was hanging, our suspicion being that it was hung on a lock.
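For reference, a hang like this shows up quickly if each lock acquisition is logged. Here is a minimal sketch of that idea; the traced_lock helper below is purely illustrative and is not part of OpenRVDAS or our actual debug code:

```python
import logging
import threading
from contextlib import contextmanager

@contextmanager
def traced_lock(lock, name):
    """Log before and after acquiring a lock so a hung acquire is visible in the logs."""
    thread = threading.current_thread().name
    logging.debug('%s waiting for %s', thread, name)
    with lock:
        logging.debug('%s acquired %s', thread, name)
        yield
    logging.debug('%s released %s', thread, name)

# Illustrative usage inside a method that currently does "with self.config_lock: ...":
#   with traced_lock(self.config_lock, 'config_lock'):
#       ...
```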

The order of operations on logger_manager startup seemed to be:

  1. logger_manager starts three threads.

  2. logger_manager checks for args.config and, if it is set, tries to load the cruise config (load_configuration in django_server_api.py); this path doesn't take the config_lock, but it does acquire a database lock.

  3. At the same time, one of the threads from step 1 (_send_cruise_definition_loop) is also trying to manage the cruise config (_load_new_definition_from_api); it acquires the config_lock but has to wait for the database lock, because load_configuration holds it.

  4. load_configuration reaches a point where it needs the config_lock (set_active_logger_config > signal_update() > update_configs), but it can't get it because _send_cruise_definition_loop holds it.

So the two threads are deadlocked: each one holds the lock the other is waiting for.
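This is the classic lock-ordering pattern: two threads taking the same two locks in opposite orders. A stripped-down illustration follows; it is not the actual OpenRVDAS code, the names only mirror the steps above, and running it will hang by design:

```python
import threading
import time

config_lock = threading.Lock()
database_lock = threading.Lock()

def load_configuration():
    # Main-thread path (steps 2 and 4): database lock first, then config lock.
    with database_lock:
        time.sleep(0.1)                 # give the other thread time to grab config_lock
        with config_lock:               # blocks forever: the worker holds config_lock
            print('cruise config loaded')

def send_cruise_definition_loop():
    # Worker-thread path (step 3): config lock first, then database lock.
    with config_lock:
        time.sleep(0.1)
        with database_lock:             # blocks forever: the main thread holds database_lock
            print('cruise definition sent')

threading.Thread(target=send_cruise_definition_loop, daemon=True).start()
load_configuration()                    # deadlocks here
```

Either consistently ordering the lock acquisitions or, as described below, not running the two paths concurrently during startup breaks the cycle.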

The solution we found was to simply move the line “logger_manager.start()” in logger_manager.py to after the “if args.config” section. That way the config is fully loaded before the threads are started.
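Schematically, the reordering looks like this. It is a paraphrase rather than the actual logger_manager.py source, and read_config is a hypothetical stand-in for however the YAML actually gets read:

```python
# Before (schematic): background threads start first, then the main thread
# loads the --config file, and the two paths can deadlock on the two locks.
logger_manager.start()                  # spawns _send_cruise_definition_loop, etc.
if args.config:
    config = read_config(args.config)   # hypothetical YAML-reading helper
    api.load_configuration(config)      # takes the database lock, later needs config_lock

# After (schematic): finish the initial load in the main thread, then start
# the threads, so nothing else is competing for the locks during the load.
if args.config:
    config = read_config(args.config)
    api.load_configuration(config)
logger_manager.start()
```

The essential point is only the ordering: the initial cruise-definition load completes while logger_manager is still single-threaded, so there is no second lock holder to deadlock against.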

We believe it is the use of the "--config" argument that exposed the problems. If this argument isn't commonly used, that may explain why this hasn't come up before.

Does this all make sense? If not, let me know and we can share more of our debugging findings with you.

Thanks!!!

Jas.