---------- Forwarded message ---------
From: Nahorniak, Jasmine
Date: Fri, Nov 22, 2024 at 3:20 PM
Subject: Possible logger issue solution
To: David Pablo Cohn, King, Tristan
Pablo,
Regarding the issue we encountered where the loggers were failing to start (sometimes none would start, sometimes just a few would start) ... we think we found the cause and a possible solution. I have cc’d Tristan King (our newest team member) who helped diagnose the issue with me.
So far it looks like the solution has fixed some, if not all, of the other recent OpenRVDAS issues we were having too (database lock errors, an import error, etc.). We wanted to run the theory and solution by you just in case there are unexpected consequences to the fix we came up with.
Background:
We run logger_manager as a supervisor service using the following call (note the use of the --config argument):
/opt/openrvdas/venv/bin/python server/logger_manager.py --database django --data_server_websocket :8766 -v -V --no-console --config ./local/coriolix/cruise.yaml
After a recent update to OpenRVDAS (we updated from a December 2023 version), we were unable to get most of the loggers to start. Sometimes none would start, sometimes just one or two. There weren't any obvious errors in the logs, and the logger manager wasn't even attempting to start the next logger.
We added a slew of debug statements to logger_manager.py and django_server_api.py to determine exactly where the code was hanging; our suspicion was that it was hung on a lock.
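Roughly, the instrumentation was of this form (a paraphrased sketch with placeholder names, not the exact statements we added; the real locks and call sites are in logger_manager.py and django_server_api.py):

    # Paraphrased sketch only -- placeholder lock and function names, not the
    # exact debug statements added to logger_manager.py / django_server_api.py.
    import logging
    import threading

    logging.basicConfig(level=logging.DEBUG)

    config_lock = threading.Lock()   # stand-in for the real config_lock

    def update_configs():
        logging.debug('update_configs: waiting for config_lock')
        with config_lock:
            logging.debug('update_configs: acquired config_lock')
            # ... protected work would happen here ...
        logging.debug('update_configs: released config_lock')

    update_configs()

The last "waiting for ..." message without a matching "acquired ..." is what points at the lock a given code path is stuck on.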
The order of operations on logger_manager startup seemed to be:
1. logger_manager starts three threads.
2. logger_manager checks for args.config; if it is set, it tries to upload the cruise config (load_configuration in django_server_api.py). This path doesn't take a config_lock, but it does acquire a database lock.
3. At the same time, one of the threads from step 1 (_send_cruise_definition_loop) is also trying to manage the cruise config (_load_new_definition_from_api). It takes the config_lock but then has to wait for the database lock, because load_configuration holds it.
4. load_configuration reaches a point where it needs the config_lock (set_active_logger_config > signal_update() > update_configs), but it can't get it because the _send_cruise_definition_loop thread is holding it.
So the two threads are deadlocked.
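To make the interleaving concrete, here is a minimal standalone sketch of the same lock-ordering pattern (the names echo the ones above, but this is illustrative code, not code taken from OpenRVDAS):

    # Minimal illustration of the deadlock described above: each side takes one
    # lock and then waits forever for the other. The names echo OpenRVDAS, but
    # this is a standalone sketch, not the project's actual implementation.
    import threading
    import time

    config_lock = threading.Lock()
    database_lock = threading.Lock()

    def load_configuration():
        # main-thread path: database lock first, then the config_lock
        with database_lock:
            time.sleep(0.1)          # give the other thread time to grab config_lock
            with config_lock:        # blocks: the loop thread holds config_lock
                print('load_configuration finished')

    def send_cruise_definition_loop():
        # background-thread path: config_lock first, then the database lock
        with config_lock:
            time.sleep(0.1)
            with database_lock:      # blocks: load_configuration holds database_lock
                print('send_cruise_definition finished')

    t1 = threading.Thread(target=load_configuration, daemon=True)
    t2 = threading.Thread(target=send_cruise_definition_loop, daemon=True)
    t1.start()
    t2.start()
    t1.join(timeout=2)
    t2.join(timeout=2)
    print('deadlocked' if t1.is_alive() or t2.is_alive() else 'completed')   # prints "deadlocked"

In the real code the first path is the main thread handling args.config and the second is the _send_cruise_definition_loop thread, but the shape of the problem is the same.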
The solution we found was to simply move the line “logger_manager.start()” in logger_manager.py to after the “if args.config” section. That way the config is fully loaded before the threads are started.
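In other words, the tail of logger_manager.py changes roughly like this (a heavily abbreviated sketch; the stubs below just stand in for the real objects, so only the relative order of the two steps matters):

    # Abbreviated, illustrative sketch of the reordering -- the stubs stand in
    # for the real logger_manager and API objects; only the ordering matters.

    class LoggerManagerStub:
        def start(self):
            print('worker threads started')             # the three threads from step 1

    def load_configuration_stub(path):
        print('cruise definition loaded from ' + path)  # the args.config upload

    logger_manager = LoggerManagerStub()
    config = './local/coriolix/cruise.yaml'             # i.e. args.config

    # Old order (deadlock-prone): the threads start first, then the config upload races them.
    #   logger_manager.start()
    #   if config:
    #       load_configuration_stub(config)

    # New order: finish loading the config, then start the threads.
    if config:
        load_configuration_stub(config)
    logger_manager.start()

With this ordering there is no window in which load_configuration and the _send_cruise_definition_loop thread can each be holding one of the two locks while waiting for the other.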
We believe it is the use of the "--config" argument that exposed the problems. If this argument isn't commonly used, that may explain why this hasn't come up before.
Does this all make sense? If not, let me know and we can share more of our debugging findings with you.
Thanks!!!
Jas.