OK, I wrote out a diagram for the database entries' finite state machine, and I have a working hypothesis for what is happening.
It is possible for this error to be triggered when the following conditions are met:
The key insight is that there is a race condition between the database manager processing the loss of a job from the queue, and the 'stop request' coming from the worker in the job that just stopped.
Strong evidence for this hypothesis would be an entry for the job appearing in 'database_manager.failed' while the corresponding entry in the DB is marked 'done' (assuming we kill everything quickly after the exception is raised).
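To make the hypothesised race concrete, here is a toy model (pure asyncio, nothing adaptive-scheduler specific; the task names, sleeps and the dict "database" are all made up for illustration):

```python
import asyncio

# Shared "database": one entry for a job that is currently running.
db = {"job-0": "running"}


async def queue_watcher():
    # Database manager notices the job has vanished from the scheduler queue.
    await asyncio.sleep(0.01)
    if db.get("job-0") == "running":
        db["job-0"] = "failed"


async def stop_request():
    # The worker's own 'stop request' for the same job arrives.
    await asyncio.sleep(0.01)
    if db.get("job-0") == "running":
        db["job-0"] = "done"


async def main():
    # Whichever task runs first wins: the entry can end up 'failed'
    # even though the job actually finished successfully.
    await asyncio.gather(queue_watcher(), stop_request())
    print(db)


asyncio.run(main())
```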
Glad you’re looking into this! I might have seen this issue pop up once or twice. Are you saying that you can consistently reproduce this problem using more than 100 learners?
Given your hypothesis, I imagine that the problem would occur more frequently when all learners can load their data and instantly be done.
You could, for example, create 1000 SequenceLearners with a sequence of length 1 and set the JobManager's interval to 1.
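Something along these lines (untested sketch; the exact RunManager/scheduler constructor arguments differ between adaptive-scheduler versions, and `job_manager_interval` in particular is my guess at the keyword name):

```python
import adaptive
import adaptive_scheduler


def identity(x):
    return x


# 1000 trivial learners that are done as soon as their (length-1) sequence is evaluated.
learners = [adaptive.SequenceLearner(identity, sequence=[i]) for i in range(1000)]
fnames = [f"data/learner_{i}.pickle" for i in range(1000)]

run_manager = adaptive_scheduler.RunManager(
    scheduler=adaptive_scheduler.scheduler.SLURM(cores=1),  # whatever scheduler you use
    learners=learners,
    fnames=fnames,
    job_manager_interval=1,  # assumed keyword: start/poll jobs every second
)
run_manager.start()
```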
the job is successful and disappears from the job queue
I believe launcher.py blocks on this step when calling tell_done
Glad you’re looking into this! I might have seen this issue pop up once or twice. Are you saying that you can consistently reproduce this problem using more than 100 learners?
I cannot consistently reproduce the problem, but it seems to occur more frequently with more learners. Another thing that confuses me is that the notebook gets flooded with log messages from this event; the code suggests that this should happen at most once every 5 seconds.
the job is successful and disappears from the job queue
I believe launcher.py blocks on this step when calling tell_done
Yes, after looking a bit more it seems that the database should have been updated before the worker is allowed to die.
After a lot of debugging I managed to zero in on the problem.
The issue occurs when a node dies before it registers itself with the adaptive-scheduler database manager.
When this happens the database entries are in the state (is_pending=True, job_id=None, job_name="something"), and the jobs have disappeared from the queue.
adaptive-scheduler does not update the database in response to this because the selection logic[1] is faulty; it looks for entries based on 'job_id', which is only set once the worker registers. On the other hand, the logic for selecting when to start jobs[2] uses the job names, which are set as soon as the job is launched.
The fix is to modify the database update logic[1] to select based on job name rather than job id.
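To illustrate the difference (a simplified stand-alone sketch, not the actual adaptive-scheduler code; the entry fields are reduced to the ones that matter here):

```python
# Simplified stand-in for the database entries described above.
entries = [
    {"fname": "a.pickle", "is_pending": True, "job_id": None, "job_name": "job-0"},
    {"fname": "b.pickle", "is_pending": True, "job_id": "12345", "job_name": "job-1"},
]

# "job-0" died before registering, so it has no job_id and is gone from the queue.
queued_job_ids = {"12345"}
queued_job_names = {"job-1"}


def stale_by_job_id(entries, queued_job_ids):
    # Faulty selection: an entry whose job died before registering still has
    # job_id=None, so it is never matched and never marked as failed.
    return [
        e for e in entries
        if e["is_pending"] and e["job_id"] is not None and e["job_id"] not in queued_job_ids
    ]


def stale_by_job_name(entries, queued_job_names):
    # Fixed selection: job_name is set at launch time, so entries for jobs that
    # vanished from the queue are found even if the worker never registered.
    return [
        e for e in entries
        if e["is_pending"] and e["job_name"] is not None and e["job_name"] not in queued_job_names
    ]


print(stale_by_job_id(entries, queued_job_ids))      # [] -- misses "job-0"
print(stale_by_job_name(entries, queued_job_names))  # finds the "job-0" entry
```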
A couple more notes that might make these kinds of problems easier to debug in the future:
In order to get enough information for debugging I had to modify adaptive-scheduler to emit additional debugging information, e.g.: https://github.com/basnijholt/adaptive-scheduler/blob/698ecf11aac138124c4dc442d57b6fb3aef78172/adaptive_scheduler/_server_support/database_manager.py#L218
I then had to ask my users to pull down and use this modified code (as I was unable to reproduce the error myself).
Instead, we could use trace-level logging to record this rich information (I saved the method name, queue state, and database state), leveraging our existing structlog dependency. Getting useful traces from users would then just be a matter of setting the log level appropriately.
We could improve on my ad-hoc approach by also recording which asyncio task the method is called from (this caused me some additional confusion during the debugging process, as at least 3 tasks touch the database). A rough sketch of what such logging could look like follows below.
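The helper name and the snapshot arguments here are made up for illustration (structlog has no built-in TRACE level, so this uses debug-level events as a stand-in):

```python
import asyncio

import structlog

log = structlog.get_logger("adaptive_scheduler.database_manager")


def log_db_event(method_name, queue_state, db_entries):
    """Emit one structured event with everything needed to reconstruct a race.

    `queue_state` and `db_entries` are whatever snapshots the calling method
    already has at hand; this helper is illustrative, not part of the real API.
    """
    try:
        task = asyncio.current_task()
        task_name = task.get_name() if task is not None else None
    except RuntimeError:  # not inside a running event loop
        task_name = None
    log.debug(
        "database_manager_event",
        method=method_name,
        task=task_name,
        queue=queue_state,
        db=db_entries,
    )
```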
This bug appeared because there was a mismatch between filtering on job_id vs job_name. However, it is redundant to store both of these pieces of information, as we enforce that there may only be at most one job with a given job name alive at a time.
We could consider removing 'job_id' from the DB entries in order to make this kind of programming error less likely going forward.
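For concreteness, the slimmed-down entry could look something like this (field names are illustrative, not the actual adaptive-scheduler schema):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DBEntry:
    # job_name is assigned at launch time and is unique among live jobs, so it
    # can serve as the single identifier; no separate job_id field is needed.
    fname: str
    job_name: Optional[str] = None
    is_pending: bool = False
    is_done: bool = False
```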
Both solutions seem good; however, adding more logging is certainly a good idea regardless.
Nice job on getting to the bottom of this!
Thanks a lot for the fix! Just so you know, I'm now getting an odd formatting issue in the runner widget:
When I launch an adaptive-scheduler run with ~100 learners or more, I get a RuntimeError after some time, which is triggered by this line: https://github.com/basnijholt/adaptive-scheduler/blob/e1702e680f18bf19d52ebb37a101d59dfca027b4/adaptive_scheduler/_server_support/database_manager.py#L261
This happens with the latest adaptive-scheduler release, and previous releases back to 0.2.3 (when this code block was introduced).
The run manager continues to run (the exception is logged and swallowed before it bubbles to the top), but the notebook output gets filled to the point that the notebook crashes.
This condition is triggered when adaptive-scheduler starts new jobs (because it sees that there are still learners to do), but when it actually tries to get an fname out of the database that meets the necessary conditions (https://github.com/basnijholt/adaptive-scheduler/blob/e1702e680f18bf19d52ebb37a101d59dfca027b4/adaptive_scheduler/_server_support/database_manager.py#L257), it cannot find one.
Even more vexing: if I kill the run manager as soon as this condition is detected and inspect the state of the database, I can run the 'get' and I do get an fname back. To me this indicates that there is some kind of race condition happening.