Open dushyantbehl opened 4 months ago
We hit this issue during our benchmarking runs: hundreds of Aim runs appeared as active because NCCL and GPU out-of-memory exceptions terminated the job without closing the Aim experiments. This large number of active experiments eventually caused the web dashboard to stop working.
To fix this we basically did what you suggest in the issue description. Only instead of using atexit() we caught exceptions inside a wrapper file that invokes tuning.sft_trainer::train(). In our exception handler we manually invoked the close() method on the Aim run.
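The wrapper approach described in this comment could look roughly like the sketch below. It is an illustration under assumptions, not the code we actually run: `DummyRun` is a hypothetical stand-in for a tracker run object that exposes `close()`, and `run_with_cleanup` is a made-up helper name. In a real setup `train_fn` would call the trainer's `train()` entry point.

```python
class DummyRun:
    """Hypothetical stand-in for a tracker run object that exposes close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        # A real tracker would flush and finalize the run here.
        self.closed = True


def run_with_cleanup(train_fn, run):
    """Call train_fn() and close the run whether or not training raises.

    The original exception (for example an out-of-memory error) still
    propagates to the caller after the run has been closed.
    """
    try:
        return train_fn()
    finally:
        run.close()
```

Using `finally` rather than a bare `except` closes the run on success as well as on failure, and it avoids swallowing the error that caused the crash.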
Is your feature request related to a problem? Please describe.
Currently, when the fine-tuning script crashes, much of the state associated with it is lost, and some open file descriptors or connections are left unclosed. For example, runs tracked by Aim show as running even though the program has exited.
The proposal is to have an exit handler which will run close on these descriptors and even allow saving some state from the system before exiting.
Describe the solution you'd like
Need to look into what helps here. Modules like https://docs.python.org/3/library/atexit.html exist, but they only help in certain scenarios and not all of them.