foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

feat: Need a way to execute some cleanup calls before the program exits or crashes. #271

Open dushyantbehl opened 4 months ago

dushyantbehl commented 4 months ago

Is your feature request related to a problem? Please describe.

Currently, when the fine-tuning script crashes, much of the state associated with it is lost, and some file descriptors or connections are left open. For example, runs tracked by Aim still show as running even though the program has exited.


The proposal is to add an exit handler that closes these descriptors and connections and, where possible, saves some state from the system before the process exits.

Describe the solution you'd like

We need to look into what helps here. Modules like https://docs.python.org/3/library/atexit.html exist, but they only help in certain scenarios, not all of them.
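For context on the gap: per the Python docs, functions registered with atexit are not called when the process is killed by a signal that Python does not handle, when a fatal internal error is detected, or when os._exit() is called. A minimal sketch of one way to widen the coverage is below. It assumes a hypothetical `CleanupRegistry` helper and `install_exit_handlers` function (neither exists in this repo) and converts SIGTERM into a normal interpreter exit so the registered callbacks still run. It cannot help with SIGKILL or hard crashes.

```python
import atexit
import signal
import sys


class CleanupRegistry:
    """Hypothetical helper: collects close callbacks and runs each at most once."""

    def __init__(self):
        self._callbacks = []
        self._done = False

    def register(self, callback):
        # callback is any zero-argument callable, e.g. a tracker run's close method
        self._callbacks.append(callback)

    def run(self):
        # Guard against running twice (e.g. once from a signal path, once from atexit).
        if self._done:
            return
        self._done = True
        # Close in reverse registration order, like nested context managers.
        for callback in reversed(self._callbacks):
            try:
                callback()
            except Exception as exc:
                # One failing callback should not prevent the others from running.
                print(f"cleanup callback failed: {exc!r}", file=sys.stderr)


def install_exit_handlers(registry):
    """Run the registry on normal exit, uncaught exceptions, and SIGTERM."""
    atexit.register(registry.run)

    def _on_sigterm(signum, frame):
        # sys.exit raises SystemExit, so finally blocks and atexit handlers run.
        sys.exit(128 + signum)

    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, _on_sigterm)
```

Usage would be something like `registry = CleanupRegistry(); registry.register(run.close); install_exit_handlers(registry)` near the start of the training entry point, where `run` is whatever tracker object needs closing.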

VassilisVassiliadis commented 1 week ago

We faced this issue during our benchmarking runs: hundreds of AIM runs appeared active because NCCL and GPU out-of-memory exceptions did not close the AIM experiments. This large number of active experiments eventually caused the web dashboard to stop working.

To fix this, we basically did what you suggest in the issue description, except that instead of using atexit we caught exceptions inside a wrapper file that invokes tuning.sft_trainer::train(). In our exception handler we manually invoked the close() method on the AIM run.
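A rough sketch of that wrapper pattern follows. This is not the actual wrapper file: `TrackedRun` is a stand-in for the AIM run (only a `close()` method is assumed), `run_with_cleanup` is a hypothetical name, and `train_fn` stands for whatever callable ends up invoking the trainer. It also uses `finally` rather than closing only inside the exception handler, so the run is closed on success too.

```python
import sys


class TrackedRun:
    """Stand-in for an experiment-tracker run; only close() is assumed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_with_cleanup(train_fn, run):
    """Call train_fn(); always close the run, then re-raise any failure."""
    try:
        return train_fn()
    except Exception as exc:
        # Log the failure (e.g. NCCL or out-of-memory errors), then propagate it
        # so the process still exits with a non-zero status.
        print(f"training failed: {exc!r}", file=sys.stderr)
        raise
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt/SystemExit.
        run.close()
```

Re-raising keeps the original traceback and exit status visible to whatever launched the job, while the tracker no longer shows the run as active.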