Open evellasques opened 2 months ago
Thanks for reporting the issue. We're looking at it. So far, it appears to come from PTL's torch_io.remove_checkpoint(), but we'll check whether anything on the Neuron side can help.
We have identified the source of the issue. It's mainly coming from this API: neuronx_distributed's CheckpointIO class has not implemented the remove_checkpoint API, so it falls back to PyTorch Lightning's default implementation, which assumes DDP. You are right that all TP workers start deleting the same file; we need to rewrite the API so that only one worker deletes at a time. We are looking into it and will ship a fix in an upcoming release. To unblock yourself, you can override the API so that only one rank (usually 0) deletes the directory while the others wait. Sample implementation below:
```python
import torch_xla.core.xla_model as xm

def remove_checkpoint(self, filepath):
    # Only one worker (global rank 0) performs the delete.
    if xm.get_ordinal() == 0:
        # call delete here, e.g. shutil.rmtree(filepath)
        ...
    # All workers synchronize so none races ahead of the delete.
    xm.rendezvous('Deleting checkpoint')
```
Feel free to submit a pull request if you believe this resolves the issue.
I'm training a model using the PyTorch Lightning plugin with a limit on the number of kept checkpoints:
The problem is, when the limit defined in save_top_k is reached, PTL will at some point call lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86). This recursively removes the files under the oldest saved checkpoint, but then it tries to remove an already-removed checkpoint file (I'm using xser) and crashes:
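The crash pattern is easy to reproduce in isolation: once one process has removed the checkpoint directory, any other process attempting the same removal hits the missing path. A minimal sketch (plain shutil on a temp directory, standing in for two racing TP workers):

```python
import shutil
import tempfile

path = tempfile.mkdtemp()  # stand-in for the oldest checkpoint directory
shutil.rmtree(path)        # the first worker's removal succeeds
try:
    shutil.rmtree(path)    # a second racing worker hits the already-removed path
    outcome = "no error"
except FileNotFoundError:
    outcome = "FileNotFoundError"
print(outcome)  # prints: FileNotFoundError
```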
As you can see, more than one process is trying to remove the same file. I think this would just be a matter of running checkpoint removal only at global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).
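The rank-0-only removal can be sanity-checked without Trainium hardware. Below is a hypothetical, self-contained sketch where Python threads stand in for TP workers and a threading.Barrier plays the role of xm.rendezvous: only "rank" 0 deletes, the rest wait, and no worker sees a missing path.

```python
import os
import shutil
import tempfile
import threading

NUM_RANKS = 4                      # stand-in for TP workers
barrier = threading.Barrier(NUM_RANKS)  # plays the role of xm.rendezvous
ckpt = tempfile.mkdtemp()          # stand-in for the checkpoint directory
errors = []

def remove_checkpoint(rank):
    # Only rank 0 deletes; every rank then waits at the barrier,
    # mirroring the guarded remove_checkpoint override.
    if rank == 0:
        try:
            shutil.rmtree(ckpt)
        except FileNotFoundError as e:
            errors.append(e)
    barrier.wait()

threads = [threading.Thread(target=remove_checkpoint, args=(r,))
           for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("errors:", len(errors), "dir exists:", os.path.exists(ckpt))
# prints: errors: 0 dir exists: False
```

With the guard in place the directory is removed exactly once; dropping the `if rank == 0` check reintroduces the race from the traceback.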
Here is relevant info about my environment:
pip freeze:
Neuron libraries: