aws-neuron / neuronx-distributed


Clean up of old checkpoints is crashing #20

Open evellasques opened 2 months ago

evellasques commented 2 months ago

I'm training a model using the PyTorch Lightning plug-in, with a limit on the number of kept checkpoints:

ModelCheckpoint(
    save_top_k=args.num_kept_checkpoint,
    monitor="global_step",
    mode="max",
    every_n_train_steps=args.checkpoint_freq,
    dirpath=args.checkpoint_dir,
    enable_version_counter=False,
)

The problem is that when the limit defined in save_top_k is reached, PTL will at some point call lightning_fabric.plugins.io.torch_io.remove_checkpoint() (https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/fabric/plugins/io/torch_io.py#L86), which recursively removes the files under the oldest saved checkpoint:

fs = get_filesystem(path)
if fs.exists(path):
    fs.rm(path, recursive=True)
    log.debug(f"Removed checkpoint: {path}")

but when a worker then tries to remove an already-removed checkpoint file (I'm using xser), it crashes:

 File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'tensor_479.pt'
    self._run(model, ckpt_path=ckpt_path)

As you can see, more than one process is trying to remove the same file. I think fixing this would just be a matter of running checkpoint removal only on global rank 0 (I'm currently training on 16 nodes, with TP=8 and PP=1).
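
A standalone sketch of the race (illustrative only, not taken from my training script): several processes recursively deleting the same checkpoint directory, the way every TP worker does today. Whichever workers lose the race can hit the same FileNotFoundError as above.

import multiprocessing as mp
import os
import shutil
import tempfile

def remove(path):
    # Mirrors Lightning's fs.rm(path, recursive=True), which ends up in shutil.rmtree
    shutil.rmtree(path)

if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()
    open(os.path.join(ckpt_dir, "tensor_479.pt"), "w").close()
    # Several workers (one per TP rank) delete the same directory concurrently;
    # the losers of the race may crash with FileNotFoundError.
    workers = [mp.Process(target=remove, args=(ckpt_dir,)) for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()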

Here is relevant info about my environment:

pip freeze:

neuronx-cc==2.13.68.0+6dfecc895
neuronx-distributed==0.7.0
torch==1.13.1
torch-neuronx==1.13.1.1.14.0
torch-xla==1.13.1+torchneurone
transformers==4.31.0

Neuron libraries:

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed]
jyang-aws commented 2 months ago

Thanks for reporting the issue. We're looking into it. So far, it appears this comes from PTL's torch_io.remove_checkpoint(), but we'll check whether anything on the Neuron side can help.

aws-rhsoln commented 2 months ago

We have identified the source of the issue. It's mainly coming from this API. neuronx_distributed's CheckpointIO class has not implemented the remove_checkpoint API, so it falls back to PyTorch Lightning's default implementation, which assumes DDP. You are right in the sense that all TP workers start deleting the same file, and we need to rewrite the API to ensure that only one worker deletes at a time. We are looking into it and will have a fix in one of the upcoming releases. To unblock yourself, you can override the API and ensure that only one rank (usually 0) deletes the directory while the others wait. Sample implementation below:

import shutil
import torch_xla.core.xla_model as xm

def remove_checkpoint(self, filepath):
    if xm.get_ordinal() == 0:
        shutil.rmtree(filepath, ignore_errors=True)  # call delete on global rank 0 only
    xm.rendezvous('Deleting checkpoint')  # all other ranks wait here until rank 0 is done
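
If it helps, here is a rough end-to-end sketch of wiring the override in via a CheckpointIO subclass. The NeuronCheckpointIO import below is an assumption; adjust it to whichever CheckpointIO class your script already uses, and keep the rest of your Trainer configuration as is.

import shutil

import torch_xla.core.xla_model as xm
from pytorch_lightning import Trainer
# Assumed import path; swap in the CheckpointIO class you are currently using.
from neuronx_distributed.lightning import NeuronCheckpointIO

class RankZeroRemoveCheckpointIO(NeuronCheckpointIO):
    def remove_checkpoint(self, filepath):
        if xm.get_ordinal() == 0:
            shutil.rmtree(filepath, ignore_errors=True)  # only rank 0 touches the filesystem
        xm.rendezvous('Deleting checkpoint')  # everyone waits for the deletion to finish

# Pass it through the plugins list (or set it as your strategy's checkpoint_io).
trainer = Trainer(plugins=[RankZeroRemoveCheckpointIO()])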

Feel free to submit a pull request if you believe this resolves the issue.