dementrock opened this issue 4 months ago (status: Open)
Met the same problem when using 2 nodes (16 GPUs) to finetune a Llama 2 model. I use NFS to synchronize the dataset files, but it still causes a FileNotFoundError on the second node, even though the cached train dataset files were already synchronized after the first node generated them.
Met the same problem when using multiple nodes; moving the dataset to a shared disk solved it.
Setting the seed and using NFS to synchronize the dataset solved the problem.
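For reference, my understanding of why the seed matters (this is an assumption about the cache naming scheme, not the exact Megatron-LM code): the cached index filenames include a hash of the dataset configuration, and that configuration contains the random seed, so nodes launched with different seeds look for cache files that were never written. A rough sketch:

```python
import hashlib
import json

def cache_prefix(dataset_config: dict) -> str:
    # Hash the full dataset description (path, split, sequence length, seed, ...).
    # If any field differs between nodes -- e.g. the random seed -- the hash differs,
    # and each node looks for cache files the other nodes never wrote.
    description = json.dumps(dataset_config, sort_keys=True)
    return hashlib.md5(description.encode("utf-8")).hexdigest()

# Two nodes that disagree only on the seed produce different cache prefixes.
print(cache_prefix({"path": "/data/pile", "seq_len": 4096, "seed": 1234}))
print(cache_prefix({"path": "/data/pile", "seq_len": 4096, "seed": 42}))
```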
Marking as stale. No activity in 60 days.
Hi, you can check this simple tutorial to learn how to use NFS to share files between different nodes. https://bluexp.netapp.com/blog/azure-anf-blg-linux-nfs-server-how-to-set-up-server-and-client
I recommend installing nfs-kernel-server on the master node and the NFS client on the other nodes. Create a dataset directory (for your Megatron-LM training) on the master node and define access for the other nodes in the exports file, so that the training dataset can be synchronized across all your nodes.
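For example, assuming a Debian/Ubuntu-style system (the export path and subnet below are placeholders; adjust them to your cluster):

```sh
# On the master node: install the NFS server and export the dataset directory
sudo apt install nfs-kernel-server
sudo mkdir -p /data/megatron-datasets
echo "/data/megatron-datasets  10.0.0.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On every other node: install the NFS client and mount the shared directory
sudo apt install nfs-common
sudo mkdir -p /data/megatron-datasets
sudo mount <master-node-ip>:/data/megatron-datasets /data/megatron-datasets
```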
By the way, it may not work the first time you launch the training scripts (it seems like a bug where dist.barrier() does not work well in newer PyTorch versions, but the master node will still successfully generate the train files). So you can just launch it again; the second time it won't regenerate the train files and will start training.
Marking as stale. No activity in 60 days.
Describe the bug If the training data does not live on NFS but on node-specific storage, the current logic at https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/core/datasets/gpt_dataset.py#L346 skips building the indices and results in an error when loading the document index at https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/core/datasets/gpt_dataset.py#L484, complaining that the file does not exist.
To Reproduce Try running multi-node training, pointing to training data not living on NFS.
Expected behavior Ideally there should be a flag indicating whether the data storage is a shared file system. If it is not, the index needs to be built on each node separately, as sketched below.
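As an illustration only (the flag and helper names is_shared_fs, build_fn, and load_fn are made up for this sketch and are not Megatron-LM APIs), the expected behavior could look roughly like this:

```python
import os
import torch

def build_or_load_indices(build_fn, load_fn, is_shared_fs: bool):
    """Sketch: build the index files on one rank per filesystem, then have everyone load.

    is_shared_fs, build_fn and load_fn are hypothetical names used only for this
    illustration; they are not actual Megatron-LM functions or flags.
    """
    rank = torch.distributed.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    if is_shared_fs:
        # Shared storage (e.g. NFS): only global rank 0 needs to write the files.
        should_build = rank == 0
    else:
        # Node-local storage: one rank per node must write its own copy.
        should_build = local_rank == 0

    if should_build:
        build_fn()  # writes the document/sample/shuffle index files

    # Ensure the files exist everywhere before any rank tries to load them.
    torch.distributed.barrier()
    return load_fn()
```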
Stack trace/logs
Environment (please complete the following information):
Container: nvcr.io/nvidia/nemo:24.05.01
PyTorch version: 2.3.0a0+ebedce2
Proposed fix My workaround is the following patch:
But it does not offer the flexibility of a flag.