NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

About building *.bin and *.idx #154

Open Yijia-Xiao opened 2 years ago

Yijia-Xiao commented 2 years ago

Hi, thank you for your great work! I've been using Megatron-LM for some time, and I've run into a problem building a large dataset. I used preprocess_data.py to convert a JSONL file (about 1 TB) into .bin and .idx files on a server with 504 GB of memory. Unfortunately, when the *.bin file grows to about 600 GB, the process appears to die. Is there a solution for a corpus this large, or would the lazy loader work?

Thank you:)

Ant0082 commented 1 year ago

Hello, has this issue been resolved?

jon-barker commented 1 year ago

This issue will be addressed in the next few days by an update to preprocess_data.py that allows a large dataset to be processed in multiple partitions, thereby avoiding OOM errors. I'll update this issue when the update lands.
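
For illustration only, here is a minimal sketch of the general partitioning idea. It is not the actual update to preprocess_data.py. It streams a large JSONL file line by line and distributes the records round-robin into smaller files, so memory use stays roughly constant regardless of input size. The function name `split_jsonl`, the `part_NNNN.jsonl` naming scheme, and the round-robin policy are all assumptions made for this example.

```python
import os
from contextlib import ExitStack


def split_jsonl(input_path, out_dir, num_partitions):
    """Stream a JSONL file into `num_partitions` smaller files (round-robin).

    Hypothetical helper for illustration; not part of Megatron-LM.
    Only one line is held in memory at a time.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    # Hypothetical naming scheme: part_0000.jsonl, part_0001.jsonl, ...
    paths = [
        os.path.join(out_dir, f"part_{i:04d}.jsonl")
        for i in range(num_partitions)
    ]
    with ExitStack() as stack:
        src = stack.enter_context(open(input_path, "r", encoding="utf-8"))
        outs = [
            stack.enter_context(open(p, "w", encoding="utf-8")) for p in paths
        ]
        written = 0
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            if not line.endswith("\n"):
                line += "\n"  # make sure the final record is terminated
            outs[written % num_partitions].write(line)
            written += 1
    return paths
```

Each resulting partition could then be preprocessed on its own, so that no single run has to handle the whole corpus at once. How the per-partition outputs are combined afterwards depends on the tooling and is not covered by this sketch.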

github-actions[bot] commented 10 months ago

Marking as stale. No activity in 60 days.

felipeliliti commented 2 months ago

What percentage of the project is left to finish?

felipeliliti commented 2 months ago

You can go ahead and finish it; it's authorized.

github-actions[bot] commented 1 week ago

Marking as stale. No activity in 60 days.