Open Yijia-Xiao opened 2 years ago
Hello, has this issue been resolved?
This issue will be addressed in the next few days by an update to preprocess_data.py that allows processing a large dataset in multiple partitions, thereby avoiding OOM errors. I'll update this issue when the update lands.
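Until then, the partitioning idea can be approximated by hand. The following is only a minimal sketch, not the planned preprocess_data.py change: it streams a large JSONL file line by line and writes fixed-size partition files, so each one can be preprocessed on its own without holding the full corpus in memory. The function name `split_jsonl`, the `part_NNNNN.jsonl` naming, and the `lines_per_partition` parameter are all hypothetical choices made for illustration.

```python
import os


def split_jsonl(input_path, output_dir, lines_per_partition):
    """Stream a JSONL file into numbered partition files and return their paths.

    Hypothetical helper: only one line is held in memory at a time.
    """
    if lines_per_partition <= 0:
        raise ValueError("lines_per_partition must be positive")
    os.makedirs(output_dir, exist_ok=True)

    paths = []
    out = None
    try:
        with open(input_path, "r", encoding="utf-8") as src:
            for i, line in enumerate(src):
                # Start a new partition file every `lines_per_partition` lines.
                if i % lines_per_partition == 0:
                    if out is not None:
                        out.close()
                    path = os.path.join(
                        output_dir, f"part_{len(paths):05d}.jsonl"
                    )
                    out = open(path, "w", encoding="utf-8")
                    paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Each resulting partition could then be passed to preprocess_data.py as a separate run. How the per-partition .bin and .idx outputs are combined afterwards depends on the tooling available in your checkout, so that step is deliberately left out here.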
Marking as stale. No activity in 60 days.
What percentage of the project is already finished?
You can go ahead and finish it; it's authorized.
Marking as stale. No activity in 60 days.
Hi, thank you for your great work! I've been using Megatron-LM for some time, and I've run into a problem building a large dataset. I used preprocess_data.py to convert a jsonl file (about 1TB) into .bin and .idx files; the server has 504GB of memory. Unfortunately, when the *.bin file grows to about 600GB, the process appears to die. Is there a solution for large corpora, or would the lazy loader work? Thank you :)