bojone / bytepiece

A purer Tokenizer with a higher compression rate
Apache License 2.0

Training hangs on large datasets #12

Open yzlnew opened 6 months ago

yzlnew commented 6 months ago

Training works fine on 20GB of data, but on 100GB it hangs at a certain step. bytepiece==0.6.3


Stack trace of one of the threads below. I can't tell much from it myself; asking GPT suggests it's a multiprocessing issue:

#0  0x00007f168f6207a4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1  0x00007f168f620898 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f1683848699 in semlock_acquire ()
   from /opt/rh/rh-python38/root/usr/lib64/python3.8/lib-dynload/_multiprocessing.cpython-38-x86_64-linux-gnu.so
#3  0x00007f168f7ed4e6 in PyCFunction_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#4  0x00007f168f7ac932 in _PyObject_MakeTpCall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#5  0x00007f168f862c5c in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#6  0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#7  0x00007f168f7ab7bd in PyObject_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#8  0x00007f168f860081 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#9  0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#10 0x00007f168f85e323 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#11 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#12 0x00007f168f85e323 in _PyEval_EvalFrameDefault () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#13 0x00007f168f84fe05 in _PyFunction_Vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#14 0x00007f168f8507cb in method_vectorcall () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#15 0x00007f168f7ab7bd in PyObject_Call () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#16 0x00007f168f8ad6d1 in t_bootstrap () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#17 0x00007f168f86bbc4 in pythread_wrapper () from /opt/rh/rh-python38/root/usr/lib64/libpython3.8.so.rh-python38-1.0
#18 0x00007f168f6174e2 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f168f3f25b3 in clone () from /lib64/libc.so.6
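
(Side note: the C-level trace above only shows a semaphore wait; a Python-level view of all threads can help pin down which call is actually blocked. A minimal sketch using only the stdlib faulthandler module, added here for illustration rather than taken from this thread:

import faulthandler
import signal

# Install near the top of the training script; afterwards,
# `kill -USR1 <pid>` prints every thread's Python traceback to
# stderr without interrupting the (hung) process.
faulthandler.register(signal.SIGUSR1, all_threads=True)
)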
bojone commented 6 months ago

How much system memory does the machine have, and what are the Trainer parameters?

yzlnew commented 6 months ago

1.3TB of RAM; the trainer is roughly as below. Does the memory peak usually occur during the merge phase?

trainer = Trainer(order=6, max_vocab_size=80000, min_count=32, isolate_digits=True)
trainer.train(corpus_instance, workers=128, batch_size=2000)

Edit: after the hang, used memory under this configuration is roughly 300GB.
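
(Side note: corpus_instance is not shown in the thread. Going by bytepiece's typical usage it would be an iterator over raw texts, e.g. a hypothetical generator like the sketch below, with assumed file layout and field name, so the corpus streams from disk and the memory peak comes from the n-gram statistics rather than the data itself:

import glob
import json

def corpus():
    # Hypothetical layout and field name, for illustration only.
    for path in glob.glob('data/*.jsonl'):
        with open(path, encoding='utf-8') as f:
            for line in f:
                yield json.loads(line)['text']

corpus_instance = corpus()
)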

FlyCarrot commented 2 weeks ago

> 1.3TB of RAM; the trainer is roughly as below. Does the memory peak usually occur during the merge phase?
>
> trainer = Trainer(order=6, max_vocab_size=80000, min_count=32, isolate_digits=True)
> trainer.train(corpus_instance, workers=128, batch_size=2000)
>
> Edit: after the hang, used memory under this configuration is roughly 300GB.

I'm seeing the same thing. Tested with 200GB of WuDao data on a machine with 1.0TB of RAM; observed peak memory usage hit 100%, so it appears to run out of memory. bytepiece version: commit c50c43ec. Log output below:

Count Ngrams: 59132213it [4:28:03, 3676.50it/s]
Merge Ngrams:  23% 15/64 [27:49<1:31:37, 112.19s/it]
Merge Ngrams:  30% 19/64 [35:35<1:26:36, 115.48s/it]
Merge Ngrams:  91% 58/64 [1:53:14<12:12, 122.03s/it]
Merge Ngrams: 100% 64/64 [2:05:33<00:00, 117.71s/it]
Prune Ngrams: 100% 7/7 [07:19<00:00, 62.81s/it]
Count Pieces: 5722348it [41:21:27, 3193.81s/it]
[1]    1991 killed     python train_tokenizer.py

So is there a way to limit the memory usage?
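
(Side note: the thread shows no built-in memory cap in bytepiece; the visible levers are workers, batch_size, and min_count. As a generic process-level guard, an assumed sketch rather than a bytepiece feature, one can cap the address space so the trainer fails fast with a MemoryError instead of being OOM-killed many hours in:

import resource

# Generic guard, not a bytepiece feature. The limit is per process,
# so each worker spawned by trainer.train() inherits its own cap.
soft = 800 * 1024**3  # e.g. 800GB on a 1.0TB machine (assumed value)
resource.setrlimit(resource.RLIMIT_AS, (soft, resource.RLIM_INFINITY))
)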