bigscience-workshop / multilingual-modeling

BLOOM+1: Adapting BLOOM model to support a new unseen language
https://arxiv.org/abs/2212.09535
Apache License 2.0

Unable to train Burmese tokenizer for training_size = 100,000 #18

Closed · yongzx closed this 2 years ago

haileyschoelkopf commented 2 years ago

Have you been able to replicate this? I've had it happen before, but just trying again later, it went much faster and finished quickly.

yongzx commented 2 years ago

> I've had it happen before, but just trying again later, it went much faster and finished quickly.

Which language was this? I am debugging it right now, and have been able to replicate this bug consistently.

haileyschoelkopf commented 2 years ago

This was with Thai. I’ll try with Burmese.

haileyschoelkopf commented 2 years ago

I ran this with Burmese, extending the vocab, with 100k OSCAR samples and a 24k vocab size.
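
For reference, a minimal sketch of that setup, assuming the usual HuggingFace `datasets`/`transformers` workflow (the OSCAR config name, checkpoint name, and slice below are illustrative, not copied from the repo's scripts):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# ~100k Burmese samples from OSCAR ("my" is the Burmese subset; the exact
# config name and slice are assumptions, adjust to what your script uses).
dataset = load_dataset("oscar", "unshuffled_deduplicated_my", split="train[:100000]")

def batch_iterator(batch_size=1000):
    # Yield batches of text so the trainer never needs the whole list at once.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Learn a 24k-entry vocab on Burmese, starting from a BLOOM tokenizer
# (checkpoint name is illustrative).
base = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
my_tok = base.train_new_from_iterator(batch_iterator(), vocab_size=24_000)

# "Extend vocab": add the newly learned tokens on top of the original vocab.
new_tokens = [t for t in my_tok.get_vocab() if t not in base.get_vocab()]
base.add_tokens(new_tokens)
base.save_pretrained("bloom-extended-my")
```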

It hung for a while after it finished counting pairs, though. Also, I checked and I think this script was using 100 GB+ (VIRT in top), so maybe the bug on your end is that you're running out of CPU memory or RAM.
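
If it helps to confirm the memory hypothesis, a quick generic way to watch the process from inside the script (this `psutil` logger is a sketch, not something from the repo):

```python
import os
import psutil

def log_memory(tag: str) -> None:
    # rss is resident memory; vms corresponds to the VIRT column in top.
    info = psutil.Process(os.getpid()).memory_info()
    print(f"[{tag}] RSS: {info.rss / 1e9:.1f} GB | VIRT: {info.vms / 1e9:.1f} GB")

log_memory("before tokenizer training")
# ... run the tokenizer training step here ...
log_memory("after tokenizer training")
```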

yongzx commented 2 years ago

Thanks Hailey! Yep, it's a memory issue.
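
For anyone who hits this later: one possible workaround is to avoid materializing all 100k samples at once by streaming OSCAR and feeding the trainer a generator. This is a sketch assuming the `datasets` streaming API; it bounds the raw-corpus memory, though the pair-counting stage itself can still be large:

```python
from itertools import islice
from datasets import load_dataset

# Stream OSCAR so the raw corpus is never fully loaded into RAM
# (config name is an assumption; match it to your setup).
stream = load_dataset(
    "oscar", "unshuffled_deduplicated_my", split="train", streaming=True
)

def text_batches(n_samples=100_000, batch_size=1_000):
    batch = []
    for example in islice(stream, n_samples):
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# The generator can then be passed to train_new_from_iterator as above.
```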