AIRI-Institute / GENA_LM

GENA-LM is a transformer masked language model trained on human DNA sequence.
https://www.biorxiv.org/content/10.1101/2023.06.12.544594
MIT License

Out of Memory Issue During BPE Tokenizer Training with Large Multi-Species Dataset #14

Open luoshengtangxiademao opened 10 months ago

luoshengtangxiademao commented 10 months ago

Hello,

I have been trying to train a tokenizer following the code you provided, but I am encountering an out-of-memory issue. I'm working with a multi-species dataset that's several tens of GBs in size. Despite having 700GB of memory on my system, the training process for the BPE tokenizer consistently results in an out-of-memory error. Could you please share how you managed to train the BPE vocabulary on such a large multi-species dataset, plus the 1KG data? Any advice or insights would be greatly appreciated!

Thank you for your help and time.

yurakuratov commented 8 months ago

Hi!

We also encountered an OOM issue while training the tokenizer. To overcome this, we sampled 10 × 10^6 random subsequences from the whole dataset and trained the tokenizer on that subset.
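For illustration, here is a minimal sketch of that kind of subsampling, not the actual GENA-LM pipeline. It assumes Hugging Face `tokenizers`, Biopython, a hypothetical directory of FASTA files, and hypothetical values for the subsequence length and vocabulary size:

```python
# Hypothetical sketch: train BPE on a random subsample of subsequences
# instead of the full multi-species corpus, keeping memory bounded.
import random
from pathlib import Path

from Bio import SeqIO                      # Biopython, assumed available
from tokenizers import Tokenizer, models, trainers

N_SAMPLES = 10_000_000    # 10 x 10^6 subsequences, as in the comment above
SUBSEQ_LEN = 10_000       # hypothetical subsequence length

def sample_subsequences(fasta_dir, n_samples, subseq_len):
    """Yield random (possibly overlapping) subsequences from FASTA files."""
    seqs = [
        str(rec.seq).upper()
        for path in Path(fasta_dir).glob("*.fa")
        for rec in SeqIO.parse(str(path), "fasta")
    ]
    for _ in range(n_samples):
        seq = random.choice(seqs)
        if len(seq) <= subseq_len:
            yield seq
        else:
            start = random.randrange(len(seq) - subseq_len)
            yield seq[start:start + subseq_len]

# BPE over raw nucleotide strings (no whitespace, so no pre-tokenizer).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(
    vocab_size=32_000,     # hypothetical vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(
    sample_subsequences("genomes/", N_SAMPLES, SUBSEQ_LEN), trainer=trainer
)
tokenizer.save("bpe_dna_tokenizer.json")
```

The key point is that only the sampled subsequences are passed to `train_from_iterator`, so the trainer's pair statistics never have to cover the full corpus.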

a-green-hand-jack commented 7 months ago

> Hi!
>
> We also encountered an OOM issue while training the tokenizer. To overcome this, we sampled 10 × 10^6 random subsequences from the whole dataset and trained the tokenizer on that subset.

Hello! I'm wondering how you segmented the complete dataset into these subsequences. Did you allow overlaps when splitting it?

yurakuratov commented 2 weeks ago

We followed BigBird's data pipeline, so yes: sequences could overlap both when sampling from the genomic data and when subsampling for tokenizer training.