Open luoshengtangxiademao opened 10 months ago
Hi!
We also encountered an OOM issue while training the tokenizer. To overcome this, we sampled 10 x 10^6 (10 million) random subsequences from the whole dataset and trained the tokenizer on those instead.
Hello! I'm wondering how you segmented the complete dataset. Did you allow overlaps when splitting it into subsequences?
We followed BigBird's data pipeline, so yes, sequences could overlap during sampling from genomic data and during subsampling for tokenization.
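For concreteness, the subsampling step described above could look like the following sketch. This is not the authors' actual code; it is a minimal, hypothetical illustration of drawing a fixed number of random (possibly overlapping) subsequences from a collection of long sequences, which could then be fed to a BPE trainer (e.g. via `train_from_iterator` in HuggingFace `tokenizers`). The function name, subsequence length, and seed are assumptions.

```python
import random

def sample_subsequences(sequences, n_samples, sub_len, seed=0):
    """Draw n_samples random subsequences of length sub_len from a list of
    long sequences. Overlaps between samples are allowed, matching the
    subsampling strategy described in this thread. (Hypothetical sketch,
    not the authors' pipeline.)"""
    rng = random.Random(seed)
    # Keep only sequences long enough to yield a full-length subsequence.
    pool = [s for s in sequences if len(s) >= sub_len]
    samples = []
    for _ in range(n_samples):
        seq = rng.choice(pool)
        start = rng.randrange(len(seq) - sub_len + 1)
        samples.append(seq[start:start + sub_len])
    return samples
```

Training on a bounded number of fixed-length samples like this keeps the tokenizer trainer's memory footprint constant regardless of the full corpus size, which is the point of the workaround.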
Hello,
I have been trying to train a tokenizer following the code you provided, but I keep hitting an out-of-memory (OOM) error. I'm working with a multi-species dataset that is several tens of GB in size. Despite having 700 GB of RAM on my system, training the BPE tokenizer consistently runs out of memory. Could you please share how you managed to train the BPE vocabulary on such a large multi-species dataset, plus the 1KG data? Any advice or insights would be greatly appreciated!
Thank you for your help and time.