Closed TTTTao725 closed 3 months ago
Hi guys, I changed the commit to make it more readable, please check it out :)
Hi guys, please check out the scripts for infomedia (converting and filtering)
The unfiltered and filtered datasets, plus the list of empty files in the original dataset, are all here: dfm-data/pre-training/danews2.0
@TTTTao725 what is the number of tokens without information in infomedia?
6578205491 / 6753608085 tokens after filtering
Why the slash?
that's just the number before filtering :)
so Infomedia is ~175m tokens
Nope, it's 6578205491, we still got a lot after filtering
Sorry for the confusion, I like to write it like a progress bar 😹
Oh, I forgot to say: it does not include titles, summaries, etc., just the text. So basically I just sum over the 'WordCount' field instead of using a tokenizer, which would be super slow
So it's actually more than 6578205491 :)
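For anyone reading along, the counting approach described above could be sketched roughly like this. It is a minimal sketch, not the actual script from this PR: it assumes the articles are stored as JSON Lines (one record per line), the function name `sum_word_counts` and the `BodyText` field are hypothetical, and only the 'WordCount' field name comes from this thread.

```python
import io
import json
from typing import IO


def sum_word_counts(lines: IO[str], field: str = "WordCount") -> int:
    """Sum a precomputed per-article word count instead of tokenizing the text."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: a missing or null count is treated as 0.
        total += int(record.get(field) or 0)
    return total


# Tiny in-memory stand-in for a real file (hypothetical records).
sample = io.StringIO(
    '{"WordCount": 120, "BodyText": "first article"}\n'
    '{"WordCount": 80, "BodyText": "second article"}\n'
    '{"BodyText": "record without a count"}\n'
)
print(sum_word_counts(sample))  # 200
```

Since only the body text is counted, a total computed this way is a lower bound on the real token count.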
Ah so information is ~175m tokens
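The ~175m figure matches the difference between the two counts quoted earlier in this thread. A quick check using only those two numbers (the variable names are just for illustration):

```python
# Word counts quoted in this thread (sums of 'WordCount', not tokenizer output).
before_filtering = 6_753_608_085
after_filtering = 6_578_205_491

removed = before_filtering - after_filtering
print(removed)  # 175402594

# Share of the unfiltered total that was removed by filtering.
share_removed = removed / before_filtering
print(f"{share_removed:.2%}")
```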
@TTTTao725 feel free to merge whenever you are ready
I can't merge it Kenneth, merging is blocked
just check the box and merge
Prerequisite:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1