centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License

Add scripts for converting MeMo #245

Closed TTTTao725 closed 3 months ago

TTTTao725 commented 4 months ago

Prerequisite:

```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1
```

TTTTao725 commented 4 months ago

Hi guys, I've reworked the commit to make it more readable; please check it out :)

TTTTao725 commented 3 months ago

Hi guys, please check out the scripts for Infomedia (converting and filtering).

The unfiltered and filtered datasets, plus the list of empty files in the original dataset, are all here: dfm-data/pre-training/danews2.0
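For reference, the filtering pass looks roughly like this (a minimal sketch; the directory layout and the `text` field name are assumptions, not the actual script):

```python
import json
from pathlib import Path

def filter_jsonl(src: Path, dst: Path) -> bool:
    """Copy documents with non-empty text from src to dst; return True if any were kept."""
    kept = 0
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            doc = json.loads(line)
            if doc.get("text", "").strip():  # drop documents with no body text
                fout.write(line)
                kept += 1
    return kept > 0

unfiltered = Path("danews2.0/unfiltered")   # illustrative paths
filtered = Path("danews2.0/filtered")
filtered.mkdir(parents=True, exist_ok=True)

empty_files = []
for src in sorted(unfiltered.glob("*.jsonl")):
    if not filter_jsonl(src, filtered / src.name):
        empty_files.append(src.name)

# Keep a record of input files that contained no usable documents at all
Path("danews2.0/empty_files.txt").write_text("\n".join(empty_files) + "\n")
```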

KennethEnevoldsen commented 3 months ago

@TTTTao725 what is the number of tokens without information in Infomedia?

TTTTao725 commented 3 months ago

6,578,205,491 / 6,753,608,085 tokens after filtering

KennethEnevoldsen commented 3 months ago

Why the slash?

TTTTao725 commented 3 months ago

The number after the slash is just the count before filtering :)

KennethEnevoldsen commented 3 months ago

so Infomedia is ~175m tokens

TTTTao725 commented 3 months ago

Nope, it's 6,578,205,491; we still have a lot left after filtering.

TTTTao725 commented 3 months ago

Sorry for the confusion, I like to write it like a progress bar 😹

TTTTao725 commented 3 months ago

Oh, I forgot to say: the count does not include titles, summaries, etc., just the body text. So basically I just sum over the 'WordCount' field instead of using a tokenizer, which would be super slow.
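Roughly like this (a minimal sketch, assuming the filtered articles are JSONL files with a per-document 'WordCount' field; the path is illustrative):

```python
import json
from pathlib import Path

total_words = 0
for path in Path("danews2.0/filtered").glob("*.jsonl"):
    with path.open() as f:
        for line in f:
            # Use the precomputed per-article 'WordCount' field
            # rather than tokenizing the text ourselves.
            total_words += json.loads(line)["WordCount"]

print(f"{total_words:,} words in total")
```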

TTTTao725 commented 3 months ago

So it's actually more than 6,578,205,491 :)

KennethEnevoldsen commented 3 months ago

Ah, so the tokens without information come to ~175m
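That figure is just the difference between the two counts quoted above (plain Python, for illustration):

```python
before = 6_753_608_085  # tokens before filtering
after = 6_578_205_491   # tokens after filtering (body text only)
removed = before - after
print(f"{removed:,} tokens removed ({removed / before:.1%} of the corpus)")
# -> 175,402,594 tokens removed (2.6% of the corpus)
```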

KennethEnevoldsen commented 3 months ago

@TTTTao725 feel free to merge whenever you are ready

TTTTao725 commented 3 months ago

I can't merge it, Kenneth, merging is blocked (screenshot attached)

KennethEnevoldsen commented 3 months ago

just check the box and merge