AnswerDotAI / bert24


Current progress of dataset creation #72

Closed orionw closed 3 months ago

orionw commented 3 months ago

Changes

Making an initial PR, cc @bclavie @warner-benjamin

The initial scripts work, and I believe the later scripts do too, but I've only tested them on small-scale data, so I'm not 100% sure. The README has more technical details, but here is the plan: create MDS datasets from HF and then sample them into our dataset.
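For reference, the HF-to-MDS conversion step can look roughly like the sketch below, using `MDSWriter` from mosaicml-streaming. The dataset name, output path, and schema here are placeholders, not necessarily what the scripts in this PR use:

```python
from datasets import load_dataset
from streaming import MDSWriter

# Placeholder source dataset; the real scripts take this as an argument.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# MDS needs an explicit column schema; text-only for now.
columns = {"text": "str"}

with MDSWriter(out="mds/c4-en", columns=columns, compression="zstd") as writer:
    for sample in dataset:
        writer.write({"text": sample["text"]})
```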

This assumes a few important things:

- That someone on the modeling side knows how to set a stop at a given number of tokens threshold @NohTow @warner-benjamin. These will be close to 20B tokens but not quite, since it's an instance-level approximation.

Tests

I don't think we have data-side tests yet, so I'm ignoring them for now.

orionw commented 3 months ago

Now that someone is using the GPUs (which is great, but also uses RAM/CPUs), the source-stats calculation is significantly slower. It's estimated to take the full day, so I'll start sampling tomorrow.

NohTow commented 3 months ago

> That someone on the modeling side knows how to set a stop at a given number of tokens threshold @NohTow @warner-benjamin. These will be close to 20B tokens but not quite, since it's an instance-level approximation.

Yes, we can set the number of tokens in the config. Adapting the token-counting function to better account for padding, as suggested by @warner-benjamin, would be helpful.
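For what it's worth, a padding-aware count typically sums the attention mask rather than using the tensor shape. A minimal sketch with illustrative names (not the repo's actual function):

```python
import torch

def count_real_tokens(attention_mask: torch.Tensor) -> int:
    """Count non-padding tokens in a batch.

    input_ids.numel() overcounts by including padding positions;
    the attention mask is 1 for real tokens and 0 for padding.
    """
    return int(attention_mask.sum().item())
```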

> Do we need them tokenized? That makes it much slower and much more disk-intensive. I was planning to upload text-only for now unless we can get other machines.

Data ablations are cheap, so we can go with the raw text and only do the pre-tokenization for the big runs.
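Going with raw text would mean tokenizing on the fly, e.g. in the collate function. A sketch under that assumption, with the tokenizer name and sequence length as placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder

def collate_raw_text(batch):
    """Tokenize raw-text samples at batch time; cheap enough for ablation runs."""
    return tokenizer(
        [sample["text"] for sample in batch],
        truncation=True,
        max_length=512,  # placeholder sequence length
        padding=True,
        return_tensors="pt",
    )
```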

orionw commented 3 months ago

Per our conversation @warner-benjamin, I think this completes this work in progress. We can now sample any dataset from Hugging Face into our dataset.
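As one illustration of what the downstream sampling can look like (not necessarily how this PR's scripts do it), mosaicml-streaming supports weighted mixing of MDS sources via `Stream`:

```python
from streaming import Stream, StreamingDataset

# Placeholder local paths and mixing proportions.
streams = [
    Stream(local="mds/c4-en", proportion=0.6),
    Stream(local="mds/wiki", proportion=0.4),
]
dataset = StreamingDataset(streams=streams, shuffle=True, batch_size=32)
```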