allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Inquiry about Web Pipeline Availability #151

Open codefly13 opened 4 months ago

codefly13 commented 4 months ago

I hope you are doing well. I came across a reference to the "Web Pipeline" in the paper "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research" and I am very interested in exploring it further. However, the pipeline appears to be still in preparation. Is there any information on when it might be released for public use?

dumitrac commented 4 months ago

Hi @codefly13 - all of it is already available in the dolma toolkit (i.e. this repo). Please let me know if you're looking for something different.

OxxoCodes commented 2 months ago

@dumitrac I'm interested in this as well. I'd like to use the Dolma toolkit to filter CC data (which I assume is what @codefly13 was trying to do as well). However, I don't see an example of how to do this in the repo, and the following pipeline is marked as WIP: https://github.com/allenai/dolma/tree/main/sources/cc_warc

I'm very new to Dolma so there's a good chance I'm just missing something. Would appreciate some pointers. Thanks!