allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Need help on accessing the raw reddit data #168

Closed Jianxin-MNM closed 2 months ago

Jianxin-MNM commented 3 months ago

Hi,

Dolma is really fantastic work. I am currently trying to extend the data pipeline to more languages using the Reddit data. Could anyone help with:

  1. Sharing a working link or access method for the raw Reddit dataset?
  2. I have found some torrent links with .zst files from multiple archives. Could anyone share a sha256sum so that I can validate that my download completed correctly?
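
For context, the check described in item 2 could look roughly like the sketch below, assuming a reference checksum were available. The helper names `sha256_of_file` and `matches_checksum` are hypothetical and are not part of dolma. The sketch only uses Python's standard `hashlib`, and it reads the file in chunks so that large .zst archives do not need to fit in memory.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with a published hex checksum (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

The digest this produces should be in the same hex format that the `sha256sum` command-line tool prints, so a value published by someone else could be passed directly as `expected_hex`.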

Cheers!

soldni commented 2 months ago

Apologies, but we are not planning to share the raw reddit dataset.