allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Provenance license? #108

Closed boxabirds closed 7 months ago

boxabirds commented 7 months ago

Hi, I'm researching provenance license/consent risk for clients. The risk being managed is "risk of litigation requiring derivative works such as LLMs to be taken down as a result of copyright violation".

I can't immediately find any resources regarding dolma that address this. I can see some ways it could be addressed, for example by only crawling content that carries a clear statement of its license (such as Creative Commons).

Apologies if this was made clear somewhere!

šŸ™ in advanceā€¦

soldni commented 7 months ago

The dolma corpus is partially derived from Common Crawl; as such, it is not possible to provide license info about all documents in the dataset.