EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics
Apache License 2.0
2.16k stars, 156 forks

Deduplicated Pile dataset with Domain Attribution #137

Closed by michaelduan8 8 months ago

michaelduan8 commented 8 months ago

Hi there!

I was wondering whether there is a way to reproduce this dataset with domain attribution (that is, metadata identifying which Pile subdomain each document comes from), or whether the existing dataset at that link could be updated with domain metadata?

Thanks!