bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

Create the Stack 1.2 dataset #24

Closed: RaymondLi0 closed this issue 1 year ago

RaymondLi0 commented 1 year ago

Dedup dataset: https://huggingface.co/datasets/bigcode/the-stack-dedup
Filtered and decontaminated dataset: https://huggingface.co/datasets/bigcode/the-stack-march

To which we add: