bigcode-project / the-stack-v2

Code for the curation of The Stack v2 and StarCoder2 training data
Apache License 2.0
91 stars 6 forks

Where can I find the the-stack-v2-train-extras and LHQ datasets? #7

Open jeykigung opened 9 months ago

jeykigung commented 9 months ago

Thanks for your wonderful work! In https://huggingface.co/datasets/bigcode/the-stack-v2-dedup, I can only find the the-stack-v2-train-smol and the-stack-v2-train-full data. Where can I find the the-stack-v2-train-extras and LHQ datasets? Do you plan to release them?

loubnabnl commented 9 months ago

Hi, all the extras will be available in a few weeks, along with The Stack v2's content.

ShaneTian commented 7 months ago

Hi, any updates for the-stack-v2-train-extras?

ShaneTian commented 7 months ago

@loubnabnl Any updates?

Casi11as commented 6 months ago

Hi, any updates?

ShaneTian commented 6 months ago

@loubnabnl @bigximik @anton-l @iNeil77 @lvwerra Hi, any updates?

noforit commented 5 months ago

@loubnabnl Hi, any updates?

fghccv commented 5 months ago

@loubnabnl Hi, any updates?

twelveand0 commented 4 months ago

Hello, is this still on your release schedule?

takiholadi commented 2 months ago

Hi! Any update?

yucc-leon commented 3 weeks ago

Still no updates? BigCode doesn't seem to be very active these days... I checked the datasets used to train StarCoder2. Most of them, such as arXiv, LHQ, and Wiki, had already been released before this project. Interestingly, they have been uploaded but are not publicly available. What is really missing is the processed StackOverflow dataset.

Edit: the above may be incorrect. I guess this is the dataset they used: https://huggingface.co/datasets/bigcode/stack-exchange-preferences-20230914-clean-anonymization