Open jeykigung opened 9 months ago
Hi, all the extras will be available in a few weeks, along with The Stack v2's content.
Hi, any updates for the-stack-v2-train-extras?
@loubnabnl Any updates?
Hi, any updates?
@loubnabnl @bigximik @anton-l @iNeil77 @lvwerra Hi, any updates?
@loubnabnl Hi, any updates?
@loubnabnl Hi, any updates?
Hello, is this still on your release schedule?
Hi! Any update?
Still no updates? BigCode doesn't seem to be very active these days... I checked the datasets used to train StarCoder2. Most had already been released before this project, like arXiv, LHQ, Wiki, etc. Interestingly, they appear to have been uploaded but not made publicly available. What is really missing is the processed StackOverflow dataset.
Edit: that may be incorrect. I guess this is the dataset they used: https://huggingface.co/datasets/bigcode/stack-exchange-preferences-20230914-clean-anonymization
Thanks for your wonderful work! In https://huggingface.co/datasets/bigcode/the-stack-v2-dedup, I can only find the the-stack-v2-train-smol and the-stack-v2-train-full data. Where can I find the the-stack-v2-train-extras and LHQ datasets? Do you plan to release them?