
18907305772 / FuseAI

FuseAI Project

https://huggingface.co/FuseAI


minipile_split issue #8

Closed Arbor334 closed 6 months ago

Arbor334 commented 6 months ago

When I change the path in split_long_text.py to my own directory, I get the following output:

[screenshot: 2024-03-09-09-50-06-image]

Is this right?

18907305772 commented 6 months ago

Hello, @Arbor334. Yes, that is expected: the script tokenizes each long sequence and splits it into short chunks.
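
For readers following along, here is a minimal sketch of the tokenize-then-chunk idea. It is not the actual split_long_text.py: `toy_tokenize`, `split_into_chunks`, and the `max_len` parameter are hypothetical names, and the whitespace tokenizer is only a stand-in for a real model tokenizer.

```python
from typing import Callable, Dict, List


def split_into_chunks(token_ids: List[int], max_len: int) -> List[List[int]]:
    """Split one token sequence into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one integer id per whitespace-separated word."""
    vocab: Dict[str, int] = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def split_long_text(text: str,
                    tokenize: Callable[[str], List[int]],
                    max_len: int) -> List[List[int]]:
    """Tokenize a long text, then cut the token ids into short chunks."""
    return split_into_chunks(tokenize(text), max_len)


chunks = split_long_text("a b c d e f g", toy_tokenize, max_len=3)
print([len(c) for c in chunks])  # prints [3, 3, 1]
```

The last chunk is simply shorter than `max_len`; whether the real script pads, drops, or keeps such a remainder is not stated in this thread.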

Arbor334 commented 6 months ago

Thanks, @18907305772. Once the dataset splitting above is complete, I will run the separate scripts below for llama, open_llama, and mpt, respectively.

18907305772 commented 6 months ago

That's right. It is recommended to test the code on a small subset before running it on the entire dataset.
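
One way to do such a dry run is sketched below, assuming the data can be read as an iterable of text records. `take_subset`, `process`, and `smoke_test` are hypothetical helpers, and `process` is only a placeholder for the real per-record work.

```python
from itertools import islice
from typing import Iterable, List, Tuple


def take_subset(records: Iterable[str], n: int) -> List[str]:
    """Read only the first n records, without loading the full dataset."""
    return list(islice(records, n))


def process(text: str, max_len: int = 4) -> List[List[str]]:
    """Placeholder for the real step (assumption): split words into chunks."""
    words = text.split()
    return [words[i:i + max_len] for i in range(0, len(words), max_len)]


def smoke_test(records: Iterable[str], n: int, max_len: int = 4) -> Tuple[int, int]:
    """Run the pipeline on n records and check basic invariants of the output."""
    subset = take_subset(records, n)
    total_chunks = 0
    for text in subset:
        chunks = process(text, max_len)
        # Every chunk must be non-empty and within the length limit.
        assert all(0 < len(c) <= max_len for c in chunks)
        total_chunks += len(chunks)
    return len(subset), total_chunks


# A lazy generator stands in for a large dataset; only 5 records are consumed.
dataset = (f"doc {i} " + "word " * 10 for i in range(1_000_000))
print(smoke_test(dataset, n=5))  # prints (5, 15)
```

If the invariants hold on the subset, the same function can be pointed at the full data.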

Arbor334 commented 6 months ago

Thank you, @18907305772. I will follow your suggestion.