18907305772 / FuseAI

FuseAI Project
https://huggingface.co/FuseAI
76 stars 34 forks

minipile_split issue #8

Closed: Arbor334 closed this issue 8 months ago

Arbor334 commented 8 months ago

When I changed the path in split_long_text.py to my own directory, I got the following output (screenshot: 2024-03-09-09-50-06-image).

Is this right?

18907305772 commented 8 months ago

Hello, @Arbor334. It's correct to tokenize the long sequence and split it into short chunks.
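
For readers following along, here is a minimal sketch of the idea of tokenizing a long sequence and splitting it into short chunks. It is not taken from split_long_text.py: the names `split_into_chunks` and `toy_tokenize` are hypothetical, and the whitespace tokenizer is only a stand-in for whichever model tokenizer the real script uses.

```python
from typing import Dict, Iterator, List, Sequence


def split_into_chunks(token_ids: Sequence[int], max_len: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most max_len token ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, len(token_ids), max_len):
        yield list(token_ids[start:start + max_len])


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word."""
    vocab: Dict[str, int] = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


# A 7-token "document" split into chunks of at most 3 tokens.
ids = toy_tokenize("a b c d e f g")
chunks = list(split_into_chunks(ids, max_len=3))
```

The last chunk may be shorter than `max_len`. Whether to keep, pad, or drop that remainder is a design choice this sketch leaves open.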

Arbor334 commented 8 months ago

[image] Thanks, @18907305772. Once the dataset splitting above is complete, I will run the different scripts below, which correspond to llama, open_llama, and mpt, respectively. [image]

18907305772 commented 8 months ago

That's right. It is recommended to test the code on a small subset before running it on the entire dataset.
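
One possible way to do such a dry run is to cap the number of records read before handing them to the processing code. The sketch below is an assumption for illustration, not part of the repository: `take_subset` and `fake_dataset` are hypothetical names, and the generator stands in for the real dataset.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take_subset(records: Iterable[T], limit: int) -> List[T]:
    """Return at most `limit` records without reading the rest of the input."""
    return list(islice(records, limit))


def fake_dataset() -> Iterator[str]:
    """Hypothetical stand-in for a large dataset, produced lazily."""
    for i in range(1_000_000):
        yield f"document {i}"


# Only the first 5 records are ever generated.
sample = take_subset(fake_dataset(), limit=5)
```

Because `islice` stops consuming the iterator at the limit, the full dataset is never loaded during the test run.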

Arbor334 commented 8 months ago

Thank you, @18907305772. I will follow your suggestion.