Closed Arbor334 closed 8 months ago
Hello, @Arbor334. It's correct to tokenize the long sequence and split it into short chunks.
thanks @18907305772 ,If the above splitting of the data set is completed, I will implement the different scripts below, corresponding to llama
, open_llama
mpt
, respectively.
That's right. It is recommended to test the code on a small subset before running it on the entire dataset.
Thank you @18907305772 , I will adopt your suggestion
When I change the path in split_long_text.py to my own directory, i got
is it right?