18907305772 / FuseAI

FuseAI Project
https://huggingface.co/FuseAI

Purpose of the "Split long text" step #12

Closed ZLKong closed 5 months ago

ZLKong commented 5 months ago

Hi,

Thank you for sharing your work! I have a question: what is the motivation or reason behind the first step?

  1. Split long text
    python ./src/utils/split_long_text.py \
    --base_model_name_or_path "<path_to_llama_2_7b>" \
    --blending_model_name_or_path "<path_to_open_llama_7b_v2>" \
    --another_blending_model_name_or_path "<path_to_mpt_7b>" \
    --dataset "<path_to_minipile>" \
    --dataset_save_dir "<path_to_minipile_split>" \
    --cache_dir "<path_to_cache_dir>" \
    --block_size 2048 \
    --preprocessing_num_workers 80

Why is it necessary to load all three models when splitting the dataset? This part is not mentioned in the paper. Could you please provide some references?

Additionally, is it required to start from the first step when fusing with a new model?

Thanks!

18907305772 commented 5 months ago

Since different LLMs have different tokenizers, the lengths of their tokenized sequences differ. In this Python script, for each text in the training corpus, we use the tokenized sequence with the maximum length to decide how to split the long text, so that every chunk fits within the block size for all of the models.
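
For intuition, here is a minimal sketch of the idea (a simplified stand-in, not the actual split_long_text.py; the model paths are placeholders and the character-based chunking heuristic is an assumption for illustration):

    # Sketch: split text so each chunk stays within BLOCK_SIZE tokens
    # under every model's tokenizer (simplified illustration).
    from transformers import AutoTokenizer

    BLOCK_SIZE = 2048

    # Placeholder paths, matching the three models in the command above.
    tokenizers = [
        AutoTokenizer.from_pretrained(p)
        for p in ("<path_to_llama_2_7b>",
                  "<path_to_open_llama_7b_v2>",
                  "<path_to_mpt_7b>")
    ]

    def split_text(text: str) -> list[str]:
        # Tokenize with each tokenizer; the longest sequence is the
        # binding constraint for the split.
        max_len = max(len(tok(text)["input_ids"]) for tok in tokenizers)
        if max_len <= BLOCK_SIZE:
            return [text]
        # Split into roughly equal character spans, one per block
        # needed by the most verbose tokenizer (a crude heuristic;
        # the real script works on token boundaries).
        n_chunks = -(-max_len // BLOCK_SIZE)   # ceil division
        span = -(-len(text) // n_chunks)
        return [text[i : i + span] for i in range(0, len(text), span)]

The key point is that the split must be driven by the maximum tokenized length across all source models, which is why all three tokenizers are loaded.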

If you need to fuse a new model using FuseLLM, it is recommended to start from the first step. To make fusing a new model more adaptable, we have updated the method in FuseChat (it can be used to fuse foundation LLMs, not only chat LLMs). Please refer to the FuseChat paper or README.md for more details.

Thank you!

ZLKong commented 5 months ago

Thank you very much!