albertan017 / LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models
https://arxiv.org/abs/2403.05286
MIT License
3.19k stars 233 forks

How is the dataset used for training? #23

Open Pisces032 opened 3 months ago

Pisces032 commented 3 months ago

I'm trying to use PEFT to improve the model, and I wonder how AnghaBench_compile.jsonl is used for training. I noticed `declare -a dataset=( "path_to_llm4decompile_data/arrow/part-00000" )` in run_llm4decompile_train.sh, but I can't work out the training process from that. Maybe the ColossalAI format hides some details about the model or the training process? Thank you so much!
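
For reference, here is roughly the PEFT setup I have in mind; the model path and the "input"/"output" field names below are just placeholders on my end, not taken from the repo:

```python
# Rough LoRA sketch with the Hugging Face peft library.
# The model path and the "input"/"output" field names are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path_to_llm4decompile_model"  # placeholder, not an official checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Attach LoRA adapters so only a small set of parameters is trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# AnghaBench_compile.jsonl is assumed to hold one JSON object per line.
dataset = load_dataset("json", data_files="AnghaBench_compile.jsonl", split="train")
```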

rocky-lq commented 3 months ago

Thanks for your interest in our project. We've updated the guidance for preparing the Colossal-AI training data; please refer to Prepare the data.
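
For illustration only, the general idea looks roughly like the sketch below, which assumes the Hugging Face datasets library; the linked guide is the authoritative reference, and the field names and paths here are placeholders:

```python
# Illustrative only: tokenize the JSONL and save it as an arrow dataset on disk.
# The repo's documented prepare-data steps are authoritative; keys and paths are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path_to_llm4decompile_model")  # placeholder

raw = load_dataset("json", data_files="AnghaBench_compile.jsonl", split="train")

def tokenize(example):
    # "input" is an assumed field name; the actual JSONL keys may differ.
    return tokenizer(example["input"], truncation=True, max_length=4096)

tokenized = raw.map(tokenize, remove_columns=raw.column_names)

# The result is a directory the training script can point at, e.g.
# path_to_llm4decompile_data/arrow/part-00000 in run_llm4decompile_train.sh.
tokenized.save_to_disk("path_to_llm4decompile_data/arrow/part-00000")
```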

Additionally, we recommend using LLaMA Factory to train the llm4decompile model, as it is more user-friendly. For more details, please visit LLaMA-Factory.
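
As a rough, unofficial sketch, converting the JSONL into an alpaca-style file that LLaMA-Factory can register might look like the following; check the LLaMA-Factory docs for the exact dataset_info.json schema, and note that the "input"/"output" keys and the instruction text are assumptions:

```python
# Unofficial sketch: convert AnghaBench_compile.jsonl into an alpaca-style JSON file
# that can be registered in LLaMA-Factory's data/dataset_info.json.
# The "input"/"output" keys in the source JSONL are assumptions.
import json

records = []
with open("AnghaBench_compile.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        records.append({
            "instruction": "Decompile the following assembly into C source code.",
            "input": ex.get("input", ""),    # e.g. the compiled/assembly side
            "output": ex.get("output", ""),  # e.g. the original C function
        })

with open("anghabench_decompile.json", "w") as f:
    json.dump(records, f, ensure_ascii=False)

# Then add an entry such as:
#   "anghabench_decompile": {"file_name": "anghabench_decompile.json"}
# to LLaMA-Factory's data/dataset_info.json and reference it in the training config.
```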

Pisces032 commented 3 months ago

Thank you!