Thanks for the interest!
The easiest way is to pass the flag --mono_data_path $YOUR_MONO_DATA
to the training command, e.g.,
accelerate launch --config_file configs/deepspeed_train_config.yaml \
run_llmmt.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--mono_data_path $YOUR_MONO_DATA \
--max_steps 600000 \
--do_train \
--low_cpu_mem_usage \
--fp16 \
--learning_rate 2e-5 \
--weight_decay 0.01 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--warmup_ratio 0.01 \
--ignore_pad_token_for_loss \
--ignore_prompt_token_for_loss \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--save_strategy steps \
--save_steps 2000 \
--save_total_limit 1 \
--logging_strategy steps \
--logging_steps 1 \
--output_dir ${OUTPUT_DIR} \
--max_new_tokens 256 \
--max_source_length 256 \
--seed 42 \
--overwrite_output_dir \
--report_to none
The format of your mono data should follow the parallel translation data format, but with the English sentence left empty, e.g.,
{
    "translation": {
        "de": "mono sentence",
        "en": ""
    }
}
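For reference, a minimal sketch of producing such a file from a plain-text corpus (the file names here are hypothetical, and I'm assuming one record per line in JSON-lines style, which is what the Hugging Face `datasets` JSON loader commonly expects; check how your loader actually reads the file):

```python
import json

# Hypothetical input: one monolingual German sentence per line.
src_path = "mono.de"           # assumption: your local monolingual corpus
out_path = "train.de-en.json"  # assumption: the file passed as $YOUR_MONO_DATA

with open(src_path, encoding="utf-8") as fin, \
     open(out_path, "w", encoding="utf-8") as fout:
    for line in fin:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank lines
        # Parallel-data schema with the English side left empty, as above.
        record = {"translation": {"de": sentence, "en": ""}}
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```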
Thank you very much for the complete guide to monolingual training.
Thank you for your great work. I am preparing an experiment with ALMA for low-resource languages. In stage 1, I want to fine-tune the model on my local monolingual datasets instead of the OSCAR dataset. Should I follow the OSCAR JSON file format below? Please advise.
-km
    train_raw.json
    valid_raw.json
    test_raw.json
-th
-ja

[{"id": 0, "text": "TEXT", "lang": "LANG"}, {"id": 0, "text": "TEXT", "lang": "LANG"}, {"id": 0, "text": "TEXT", "lang": "LANG"}, ...]
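For reference, converting such OSCAR-style records into the translation format described above could look like the following minimal sketch (the paths and the `km`/`en` language keys are assumptions; adjust them to your language pair and layout):

```python
import json

# Hypothetical OSCAR-style input: a JSON array of
# {"id": ..., "text": ..., "lang": ...} records.
with open("km/train_raw.json", encoding="utf-8") as fin:
    records = json.load(fin)

# Write translation-format records with the English side left empty,
# one JSON object per line.
with open("train.km-en.json", "w", encoding="utf-8") as fout:
    for rec in records:
        out = {"translation": {"km": rec["text"], "en": ""}}
        fout.write(json.dumps(out, ensure_ascii=False) + "\n")
```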