xuxiaoang opened this issue 4 months ago
Hi, is your training dataset very small? Maybe you can try using a larger one?
Hi, thank you for your reply.
I changed the dataset to the whole metadata.jsonl in part 1 of the AnyInstruct dataset, but there are still issues.
While debugging, I found that the preprocess method in anygpt/src/train/stage2_sft.py masks all tokens in targets with IGNORE_TOKEN_ID and returns them as labels, as shown below:
I noticed the comment on line 248 of the source code: "Mask targets. Only compute loss on the assistant outputs". Does this mean that the anygpt_system_prompt part and the user_message part need to be masked, and only the "anygpt_message" part should remain? I personally think there are some minor bugs in the token-masking part of the preprocess method.
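For context, this is roughly what such masking usually looks like in instruction-tuning code. This is a minimal sketch only: IGNORE_TOKEN_ID comes from the discussion above, but the function name build_labels and the assistant_spans bookkeeping are invented for illustration, not the actual preprocess implementation:

```python
IGNORE_TOKEN_ID = -100  # sentinel the loss function skips, as discussed above

def build_labels(input_ids, assistant_spans):
    """Mask everything except the assistant outputs.

    `assistant_spans` is a hypothetical list of (start, end) index pairs
    covering the anygpt_message tokens; the system prompt and user_message
    positions stay as IGNORE_TOKEN_ID so no loss is computed on them.
    """
    labels = [IGNORE_TOKEN_ID] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels
```

If preprocess never un-masks the assistant span, every label stays IGNORE_TOKEN_ID, which would explain the behavior described above.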
By the way, could you explain why the user_message part needs to be masked? Is this based on rules or experience? What happens if the user_message part is not masked?
Looking forward to your reply.
Thanks.
Does this mean that the anygpt_system_prompt part and the user_message part need to be masked, and only the "anygpt_message" part should remain?
Yes, it does.
I think this code works fine on my data. Ideally, except for the model's response, the targets corresponding to all other tokens are set to -100, which means no loss is computed on them. We do this because it is common practice in instruction fine-tuning, but we also tried skipping the masking and computing the loss on the entire sequence directly, and I don't think there was much difference.
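For concreteness, here is a toy illustration (not AnyGPT code; the vocabulary size is made up) of how that -100 sentinel behaves with PyTorch's cross-entropy:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 32000)                    # 5 positions, toy vocab
labels = torch.tensor([-100, -100, 7, 42, -100])  # only 2 supervised tokens

# ignore_index defaults to -100, so the loss is averaged over the 2 unmasked
# positions only; if every label were -100 the mean would be 0/0, i.e. NaN.
loss = F.cross_entropy(logits, labels)
```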
Hello! Thank you for your work on this MLLM. I have a fine-tuning bug that I couldn't fix: when I run the stage2_sft.sh script and train with speech_conv_datasets only, the logger shows that the train loss is 0 the whole time and the eval loss is NaN, as shown in the figure.
The command in stage2_sft.sh is as follows:
I'm using the following Python environment:
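A train loss stuck at 0 together with a NaN eval loss is consistent with every label in a batch being masked. One quick thing worth checking, shown as a debugging sketch only (check_supervision is a hypothetical helper, not part of the AnyGPT repo), is how many supervised tokens preprocess actually produces:

```python
import torch

IGNORE_TOKEN_ID = -100  # same sentinel discussed above

def check_supervision(labels: torch.Tensor) -> int:
    """Count label positions that actually contribute to the loss.

    If this returns 0 for every batch, no gradient flows and the eval
    cross-entropy is NaN, matching the symptom described above.
    """
    return (labels != IGNORE_TOKEN_ID).sum().item()

# example: a batch where everything was masked reproduces the broken setup
labels = torch.full((2, 16), IGNORE_TOKEN_ID)
assert check_supervision(labels) == 0
```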