Closed gaoyixuan111 closed 5 months ago
Hi @gaoyixuan111, background_loss is designed to prevent the background from interfering with text instructions during training. It is effectively another form of predict_loss (but with the background removed). The only losses actually used are facial_loss and predict_loss (Line 270); background_loss exists for debugging convenience and is not used. As explained in the Experimental Implements section of the paper, the model removes background information with a probability of 50%.
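The loss combination described above can be sketched as follows. This is a minimal illustration, not the repository's actual code; the function and tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def training_loss(noise_pred, noise, facial_pred, facial_true, bg_mask=None):
    """Combine the two losses actually used for optimization.

    predict_loss:    standard diffusion denoising MSE.
    facial_loss:     MSE restricted to facial regions (simplified here).
    background_loss: predict_loss with the background masked out -- kept
                     only for debugging/logging, NOT added to the total.
    """
    predict_loss = F.mse_loss(noise_pred, noise)
    facial_loss = F.mse_loss(facial_pred, facial_true)

    background_loss = None
    if bg_mask is not None:
        # compare predictions only where the mask keeps the subject
        background_loss = F.mse_loss(noise_pred * bg_mask, noise * bg_mask)

    total = predict_loss + facial_loss  # background_loss intentionally excluded
    return total, background_loss
```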
@JackAILab After following the steps to prepare the dataset and running the training code train.py, I got the following warning. Can you offer some suggestions? I'm not sure whether this warning is normal or whether it will affect the model's performance: "Epoch 1/100: 0%| | 0/3 [00:00<?, ?it/s] Token indices sequence length is longer than the specified maximum sequence length for this model (106 > 77). Running this sequence through the model will result in indexing errors"
@gaoyixuan111 Hi, in some cases LLaVA generates descriptions that are too long. In actual testing, only a small percentage of facial captions are affected, so you can simply discard those facial captions.
We have updated utils.py (Line 102) to prevent errors caused by overly long facial text.
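One common way to defuse this warning is to clip the tokenized caption to the text encoder's window before it reaches the model. A minimal sketch, not the exact utils.py change; the assumption that the last token is EOS and should be preserved is illustrative:

```python
def truncate_caption_tokens(token_ids, max_length=77):
    """Clip a tokenized caption to the model's context window.

    CLIP's text encoder accepts at most 77 token positions; longer
    sequences trigger the "Token indices sequence length is longer
    than the specified maximum ..." warning and would cause indexing
    errors if passed through unclipped.
    """
    if len(token_ids) <= max_length:
        return token_ids
    eos_id = token_ids[-1]  # assumption: final token is EOS
    # keep the first max_length - 1 tokens, then force EOS at the end
    return token_ids[: max_length - 1] + [eos_id]
```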
@JackAILab Thank you for the update. I just ran the updated train.py, and the error still persists. Could you suggest other fixes? What impact would ignoring this warning have on the model?
I set max_text_length to 77 and the warning disappeared, but I am not sure whether this modification is correct.
@gaoyixuan111 Just ignore the indexing-errors warning; it is mainly caused by the position of the trigger word, and the current code logic will not actually raise this error at runtime.

In addition, you should use max_text_length > 300. If max_text_length is too small, the text falls back to item["vqa_llva"] in most cases, and such text cannot substitute facial feature images into the text embedding.
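The fallback behavior described above can be sketched roughly like this. The key names follow the thread, but the threshold semantics (character length vs. token length, exact keys) are assumptions:

```python
def choose_caption(item, max_text_length=330):
    """Pick the caption used to build the text embedding.

    If the detailed facial caption fits within max_text_length, use it;
    otherwise fall back to the plain VQA caption item["vqa_llva"],
    which drops the per-region facial descriptions.  With a small
    max_text_length (e.g. 77), almost every sample takes the fallback,
    so facial features are never injected into the text embedding.
    """
    facial_caption = item.get("facial_caption", "")
    if facial_caption and len(facial_caption) <= max_text_length:
        return facial_caption
    return item["vqa_llva"]
```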
@JackAILab I printed the parameter count of facial_encoder. Could you share how long training facial_encoder took? facial_encoder contains 105.772288 million parameters.
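A figure like 105.77 M is typically obtained with the usual PyTorch parameter count. This is a generic helper, not project code:

```python
import torch.nn as nn

def count_parameters_m(module: nn.Module) -> float:
    """Total number of parameters in a module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6
```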
In the project code, the background loss is computed using parsing_mask_lists, but the final loss is the sum of the prediction loss and the facial loss. Why isn't it the sum of all three losses? What is the purpose of setting background_loss?