LLaVA-VL / LLaVA-NeXT


Regarding the issue of fine-tuning LLaVA-OneVision #273

Open · haozhang1234 opened this issue 1 month ago

haozhang1234 commented 1 month ago

I encountered three problems.

1. In the script finetune_ov.sh, in addition to the usual data_path and image_folder, there is also a video_folder, which confuses me a bit. Should data_path point to the JSON of the image dataset or the JSON of the video dataset? Or should I still use the onevision.yaml file provided in the documentation (with my own mixed image/video paths)? My current guess at that YAML's format is sketched below this list.

2. I still get an error while running the code, but on inspection cuDNN does not seem to be the problem: it reports its version number and runs correctly. My versions are CUDA 12.1 and cuDNN 8.9.7. (error screenshot attached)

3. I understand what the JSON format of an image-text dataset should look like (screenshot attached), but I could not find the JSON training format for a video-text dataset. I am looking at the JSON training format of another related dataset (Video-ChatGPT) and adjusting my dataset's JSON to match it; my entry starts like this:

{
    "id": "0",
    "video": "/4T/WK/MyProjects/LLaVA-NeXT/ZH-DataSet/vidio-dataset/datasets--lmms-lab--VideoDetailCaption/TestVideos/v-6dz6tBH77I.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "
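Regarding the first problem, my guess at the mixture YAML (based on the example configs in the repo; I am not sure about the exact keys and sampling values, so please correct me if this is wrong) is roughly:

datasets:
  - json_path: path/to/image_dataset.json
    sampling_strategy: all
  - json_path: path/to/video_dataset.json
    sampling_strategy: all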

haozhang1234 commented 1 month ago

This is the training format of my video dataset JSON (screenshot attached). I want to know whether this is the correct JSON training format for video datasets. If you could provide a correct JSON training format, I would greatly appreciate it. I hope someone can answer my question. Thank you very much for your help!

amew0 commented 1 month ago

I believe the <video> tokens should be <image>, based on the tutorial (see the Video Input section).

I did that and was able to fine-tune normally (with LoRA). Let me know if you get this figured out: at inference (after merging the LoRA weights) I got gibberish output. For reference, my base model is lmms-lab/llava-onevision-qwen2-0.5b-ov.
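For context, a minimal sketch of what a single video entry would then look like under that reading of the tutorial (the path and texts are placeholders; the key point is the single <image> token even for a video):

{
    "id": "0",
    "video": "path/to/video.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nDescribe this video in detail."
        },
        {
            "from": "gpt",
            "value": "Response"
        }
    ]
}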

zhudongwork commented 1 month ago

I believe the <video> tokens should be <image>, based on the tutorial (see the Video Input section).

I did that and was able to fine-tune normally (with LoRA). Let me know if you get this figured out: at inference (after merging the LoRA weights) I got gibberish output. For reference, my base model is lmms-lab/llava-onevision-qwen2-0.5b-ov.

Have you encountered the situation where, after fine-tuning a LoRA model and loading it for inference, the responses come out as gibberish?

NicoZenith commented 3 weeks ago

What about multi-image? How should the JSON entry look, in terms of image paths and image tokens?

SVT-Yang commented 2 weeks ago

What about multi-image? How should the JSON entry look, in terms of image paths and image tokens?

Hello! Have you figured it out? I used multi-image data to fine-tune llava-ov-0.5B, but got gibberish output. Have you encountered the same situation? Thanks!

amew0 commented 2 weeks ago

What about multi-image? How should the JSON entry look, in terms of image paths and image tokens?

Even for multi-image, as per the tutorial, you need one image token (i.e. <image>), and the JSON entry should be

{
    "id": "ID",
    "image": ["path/to/image1.jpg", "path/to/image2.jpg", "/etc"],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nQuery"
        },
        {
            "from": "gpt",
            "value": "Response"
        }
    ]
}

This is because, in the trainer, _get_item checks whether it is a list or a single image.
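Roughly speaking (this is only an illustrative sketch of that check, not the repo's actual _get_item code, and the function and load_fn names are made up):

import os

def load_images(entry, image_folder, load_fn):
    # "image" may be a single path or a list of paths; the multi-image case
    # is detected simply by checking for a list.
    image_field = entry["image"]
    if isinstance(image_field, list):
        # multi-image sample: load every path in the list
        return [load_fn(os.path.join(image_folder, p)) for p in image_field]
    # single-image sample: wrap the one loaded image in a list for uniform handling
    return [load_fn(os.path.join(image_folder, image_field))]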

amew0 commented 2 weeks ago

Hello! Have you figured it out? I used multi-image data to fine-tune llava-ov-0.5B, but got gibberish output. Have you encountered the same situation? Thanks!

I had the same issue, but I noticed that when I finished training and tried to load the model, it warned me with "Some weights are randomly initialized and needs to be trained [lm_head.weights]", and the outputs were then gibberish.

For some reason the lm_head weights are not trained (not sure what setting affects this), but to solve it I had to copy the parent model's lm_head weights over:

from copy import deepcopy

# model is the fine-tuned model, m is the original base model
model.lm_head.weight = deepcopy(m.lm_head.weight)

SVT-Yang commented 2 weeks ago

What about multi-image? How should the JSON entry look, in terms of image paths and image tokens?

Even for multi-image, as per the tutorial, you need one image token (i.e. <image>), and the JSON entry should be

{
    "id": "ID",
    "image": ["path/to/image1.jpg", "path/to/image2.jpg", "/etc"],
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nQuery"
        },
        {
            "from": "gpt",
            "value": "Response"
        }
    ]
}

This is because, in the trainer, _get_item checks whether it is a list or a single image.

Thanks! In "conversations", should there be only one "\<image>\n" token? Or should the number of image tokens match the length of the image list? In the latter case it would look like this: "\<image>\nThis is the first image. Can you describe what you see?\n\nNow, let's look at another image: \<image>\nWhat's the difference between these two images?" Thanks!