RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Do we need to crop the HiREST videos? #10

Closed: yeliudev closed this issue 6 months ago

yeliudev commented 7 months ago

Hi @RenShuhuai-Andy, thanks for sharing this great work! For some videos in the HiREST dataset, the filenames look like "xxxx_35_79.mp4". Do we need to crop the original videos according to the timestamps in the filename (e.g., cropping from 35s to 79s in this case)?

RenShuhuai-Andy commented 7 months ago

Hi, thanks for your interest.

Yes, for the HiREST_step task in TimeIT (instruct_action_0.5k_hirest.json), we need to crop the videos into clips. Details can be found in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-hirest.
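
For illustration, cropping a clip such as `xxxx_35_79.mp4` out of its source video might look like the sketch below. This is not the official preprocessing script from DATA.md; it assumes ffmpeg is installed and that the filename encodes the start/end seconds, and the directory layout is made up for the example.

```python
import subprocess
from pathlib import Path

def crop_hirest_clip(src_dir: Path, dst_dir: Path, clip_name: str) -> None:
    """Cut a clip such as 'xxxx_35_79.mp4' out of the full video 'xxxx.mp4'."""
    video_id, start, end = clip_name[:-len(".mp4")].rsplit("_", 2)  # 'xxxx', '35', '79'
    dst_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src_dir / f"{video_id}.mp4"),
         "-ss", start, "-to", end,                  # keep only the 35s-79s segment
         "-c:v", "libx264", "-c:a", "aac",
         str(dst_dir / clip_name)],
        check=True,
    )

# crop_hirest_clip(Path("hirest/raw"), Path("hirest/clips"), "xxxx_35_79.mp4")
```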

yeliudev commented 7 months ago

> Hi, thanks for your interest.
>
> Yes, for the HiREST_step task in TimeIT (instruct_action_0.5k_hirest.json), we need to crop the videos into clips. Details can be found in https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-hirest.

Many thanks for your reply! For the VATEX videos from Valley, are the videos cropped according to the filenames as well?

yeliudev commented 7 months ago

Also, using the provided checkpoint, the evaluation results on Charades-STA are different from those reported in Table 2 of the paper. Below are my reproduced results.

# pred video timestamps 3720; # gt video timestamps 3720
IOU 0.3: 47.33870967741935
IOU 0.5: 28.091397849462364
IOU 0.7: 12.82258064516129

Is it because the released model differs from the one reported in the paper (it seems that the paper version is trained on TimeIT only, while the released one is also trained on Valley), or were the results in the paper obtained with ASR captions on Charades-STA? If so, would it be possible to share the code for obtaining ASR captions (the files in whisper_outputs_with_time/tiny.en.cleaned/)? Thank you!
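
(For reference, R@IoU numbers like the above are usually computed as the percentage of test queries whose predicted span overlaps the ground-truth span with at least the given temporal IoU. A generic sketch, not necessarily identical to the repo's evaluation script:)

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Percentage of samples whose prediction reaches each IoU threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return {t: 100.0 * sum(iou >= t for iou in ious) / len(ious) for t in thresholds}

# e.g. recall_at_iou([(0.0, 12.3), (4.0, 9.0)], [(1.0, 10.5), (3.5, 8.0)])
```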

RenShuhuai-Andy commented 6 months ago

> Many thanks for your reply! For the VATEX videos from Valley, are the videos cropped according to the filenames as well?

We follow the instructions from Valley (https://github.com/RupertLuo/Valley/blob/main/Crawler/README.md#vatex) to download the VATEX videos and do not conduct cropping. The ann_file can be found in https://eric-xw.github.io/vatex-website/data/vatex_training_v1.0.json

RenShuhuai-Andy commented 6 months ago

> Also, using the provided checkpoint, the evaluation results on Charades-STA are different from those reported in Table 2 of the paper. Below are my reproduced results.
>
> # pred video timestamps 3720; # gt video timestamps 3720
> IOU 0.3: 47.33870967741935
> IOU 0.5: 28.091397849462364
> IOU 0.7: 12.82258064516129
>
> Is it because the released model differs from the one reported in the paper (it seems that the paper version is trained on TimeIT only, while the released one is also trained on Valley), or were the results in the paper obtained with ASR captions on Charades-STA? If so, would it be possible to share the code for obtaining ASR captions (the files in whisper_outputs_with_time/tiny.en.cleaned/)? Thank you!

  1. The results reported in the paper were obtained using the TimeIT + Valley dataset (we will note this more clearly in the paper update), and we do not use ASR in our evaluation. For your convenience, the code for ASR can be found at https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#automatic-speech-transcription (a rough sketch is also given below).

  2. Our released ckpt is different from the version used in the paper. The released ckpt was trained after cleaning the code and fixing a minor bug in the QuerYD instruction data (some videos have the same start and end timestamps in the raw annotation file, so we only use one timestamp in the revision). In our evaluation, the performance of the released ckpt on YouCook2 is higher than that in the paper, while the performance on Charades-STA & QVHighlights is lower. We also note that the output generated by the LLM differs from run to run, which may cause fluctuations in the evaluation results. Please let us know if you want the ckpt of the paper version; we can also upload it.
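
For reference, a minimal sketch of producing such timestamped transcripts with the openai-whisper package is given below. The tiny.en model matches the `whisper_outputs_with_time/tiny.en.cleaned/` naming, but the input/output paths and the JSON layout here are illustrative, not the repo's exact format.

```python
import json
import os

import whisper  # pip install openai-whisper

model = whisper.load_model("tiny.en")
result = model.transcribe("videos/example.mp4")  # illustrative input path

# Each Whisper segment carries start/end times in seconds plus the transcribed text.
segments = [
    {"start": round(s["start"], 2), "end": round(s["end"], 2), "text": s["text"].strip()}
    for s in result["segments"]
]

os.makedirs("whisper_outputs_with_time", exist_ok=True)
with open("whisper_outputs_with_time/example.json", "w") as f:
    json.dump(segments, f, indent=2)
```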

yeliudev commented 6 months ago

> We follow the instructions from Valley (https://github.com/RupertLuo/Valley/blob/main/Crawler/README.md#vatex) to download the VATEX videos and do not conduct cropping. The ann_file can be found in https://eric-xw.github.io/vatex-website/data/vatex_training_v1.0.json

Thanks for your detailed reply! But it seems that Valley cropped the VATEX videos according to the filenames (see https://github.com/RupertLuo/Valley/issues/20)...

yeliudev commented 6 months ago

> 1. The results reported in the paper were obtained using the TimeIT + Valley dataset (we will note this more clearly in the paper update), and we do not use ASR in our evaluation. For your convenience, the code for ASR can be found at https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#automatic-speech-transcription
> 2. Our released ckpt is different from the version used in the paper. The released ckpt was trained after cleaning the code and fixing a minor bug in the QuerYD instruction data (some videos have the same start and end timestamps in the raw annotation file, so we only use one timestamp in the revision). In our evaluation, the performance of the released ckpt on YouCook2 is higher than that in the paper, while the performance on Charades-STA & QVHighlights is lower. We also note that the output generated by the LLM differs from run to run, which may cause fluctuations in the evaluation results. Please let us know if you want the ckpt of the paper version; we can also upload it.

I see. Many thanks for your explanation! Does it mean that TimeChat was trained with ASR but does not use it during evaluation, for a fair comparison with existing methods?

yeliudev commented 6 months ago

Also, some QA pairs are missing in instruct_valley_72k.json (from TimeIT) compared with Valley_instruct_73k.json (from Valley), e.g., video VATEX_--33Lscn6sk_000004_000014.

# instruct_valley_72k.json
{
    "video": "vatex/videos/--33Lscn6sk_000004_000014.mp4",
    "QA": [
        {
            "q": "Can you describe what is happening in the video?",
            "a": "In the video, a girl is flipping an omelette while a woman watches her. Then, the woman shakes a skillet and flips a pancake out of it and back into it before flipping it again. The girl is also seen flipping a pancake multiple times. Another scene shows a girl standing with a frying pan and flipping a pancake twice. Lastly, a female films herself flipping food on a pan twice while someone is sitting at a table in the background."
        },
        {
            "q": "What is the girl doing with the omelette?",
            "a": "The girl is flipping the omelette."
        },
        {
            "q": "Who is watching the girl?",
            "a": "A woman is watching the girl."
        }
    ]
}
# Valley_instruct_73k.json
{
    "id": "VATEX_--33Lscn6sk_000004_000014",
    "v_id": "--33Lscn6sk_000004_000014",
    "video": "v_--33Lscn6sk.mp4",
    "source": "VATEX",
    "conversations": [
        {
            "from": "human",
            "value": "Can you describe what is happening in the video?\n<video>"
        },
        {
            "from": "gpt",
            "value": "In the video, a girl is flipping an omelette while a woman watches her. Then, the woman shakes a skillet and flips a pancake out of it and back into it before flipping it again. The girl is also seen flipping a pancake multiple times. Another scene shows a girl standing with a frying pan and flipping a pancake twice. Lastly, a female films herself flipping food on a pan twice while someone is sitting at a table in the background."
        },
        {
            "from": "human",
            "value": "What is the girl doing with the omelette?"
        },
        {
            "from": "gpt",
            "value": "The girl is flipping the omelette."
        },
        {
            "from": "human",
            "value": "Who is watching the girl?"
        },
        {
            "from": "gpt",
            "value": "A woman is watching the girl."
        },
        {
            "from": "human",
            "value": "What does the woman do with the skillet?"
        },
        {
            "from": "gpt",
            "value": "The woman shakes the skillet."
        },
        {
            "from": "human",
            "value": "What does the woman flip with the skillet?"
        },
        {
            "from": "gpt",
            "value": "The woman flips a pancake with the skillet."
        },
        {
            "from": "human",
            "value": "How many times does the girl flip the pancake?"
        },
        {
            "from": "gpt",
            "value": "The girl flips the pancake twice."
        }
    ]
}

RenShuhuai-Andy commented 6 months ago

> Thanks for your detailed reply! But it seems that Valley cropped the VATEX videos according to the filenames (see https://github.com/RupertLuo/Valley/issues/20)...

Yes, you are right, sorry about that...

We cropped the VATEX videos before training (this was done by my teammate), so there is no problem with the released ckpt (the video filenames in https://huggingface.co/datasets/ShuhuaiRen/TimeIT/blob/main/data/valley/instruct_valley_72k.json also refer to the cropped versions).

We have updated the code for processing the Valley dataset; please refer to https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/DATA.md#process-valley. We also noticed that the Valley dataset has been updated (from 73K to 65K); you can reprocess the instruction json if you want to use the new version :)

RenShuhuai-Andy commented 6 months ago

> I see. Many thanks for your explanation! Does it mean that TimeChat was trained with ASR but does not use it during evaluation, for a fair comparison with existing methods?

Yes

RenShuhuai-Andy commented 6 months ago

> Also, some QA pairs are missing in instruct_valley_72k.json (from TimeIT) compared with Valley_instruct_73k.json (from Valley), e.g., video VATEX_--33Lscn6sk_000004_000014.

Yes, we use half of the QA pairs to speed up training. To use the full set of QA pairs, you can reprocess the Valley instruction json with https://github.com/RenShuhuai-Andy/TimeChat/blob/master/utils/process_valley.py
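
In case it helps, a rough sketch of the conversion implied by the two JSON examples above (pairing human/gpt turns into q/a entries) is given below. This is not the logic of utils/process_valley.py, just an illustration; the video path construction and output filename are assumptions based on the VATEX example.

```python
import json

def valley_to_timeit_qa(item: dict) -> dict:
    """Pair up consecutive human/gpt turns from the Valley format into TimeIT-style QA."""
    turns = item["conversations"]
    qa = []
    for human, gpt in zip(turns[0::2], turns[1::2]):
        qa.append({
            "q": human["value"].replace("<video>", "").strip(),  # drop the <video> placeholder
            "a": gpt["value"],
        })
    # Path follows the VATEX example above; other Valley sources may differ.
    return {"video": f"vatex/videos/{item['v_id']}.mp4", "QA": qa}

with open("Valley_instruct_73k.json") as f:
    valley_items = json.load(f)

converted = [valley_to_timeit_qa(x) for x in valley_items if x.get("source") == "VATEX"]
with open("instruct_valley_vatex_full_qa.json", "w") as f:  # hypothetical output name
    json.dump(converted, f, indent=2)
```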

yeliudev commented 6 months ago

I see... Data preprocessing is always tricky 🤣 Thank you so much!

I have a final question regarding the batch size during instruction tuning and fine-tuning (sorry for asking so much... I'm trying my best to understand your method). According to the training config stage2_finetune_time104k_valley72k.yaml, instruction tuning uses 8 GPUs, and each GPU has batch_size_train = 1 and accum_grad_iters = 4, so the effective batch size should be 8 (GPUs) * 1 (per-device batch size) * 4 (accumulation iters) = 32, which is well aligned with the paper. However, iters_per_epoch is set to 1/8 of the dataset size (rather than 1/32). Does it mean that the instruction tuning actually went through the dataset 12 times (i.e., 12 epochs) instead of 3?

Also, I have tried to find the config (number of GPUs, per-device batch size, accumulation iters, and how to set iters_per_epoch) for fine-tuning on YouCook2, Charades-STA, and QVHighlights, but I found that different settings are used in stage2_finetune_{youcook2,charades,qvhighlights}.yaml, as listed below:

# youcook2

# number of GPUs: unknown
iters_per_epoch: 1192 # 1192 / 1
batch_size_train: 2
accum_grad_iters: 4

# charades

# number of GPUs: unknown
iters_per_epoch: 3102 # 12408 / 4
batch_size_train: 1
accum_grad_iters: 8

# qvhighlights

# number of GPUs: unknown
iters_per_epoch: 1714 # 6858 / 4
batch_size_train: 1
accum_grad_iters: 8

I was wondering whether you could kindly clarify the settings for fine-tuning. Thank you!

RenShuhuai-Andy commented 6 months ago

> However, iters_per_epoch is set to 1/8 of the dataset size (rather than 1/32). Does it mean that the instruction tuning actually went through the dataset 12 times (i.e., 12 epochs) instead of 3?

No. In each epoch, we call next(data_loader) (which yields 8 samples per iteration across the 1x8 GPUs) iters_per_epoch times (1/8 of the dataset), so one epoch covers the whole dataset (see https://github.com/RenShuhuai-Andy/TimeChat/blob/master/timechat/tasks/base_task.py#L205).

accum_grad_iters only controls how frequently the parameters are updated (see https://github.com/RenShuhuai-Andy/TimeChat/blob/master/timechat/tasks/base_task.py#L230); it does not change the number of samples per iteration.
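
A schematic, single-GPU sketch of that loop (simplified from base_task.py; the toy model, data, and sizes are illustrative):

```python
import torch
from torch import nn

# Toy stand-ins for the real TimeChat model, optimizer, and data loader.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
data_loader = iter(torch.randn(10_000, 1, 16))   # yields one batch of size 1 per call

num_gpus, batch_size_train, accum_grad_iters = 8, 1, 4
dataset_size = 10_000
iters_per_epoch = dataset_size // num_gpus       # each of the 8 ranks draws its own batch per iter

for i in range(iters_per_epoch):
    samples = next(data_loader)                  # one batch per GPU per iteration
    loss = model(samples).pow(2).mean() / accum_grad_iters  # dummy loss for the sketch
    loss.backward()                              # gradients accumulate across iterations
    if (i + 1) % accum_grad_iters == 0:          # parameters update every accum_grad_iters iters
        optimizer.step()
        optimizer.zero_grad()

# Effective batch size = batch_size_train * num_gpus * accum_grad_iters = 1 * 8 * 4 = 32,
# while one epoch still covers len(dataset) samples across all ranks.
```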

Accordingly, the iters_per_epoch should be set to len(dataset)/num_of_gpus, and the actual bsz is batch_size_train * num_of_gpus * accum_grad_iters. For downstream dataset fine-tuning, you can try

# youcook2

# number of GPUs: 8
iters_per_epoch: 149 # 1192 / 8
batch_size_train: 1
accum_grad_iters: 4

# charades

# number of GPUs: 8
iters_per_epoch: 1551 # 12408 / 8
batch_size_train: 1
accum_grad_iters: 4

# qvhighlights

# number of GPUs: 8
iters_per_epoch: 858 # 6858 / 8
batch_size_train: 1
accum_grad_iters: 4

You can also increase the training epoch for better performance.

yeliudev commented 6 months ago

Thank you so much for your detailed reply!