Hon-Wong / Elysium

[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
https://hon-wong.github.io/Elysium/

Tried to run training and evaluation but ran into some issues. #10

Closed · hongluzhou closed this issue 2 months ago

hongluzhou commented 3 months ago

Thank you so much for open-sourcing this! It's a great effort and truly outstanding work!

I attempted to run the code, but I have some questions:

  1. Should "lmsys/vicuna-7b-v1.5" be used in the "tokenizer" fields of the configuration file? Specifically, I'm referring to lines 43 and 74 of sample_config.yaml.

  2. There seems to be a bug in the data loader code. With the current implementation of "video_llm_data.py", a ValueError is raised at line 238. If I'm understanding correctly, this happens because the lengths of data_dict['frames'] and data_dict['box'] differ by the end of the sample_frames(...) function (line 351).

    To resolve this issue, I inserted the following lines after line 359:

    if 'box' in data_dict:
        # keep only the boxes corresponding to the sampled frames
        data_dict['box'] = [data_dict['box'][i] for i in frames_index]

    This modification ensures that only the boxes corresponding to the sampled frames are retained. Does this bug fix appear correct to you?

  3. I've reformatted LaSOT's test set annotations (the "Sequences of Testing set only" version from http://vision.cs.stonybrook.edu/~lasot/download.html) into JSON format, following the instructions in the Elysium GitHub README. A sample entry in the JSON file looks like this (a rough sketch of the conversion I used follows the sample):

{
  "source": "LaSOT_testset_airplane-1",
  "vid": 0,
  "frame_size": [640, 360],
  "frames": [
    "airplane-1/img/00000001.jpg",
    "airplane-1/img/00000002.jpg",
    "airplane-1/img/00000003.jpg",
    "airplane-1/img/00000004.jpg",
    "airplane-1/img/00000005.jpg",
    ...
  ],
  "box": [
    [0.5734375, 0.28055555555555556, 0.6375, 0.325],
    [0.571875, 0.2861111111111111, 0.6421875, 0.33055555555555555],
    [0.56875, 0.2972222222222222, 0.6390625, 0.3388888888888889],
    [0.565625, 0.30277777777777776, 0.6375, 0.3472222222222222],
    [0.565625, 0.30833333333333335, 0.6375, 0.35833333333333334],
    ...
  ],
  "caption": "white airplane landing on ground",
  "object_class": "airplane",
  "object_description": "airplane"
}
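
Roughly, the conversion I used looks like the sketch below. The normalized [x1, y1, x2, y2] box format is my reading of the expected entry format, and the directory layout, the PIL-based frame-size lookup, the output file name, and the caption handling are specific to my script rather than anything prescribed by the repo:

# Rough conversion sketch: LaSOT groundtruth.txt (x, y, w, h in pixels) to the
# normalized [x1, y1, x2, y2] entry format shown above. Paths, file names, and
# caption handling are assumptions of my script, not part of the Elysium codebase.
import json
import os
from PIL import Image

def convert_sequence(seq_dir, source_prefix="LaSOT_testset"):
    seq_name = os.path.basename(seq_dir.rstrip("/"))
    frame_files = sorted(os.listdir(os.path.join(seq_dir, "img")))
    # Read the frame size from the first image (e.g. 640 x 360 for airplane-1).
    with Image.open(os.path.join(seq_dir, "img", frame_files[0])) as im:
        width, height = im.size
    boxes = []
    with open(os.path.join(seq_dir, "groundtruth.txt")) as f:
        for line in f:
            x, y, w, h = map(float, line.strip().split(","))
            # Normalize (x, y, w, h) in pixels to [x1, y1, x2, y2] in [0, 1].
            boxes.append([x / width, y / height, (x + w) / width, (y + h) / height])
    return {
        "source": f"{source_prefix}_{seq_name}",
        "vid": 0,
        "frame_size": [width, height],
        "frames": [f"{seq_name}/img/{name}" for name in frame_files],
        "box": boxes,
        "caption": "",  # filled in from LaSOT's nlp.txt in my script
        "object_class": seq_name.rsplit("-", 1)[0],
        "object_description": seq_name.rsplit("-", 1)[0],
    }

entries = [convert_sequence(os.path.join("LaSOT_test", name))
           for name in sorted(os.listdir("LaSOT_test"))]  # my local folder layout
with open("lasot_test.json", "w") as f:
    json.dump(entries, f)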

However, I've noticed that the inference speed is extremely slow: it seems it would take days to complete on a single H100 GPU with the default configuration. Is this inference time expected for LaSOT?

To provide more context, the reformatted JSON file contains 280 single-object trajectories, with some trajectories spanning thousands of frames. The figure below is a histogram of trajectory lengths in LaSOT, measured as the number of frames per trajectory. According to the evaluation code, the total number of evaluation samples is 98,036. Does anything about this setup seem unusual or concerning to you?

[figure: num_frames_histogram (frames per trajectory)]
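
For completeness, the histogram was produced with something along the lines of the snippet below, reading the reformatted JSON (the file name is just my local naming from the conversion sketch above):

# Sketch of how the trajectory-length histogram was produced: count the frames
# per trajectory in the reformatted LaSOT JSON and plot the distribution.
import json
import matplotlib.pyplot as plt

with open("lasot_test.json") as f:  # my local name for the reformatted annotations
    trajectories = json.load(f)

lengths = [len(traj["frames"]) for traj in trajectories]
print(f"{len(lengths)} trajectories, {sum(lengths)} frames in total")

plt.hist(lengths, bins=50)
plt.xlabel("frames per trajectory")
plt.ylabel("number of trajectories")
plt.title("LaSOT test set: trajectory lengths")
plt.savefig("num_frames_histogram.png")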

Looking forward to your response!

Hon-Wong commented 3 months ago

Thanks for your attention!

  1. Yes.
  2. The boxes are sampled here, and your modification also works.
  3. Evaluation inference takes approximately a day on 8 A100 GPUs; you may consider using flash-attention-2 for faster processing (a minimal loading sketch is below).
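
As a minimal sketch of what I mean by enabling flash-attention-2, assuming the model is loaded through Hugging Face transformers (the exact loading call in the Elysium code may differ):

# Minimal sketch: enabling FlashAttention-2 when loading a model with
# Hugging Face transformers. The model name here is only illustrative; the
# actual loading path in this repo may differ.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",                   # illustrative base LLM
    torch_dtype=torch.bfloat16,               # FA2 requires fp16 or bf16
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
)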

P.S. The checkpoint performance in the repository is expected to slightly surpass what was reported in the paper due to code refactoring (the AUC on LaSOT is around 58, compared to the reported 56.1).

Feel free to reach out to me if you encounter any problems reproducing the results.

Best regards, Han