Tried to run training and evaluation but ran into some issues.

Thank you so much for open sourcing this! It's a great effort and truly outstanding work!

I attempted to run the code but I got some questions:

Should "lmsys/vicuna-7b-v1.5" be used in the "tokenizer" fields of the configuration file? Specifically, I'm referring to line 43 and line 74 in the sample_config.yaml.
There seems to be a bug in the data loader code. Using the current implementation of "video_llm_data.py", a ValueError in line 238 will be raised. If I'm understanding correctly, this error occurs because the lengths of data_dict['frames'] and data_dict['box'] differ by the end of the sample_frames(...) function in line 351.

To resolve this issue, I inserted the following lines after line 359:
```
if 'box' in data_dict:
   data_dict['box'] = [data_dict['box'][i] for i in frames_index]
```
This modification ensures that only the boxes corresponding to the sampled frames are retained. Does this bug fix appear correct to you?
I've reformatted LaSOT's test set annotations (the version "Sequences of Testing set only" from http://vision.cs.stonybrook.edu/~lasot/download.html) into JSON format following the instructions in the Elysium GitHub readme. A sample entry in the JSON file looks like this:

{
  "source": "LaSOT_testset_airplane-1",
  "vid": 0,
  "frame_size": [640, 360],
  "frames": [
    "airplane-1/img/00000001.jpg",
    "airplane-1/img/00000002.jpg",
    "airplane-1/img/00000003.jpg",
    "airplane-1/img/00000004.jpg",
    "airplane-1/img/00000005.jpg",
    ...
  ],
  "box": [
    [0.5734375, 0.28055555555555556, 0.6375, 0.325],
    [0.571875, 0.2861111111111111, 0.6421875, 0.33055555555555555],
    [0.56875, 0.2972222222222222, 0.6390625, 0.3388888888888889],
    [0.565625, 0.30277777777777776, 0.6375, 0.3472222222222222],
    [0.565625, 0.30833333333333335, 0.6375, 0.35833333333333334],
    ...
  ],
  "caption": "white airplane landing on ground",
  "object_class": "airplane",
  "object_description": "airplane"
}

However, I've noticed that the inference speed is extremely slow—seemingly taking days to complete on a single H100 GPU with the default configurations. Is this inference time expected for LaSOT?

To provide more context, the reformatted JSON file contains 280 single-object trajectories, with some trajectories spanning thousands of frames. The following figure presents a histogram of trajectory lengths in LaSOT, measured by the number of frames per trajectory. According to the evaluation code, the total number of evaluation samples is 98,036. Does anything about this setup seem unusual or concerning to you? num_frames_histogram

Looking forward to your response!

Hon-Wong / Elysium

Tried to run training and evaluation but ran into some issues. #10