PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Tokenizer in different code version #125

Closed countytown closed 5 months ago

countytown commented 8 months ago

Hi~ Thanks a lot for the new version of the code, which has made the framework much easier to understand. But I noticed that some details have also changed, e.g., the tokenizer part:

old version:

def tokenizer_X_token(prompt, tokenizer, X_token_index, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(f'<{X_INDEX_TOKEN[X_token_index].lower()}>')]
    ...

new version:

def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
    ...
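For context, both versions do the same kind of thing: split the prompt on a placeholder string and splice a sentinel token index back in between the tokenized chunks. A minimal, self-contained sketch of that pattern (a toy whitespace tokenizer stands in for the real HuggingFace tokenizer, and BOS handling from the real code is omitted):

```python
IMAGE_TOKEN_INDEX = -200  # sentinel id used in the LLaVA-family codebases


class ToyTokenizer:
    """Hypothetical stand-in: maps each whitespace-separated word to an id."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        ids = [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]
        return type("Out", (), {"input_ids": ids})()


def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX):
    # Split on the literal '<image>' placeholder and tokenize each text chunk.
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
    input_ids = []
    for i, chunk in enumerate(prompt_chunks):
        if i > 0:
            # Splice the sentinel back in where the placeholder was.
            input_ids.append(image_token_index)
        input_ids.extend(chunk)
    return input_ids


tok = ToyTokenizer()
ids = tokenizer_image_token("USER: <image> describe the video", tok)
print(ids)
```

The only structural difference from the old version is that the placeholder string is hard-coded to `<image>` instead of being looked up per modality.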

Should I worry about any performance degradation? Because:

  1. it looks like video and image are now treated the same?
  2. the original training samples include separator symbols like `\n`

In fact, I am trying to fine-tune with new modalities like audio and depth, so is there any conflict with the current version (besides the LanguageBind part)?
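For what it's worth, the old version's generic X-token scheme extends naturally to new modalities: add an entry to the index-to-placeholder map and each modality gets its own sentinel id and prompt placeholder. A hedged sketch (the audio/depth sentinel values and the exact shape of `X_INDEX_TOKEN` are assumptions modeled on `IMAGE_TOKEN_INDEX = -200`, not the repo's actual constants):

```python
# Sentinel ids below -200 are hypothetical, chosen by analogy with LLaVA.
IMAGE_TOKEN_INDEX = -200
VIDEO_TOKEN_INDEX = -201
AUDIO_TOKEN_INDEX = -202  # hypothetical new modality
DEPTH_TOKEN_INDEX = -203  # hypothetical new modality

# Maps each sentinel id to the placeholder name used in prompts.
X_INDEX_TOKEN = {
    IMAGE_TOKEN_INDEX: "IMAGE",
    VIDEO_TOKEN_INDEX: "VIDEO",
    AUDIO_TOKEN_INDEX: "AUDIO",
    DEPTH_TOKEN_INDEX: "DEPTH",
}


def split_prompt(prompt, x_token_index):
    # Split on the modality's placeholder, e.g. '<audio>' for audio.
    return prompt.split(f'<{X_INDEX_TOKEN[x_token_index].lower()}>')


print(split_prompt("Listen: <audio> what is being said?", AUDIO_TOKEN_INDEX))
```

Under this scheme the tokenizer side needs no other change; the per-modality encoders (the LanguageBind part you mention) are the separate concern.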

Thank you so much~☺