Hi~ Thanks a lot for the new version of the code, which has made the framework much easier to understand. But I noticed that some details have also changed, e.g., the tokenizer part:
old version:
```python
def tokenizer_X_token(prompt, tokenizer, X_token_index, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split(f'<{X_INDEX_TOKEN[X_token_index].lower()}>')]
    ...
```
new version:
```python
def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
    ...
```
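For reference, my understanding of what the elided part does is roughly this (a minimal sketch of the split-and-interleave idea; the helper name `interleave_with_token` and the `-200` sentinel value are my assumptions, not necessarily the repo's actual code):

```python
# Minimal sketch of my understanding (assumed, not the repo's exact code):
# after splitting the prompt on '<image>', the chunks are re-joined with a
# sentinel index so visual features can later be swapped in at those positions.
IMAGE_TOKEN_INDEX = -200  # assumed value; check the repo's constants

def interleave_with_token(chunks, token_index):
    """Interleave token_index between consecutive token-id chunks."""
    input_ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            input_ids.append(token_index)
        input_ids.extend(chunk)
    return input_ids

# Example: '<image>\nWhat is this?' splits into ['', '\nWhat is this?'],
# so the sentinel lands at the start of the final sequence.
```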
Should I worry about any performance degradation, since it looks like video and image are now treated the same?
Also, the original training samples include symbols like `\\n` and `\n`.
In fact, I am trying to finetune with new modalities like audio and depth, so is there any conflict with the current version (besides the LanguageBind part)?
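To make the question concrete, here is a rough sketch of how I imagine generalizing the new function back to multiple modality tokens, similar to the old `tokenizer_X_token`. The `<audio>`/`<depth>` tags and the sentinel index mapping are my assumptions, not existing code in the repo:

```python
import re

# Hypothetical generalization (my assumption, not existing repo code):
# split on any of several modality tags and insert the matching sentinel
# index, mirroring the old tokenizer_X_token behaviour.
MODALITY_TOKEN_INDEX = {   # assumed sentinel values
    '<image>': -200,
    '<video>': -201,
    '<audio>': -202,       # new modality I want to add
    '<depth>': -203,       # new modality I want to add
}

def tokenizer_multimodal_token(prompt, tokenizer):
    # Capturing group keeps the matched tags in re.split's output.
    pattern = '(' + '|'.join(re.escape(t) for t in MODALITY_TOKEN_INDEX) + ')'
    input_ids = []
    for piece in re.split(pattern, prompt):
        if piece in MODALITY_TOKEN_INDEX:
            input_ids.append(MODALITY_TOKEN_INDEX[piece])
        elif piece:  # skip empty strings between adjacent tags
            input_ids.extend(tokenizer(piece, add_special_tokens=False).input_ids)
    return input_ids
```

(This sketch ignores BOS-token handling, so it is only meant to show the splitting logic.) Would something like this conflict with the current version?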
Thank you so much~☺