Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

How to select the '_instructions.json' and '_train.json' file of the 'LA' task? #209

Open Hongbin98 opened 1 year ago

Hongbin98 commented 1 year ago

As stated in the latest paper: 'Trained on the LA task, the model exhibits exceptional scene comprehension, reasoning abilities, and multi-round conversation capabilities.'

I am very interested in this part and want to train Otter on 'LA'. However, LA.zip contains several '_instructions.json' and '_train.json' files, so I am not sure which files to select to train my model.

Could you share the training command with me? Thanks~

Hongbin98 commented 1 year ago

I am glad to see that you have already noticed this issue. :)

I just tried to train otter-7b on the 'LACONV' splits. One minor question: why do we need to train for so many steps (a total of 126405 iterations), when the model already seems to have converged after 200 steps?

(screenshot attached)

Hongbin98 commented 1 year ago

Note that I also set num_epochs=9, following the default settings.
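A rough sanity check on these numbers (this is just arithmetic on the figures reported above, assuming the total iteration count scales linearly with the number of epochs):

```python
# Back-of-the-envelope check: with num_epochs=9 and 126405 total
# iterations, how many optimizer steps does one epoch take?
# (Assumption: the schedule is simply steps_per_epoch * num_epochs.)
total_iterations = 126405   # reported in the screenshot above
num_epochs = 9              # default setting mentioned above

steps_per_epoch = total_iterations // num_epochs
print(steps_per_epoch)  # → 14045
```

So each epoch alone is ~14k steps, which is why the total is so large even if the loss flattens early.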

linziyi96 commented 1 year ago

Hi! I'm also very interested in the LA-interleaved dataset, but I seem to have missed a lot of details about it. Could someone explain the meaning of the abbreviations? Is there any information about how exactly each part/version is constructed? I notice that a few places in the paper (https://arxiv.org/pdf/2306.05425.pdf) refer to the appendix for details, but all the appendix sections seem to be unrelated.

ZhangYuanhan-AI commented 1 year ago

> Hi! I'm also super interested in the LA-interleaved dataset but seems to have missed a lot of details about it. Could someone give any help about the meaning of the abbreviations? Are there any information about how exactly each part/version is constructed? I do notice a few places in the paper (https://arxiv.org/pdf/2306.05425.pdf) referring to the appendix for the details but all appendix sections seem to be unrelated.

Hi ziyi,

Generally, LA-interleaved is built by retrieving in-context examples for each (Q, A, I) triplet in the LLaVA complex-reasoning data, producing a multi-modal in-context learning format.

The motivation behind building LA-interleaved is that we experimentally found that, without such data, an instruction-tuned Flamingo loses its in-context learning ability.

There are two ways to retrieve in-context examples for a given query triplet (Q(uestion), A(nswer), I(mage)):

  1. Finding the top-k most similar questions to Q, using "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2", which gives LA_T2T.
  2. Finding the top-k most similar images to I, using CLIP B/16, which gives LA_I2I.
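Both retrieval modes boil down to a top-k nearest-neighbour search over embeddings. Here is a minimal NumPy sketch of that step; in the actual pipeline the embedding arrays would come from all-MiniLM-L6-v2 (for questions, LA_T2T) or CLIP B/16 (for images, LA_I2I), but this sketch just assumes they are given:

```python
import numpy as np

def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k corpus rows most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per corpus row
    return np.argsort(-sims)[:k]      # highest similarity first

# Toy usage with 3-dim embeddings: retrieve the 2 nearest neighbours.
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.0])
print(top_k_similar(query, corpus, k=2))  # → [0 1]
```

The retrieved indices then point at the (Q, A, I) triplets used as in-context examples for the query.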

vishaal27 commented 1 year ago

Thanks for the response @ZhangYuanhan-AI, could you also describe the LAConv and LADD splits?