Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Recommended next steps #321

Closed: StrangeTcy closed this issue 6 months ago

StrangeTcy commented 7 months ago

Following our finetuning attempts (https://github.com/Luodian/Otter/issues/320), we now have checkpoints for OtterHD finetuned on two of the LA datasets (out of all the possible ones). Now we have two choices:

  1. try and evaluate them -- that'd be evaluating a model that's been finetuned on 2 datasets out of potentially hundreds via a large number of calls to GPT-4 (short version: I expect the results to not be great)
  2. try running inference on it and look for ourselves, to get a "feel" of the model's performance.

For that second option we can use different scripts, like cli.py and gradio_web_server.py and possibly others. But perhaps there are scripts you'd recommend as best for OtterHD specifically?

ETA: there's also inference.py in demos; it just requires a yaml file, and we don't have one.

UPD: I've modified inference.py a bit and wrote a yaml file with a single question about a single image (the one from the demo with rows of apples). The model we have now can answer that question; it just answers it wrong. So, should we finetune it further?
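
For reference, here is roughly the kind of single-image check we're doing, as a minimal sketch that loads the finetuned output folder directly as a Fuyu-architecture checkpoint via Hugging Face transformers instead of going through the repo's inference.py; the checkpoint path, image file, and prompt are placeholders, and dtype/device settings may need tweaking:

# Minimal single-image inference sketch (not the repo's inference.py).
# Assumes the finetuned folder loads with the stock Fuyu classes;
# paths and the prompt are placeholders.
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

ckpt_dir = "/path/to/otterhd_finetuned_checkpoint"  # placeholder: our finetuning output folder
processor = FuyuProcessor.from_pretrained(ckpt_dir)
model = FuyuForCausalLM.from_pretrained(ckpt_dir, device_map="cuda:0")  # consider torch_dtype=torch.bfloat16 to save memory

image = Image.open("apples.jpg")  # the demo image with rows of apples
prompt = "How many apples are there in this image?\n"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

generated = model.generate(**inputs, max_new_tokens=32)
# decode only the newly generated tokens, skipping the prompt
answer = processor.batch_decode(generated[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(answer)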

Luodian commented 6 months ago

I'm not quite sure which datasets you used. Could you provide more information?

If you only use LA_DD, LA_CONV, and LACR_T2T and finetune at 512x512 resolution, 2-3 epochs should take 1-2 hours. The model would then show signs of life.

As for hosting the model, please use:

endpoint code: https://github.com/Luodian/Otter/blob/main/pipeline/serve/deploy/otterhd_endpoint.py

frontend code: https://huggingface.co/spaces/Otter-AI/OtterHD-Demo/blob/main/app.py

StrangeTcy commented 6 months ago

I'm not quite sure which datasets you used. Could you provide more information?

If you only use LA_DD, LA_CONV, and LACR_T2T and finetune at 512x512 resolution, 2-3 epochs should take 1-2 hours. The model would then show signs of life. -- https://github.com/Luodian/Otter/blob/main/shared_scripts/Demo_Data.yaml doesn't mention LACONV, so we didn't use it, even though https://entuedu-my.sharepoint.com/personal/libo0013_e_ntu_edu_sg/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Flibo0013%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2FMIMICIT%5FParquets&ga=1 has LACONV_instructions.json.

So, we used the finetuning script you suggested (https://github.com/Luodian/Otter/blob/main/docs/OtterHD.md#how-to-finetune), and it would only work with a batch size of 1, which eventually took about 11 hours. Not sure about signs of life, but we got checkpoints that work okay with inference.py; now we'd also like to have a Gradio service built on top of them.

As for hosting the model, please use:

endpoint code: https://github.com/Luodian/Otter/blob/main/pipeline/serve/deploy/otterhd_endpoint.py

-- that looks great, except it's a Flask app, which we'll have trouble accessing publicly (it would probably run fine on a local machine, but then so would a console inference script)

frontend code: https://huggingface.co/spaces/Otter-AI/OtterHD-Demo/blob/main/app.py

-- that's the one you're currently using, except this one has a really simple definition of http_bot, while the original gradio_web_server.py has a much longer and more complicated one

ETA: actually, we have now tried running a modified app.py against the URL that otterhd_endpoint.py outputs, and it just didn't work. I'm also not sure the fn from vqa_btn.click() ever actually gets called in our case.

StrangeTcy commented 6 months ago

OK, we've been using gradio==3.23.0, which was probably a bad idea. Switching to 4.11.0 made everything work. Specifically, it became apparent that the model is poorly finetuned on only two sets of instructions. So the question now is "how do we go on finetuning it?"; the instruction_following script seems to accept fuyu as an argument, not an already-finetuned OtterHD checkpoint.

ETA: train_args.py has this argument:

parser.add_argument(
    "--trained_ckpt",
    type=str,
    help="path to trained_ckpt",
    default=None,
)

while instruction_following.py has:

if args.trained_ckpt is not None:
    train_ckpt = torch.load(args.trained_ckpt, map_location="cpu")
    if train_ckpt.get("model_state_dict", None) is not None:
        train_ckpt = train_ckpt["model_state_dict"]
    _ = model.load_state_dict(train_ckpt, strict=False)
    print(_[1])

I'm just not sure it'd work with the checkpoints we already have.
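
One way to check that before launching another run -- a minimal sketch that reproduces the strict=False loading above against our existing checkpoint; the base model id and the checkpoint path here are assumptions:

# Sanity check: does our finetuned checkpoint load through the same strict=False
# code path instruction_following.py uses for --trained_ckpt?
# "adept/fuyu-8b" and the checkpoint path are placeholders/assumptions.
import torch
from transformers import FuyuForCausalLM

model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b", torch_dtype=torch.bfloat16)
train_ckpt = torch.load("output/otterhd_la_finetune/pytorch_model.bin", map_location="cpu")
if train_ckpt.get("model_state_dict", None) is not None:
    train_ckpt = train_ckpt["model_state_dict"]
missing, unexpected = model.load_state_dict(train_ckpt, strict=False)
# both lists should be (close to) empty if the checkpoint really matches the architecture
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")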

Another thing is the dataset itself: let's take CGD as an example. As I understand it, we're supposed to have lines like

IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
  CGD: # dataset name can be any name you want
    mimicit_path: data_folder/json/CGD/CGD_instructions.json # Path of the instruction json file
    images_path: data_folder/Parquets/CGD.parquet # Path of the image parquet file
    num_samples: -1 # Number of samples to use; -1 means use all samples. If not set, the default is -1.

Now, if we look at the files on Hugging Face (https://huggingface.co/datasets/pufanyi/MIMICIT/tree/main/data/CGD), CGD_instructions.json is present and small, CGD.json is also present and huge but isn't used by this setup, and there are 9 parts of CGD.parquet. How should we use all of that?
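
For the parquet parts, one possible approach -- a minimal sketch assuming the 9 parts are row-wise shards of a single table and the trainer expects one parquet file at images_path; the file names are placeholders:

# Merge the sharded CGD parquet parts into the single file the yaml entry points at.
# Assumes the parts are row-wise shards of one table; file names are placeholders.
import glob
import pandas as pd

parts = sorted(glob.glob("data_folder/Parquets/CGD/CGD-*.parquet"))
cgd = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
print(len(cgd), cgd.columns.tolist())  # sanity-check the merged table
cgd.to_parquet("data_folder/Parquets/CGD.parquet")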

demoninpiano commented 6 months ago

Hi @StrangeTcy, how do you get the OtterHD checkpoint to finetune it?

StrangeTcy commented 6 months ago

Hi @StrangeTcy, how do you get the OtterHD checkpoint to finetune it?

I follow the instructions from the OtterHD readme (https://github.com/Luodian/Otter/blob/main/docs/OtterHD.md#how-to-finetune), which get you a folder with a lot of JSON files and one huge pytorch_model.bin.