Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License
3.56k stars · 241 forks

CGD can't train with other datasets #229

Open Darren-greenhand opened 1 year ago

Darren-greenhand commented 1 year ago

I successfully ran the training code on each dataset individually, and I want to train them all together with this config:

mimicit_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json,/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json,/tf/data/SD/SD_instructions.json,/tf/data/CGD/CGD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/SD/SD.json,/tf/data/CGD/CGD.json" \
--train_config_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json,/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json,/tf/data/SD/SD_train.json,/tf/data/CGD/CGD_train.json" \

but it fails. I tried several times and found that once I add the CGD dataset, a strange error appears, even though CGD can be trained on its own. The error stack:

Traceback (most recent call last):
  File "/tf/Otter/pipeline/train/instruction_following.py", line 656, in <module>
    main()
  File "/tf/Otter/pipeline/train/instruction_following.py", line 523, in main
    mimicit_loaders = get_data(args, image_processor, tokenizer, "mimicit")
  File "/tf/Otter/pipeline/train/data.py", line 656, in get_data
    return get_dataset_fn(dataset_type)(args, image_processor=image_processor, epoch=epoch, tokenizer=tokenizer)
  File "/tf/Otter/pipeline/train/data.py", line 580, in get_mimicit_dataset
    unified_dataset = MimicitDataset(args, all_mimicit_path, all_images_path, all_train_config_path, status_list=status)
  File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 130, in __init__
    self.images.update(orjson.loads(f.read()))
orjson.JSONDecodeError: memory allocation failed: line 1 column 1 (char 0)

I stepped through the code but can't understand why this happens. The answers I googled say it means `f.read()` is empty, but each file works fine when loaded on its own?! πŸ˜‚
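One way to narrow this down is to try parsing each images JSON on its own before launching training, so the failing file is identified directly. A minimal sketch (the helper name is made up; it uses the stdlib `json` module here, while the training code uses `orjson`, which exposes the same `loads`-on-bytes interface for this purpose):

```python
import json
import os

def check_images_files(paths):
    """Try to parse each images JSON on its own and report what happens."""
    results = {}
    for path in paths:
        size_mb = os.path.getsize(path) / 1e6
        try:
            with open(path, "rb") as f:
                data = json.loads(f.read())
            results[path] = f"OK ({len(data)} keys, {size_mb:.1f} MB)"
        except (json.JSONDecodeError, MemoryError) as e:
            results[path] = f"FAILED: {e}"
    return results
```

If every file parses cleanly in isolation but the combined run fails, the problem is more likely cumulative memory use than a corrupt file.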

Luodian commented 1 year ago

The issue is possibly that you cannot mix single-image image-text datasets with multi-image image-text datasets in the same loader.

`vision_x` must be a tensor of shape `(B, T, F, C, H, W)`, where `T=1` means a single image and `T>1` means multiple in-context images. Datasets of the same type should be grouped together to enable multi-batch training.
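The constraint above can be sketched as a bucketing step: before batching, group samples by their number of in-context images `T`, so that every batch stacks tensors of one shape. This is an illustrative sketch only, not Otter's actual data pipeline; the sample structure (a dict with an `"images"` list) is an assumption:

```python
from collections import defaultdict

def bucket_by_context_length(samples):
    """Group samples so each bucket shares the same T (number of
    in-context images); a batch drawn from a single bucket can then
    be stacked into one (B, T, F, C, H, W) tensor."""
    buckets = defaultdict(list)
    for sample in samples:
        t = len(sample["images"])  # T: in-context images in this sample
        buckets[t].append(sample)
    return dict(buckets)
```

Mixing buckets within one batch is exactly what makes the later `torch.stack` call fail.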

Can you try loading CGD with the `ic` series of args?

    parser.add_argument(
        "--mimicit_ic_path",
        type=str,
        default="",
        help="Path to the new in-context image-text dataset. Should be in format /path/to/xx_instruction.json",
    )
    parser.add_argument(
        "--images_ic_path",
        type=str,
        default="",
        help="Path to the new in-context images dataset. Should be in format /path/to/xx.json",
    )
    parser.add_argument(
        "--train_config_ic_path",
        type=str,
        default="",
        help="Path to the new in-context training config dataset. Should be in format /path/to/xx_train.json",
    )

You should be able to use the `ic` (in-context) args to load datasets that may contain multiple images as in-context examples.

Luodian commented 1 year ago

That could be an issue in CGD, but your error seems to come from the JSON files themselves, not the loading procedure.

Darren-greenhand commented 1 year ago

Hi! I tried this method and set the training args like:

--mimicit_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json,/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json,/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--mimicit_ic_path="/tf/data/CGD/CGD_instructions.json" \
--images_ic_path="/tf/data/CGD/CGD.json" \
--train_config_ic_path="/tf/data/CGD/CGD_train.json" \

However, it raises the same problem. I then replaced CGD with SD and hit the same error.

And when I increased the batch size to 2, another problem appeared, even with just the LA datasets:

--mimicit_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json,/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json,/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--batch_size  2 \
....
...
Original Traceback (most recent call last):
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 627, in collate
    res_v1 = collate_fn(
  File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 666, in collate_fn
    batch["net_input"]["patch_images"] = torch.stack([sample["patch_images"] for sample in samples], dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [1, 1, 3, 224, 224] at entry 0 and [3, 1, 3, 224, 224] at entry 1

I think it is hard to run them together. Can I train on the datasets one by one and still get a good result? πŸ˜‚

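The `RuntimeError` above means the collate function received samples whose `patch_images` tensors disagree in the `T` dimension (`[1, 1, 3, 224, 224]` vs `[3, 1, 3, 224, 224]`), which `torch.stack` cannot handle. A small sketch of the kind of guard one could run before stacking; shapes are shown as plain tuples here (in the real code they would be `tensor.shape` values), and the function name is made up:

```python
def check_stackable(shapes):
    """torch.stack requires identical shapes; return the index of the
    first offending entry if the batch is not stackable, else None."""
    first = shapes[0]
    for i, shape in enumerate(shapes[1:], start=1):
        if shape != first:
            return i
    return None
```

For the shapes in the traceback, this flags entry 1: the first sample has `T=1` (single image) while the second has `T=3` (in-context images), confirming the two dataset types were mixed in one batch.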
Luodian commented 1 year ago

@ZhangYuanhan-AI

ZhangYuanhan-AI commented 1 year ago

@Darren-greenhand

in mimicit_dataset.py

try to rewrite:

elif cur_train_id.startswith("SD"):

to

elif cur_train_id.startswith("SD") or cur_train_id.startswith("CGD"):

and see whether it works.
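As a side note, `str.startswith` also accepts a tuple of prefixes, so the suggested condition can be written more compactly. A sketch (the helper name is made up for illustration):

```python
def matches_sd_or_cgd(cur_train_id):
    # Equivalent to: startswith("SD") or startswith("CGD")
    return cur_train_id.startswith(("SD", "CGD"))
```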

Darren-greenhand commented 1 year ago

πŸ˜‚ Hi, it doesn't work and the error is the same, so: I can't train with the CGD or SD dataset when training across several datasets; I can't set batch size > 1 when training only the 4 LA datasets; and I can't train the SN dataset that I processed (I found a strange matching relationship in it; replacing only `00` didn't help, and replacing `00`-`06` also failed). πŸ˜‚

ZhangYuanhan-AI commented 1 year ago

Try this.

--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--mimicit_ic_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json" \
--images_ic_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_ic_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json" \
--mimicit_vt_path="/tf/data/CGD/CGD_instructions.json" \
--images_vt_path="/tf/data/CGD/CGD.json" \

Darren-greenhand commented 1 year ago

The same problem QWQ. I tried `orjson.loads` and `images.update()` in IPython and they work well. May I see the training config you used to train Otter?

ZhangYuanhan-AI commented 1 year ago

> The same problem QWQ. I tried `orjson.loads` and `images.update()` in IPython and they work well. May I see the training config you used to train Otter?

Can you possibly figure out which dataset is causing this error?

--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" --images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" --train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json"

Does this configuration work?

Darren-greenhand commented 1 year ago

@ZhangYuanhan-AI Sorry for the late reply 😭, my server crashed the other day. I tried three dataset groups (vanilla, ic, vt (SD+CGD)) and found that:

  1. vanilla, ic, and vt each work well when trained alone
  2. Succeeded: vanilla+ic+SD, vanilla+ic, vanilla+CGD/SD, ic+CGD/SD
  3. Failed: vanilla+ic+vt/CGD, vanilla+vt, ic+vt

It is strange: SD and CGD can each be trained with vanilla or ic, and SD can be trained with CGD, but vanilla/ic + vt fails 😱

vanilla:

--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \

ic:

--mimicit_ic_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json" \
--images_ic_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_ic_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json" \

vt:

--mimicit_vt_path="/tf/data/SD/SD_instructions.json,/tf/data/CGD/CGD_instructions.json" \
--images_vt_path="/tf/data/SD/SD.json,/tf/data/CGD/CGD.json" \

ZhangYuanhan-AI commented 1 year ago

Ok.

Have you rewritten:

elif cur_train_id.startswith("SD"):

to

elif cur_train_id.startswith("SD") or cur_train_id.startswith("CGD"):

?

Luodian commented 1 year ago

> Ok.
>
> Have you rewritten:
>
> elif cur_train_id.startswith("SD"):
>
> to
>
> elif cur_train_id.startswith("SD") or cur_train_id.startswith("CGD"):
>
> ?

@Darren-greenhand I think this would address the problem.

Darren-greenhand commented 1 year ago

@Luodian @ZhangYuanhan-AI Hi, I'm sure I have rewritten it before my test πŸ’―

ZhangYuanhan-AI commented 1 year ago

Weird. We will test it tomorrow, stay tuned.

Darren-greenhand commented 1 year ago

Thx a lot for your great work and your help QWQ 😭 As a greenhand I learned a lot πŸ™‡πŸ»β€β™‚οΈ

ZhangYuanhan-AI commented 1 year ago

> Thx a lot for your great work and your help QWQ 😭 As a greenhand I learned a lot πŸ™‡πŸ»β€β™‚οΈ

Try this branch please. https://github.com/Luodian/Otter/tree/yhzhang/dev_otter_l

In this branch, the code runs well.

And we will merge this branch into main soon.

Darren-greenhand commented 1 year ago

Hi, I think I made a mistake earlier: I was using the old training script. Last night I used the latest version, but it still fails with the same problem when I use:

--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--mimicit_ic_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json" \
--images_ic_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_ic_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json" \
--mimicit_vt_path="/tf/data/CGD/CGD_instructions.json,/tf/data/SD/SD_instructions.json" \
--images_vt_path="/tf/data/CGD/CGD.json,/tf/data/SD/SD.json" \

And I got the same problem QWQ. Is this a problem with my server? But I can run the same load in IPython just fine.

Traceback (most recent call last):
  File "/tf/Otter/pipeline/train/instruction_following.py", line 656, in <module>
    main()
  File "/tf/Otter/pipeline/train/instruction_following.py", line 523, in main
    mimicit_loaders = get_data(args, image_processor, tokenizer, "mimicit")
  File "/tf/Otter/pipeline/train/data.py", line 656, in get_data
    return get_dataset_fn(dataset_type)(args, image_processor=image_processor, epoch=epoch, tokenizer=tokenizer)
  File "/tf/Otter/pipeline/train/data.py", line 580, in get_mimicit_dataset
    unified_dataset = MimicitDataset(args, all_mimicit_path, all_images_path, all_train_config_path, status_list=status)
  File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 130, in __init__
    self.images.update(orjson.loads(f.read()))
orjson.JSONDecodeError: memory allocation failed: line 1 column 1 (char 0)

ZhangYuanhan-AI commented 1 year ago

> Hi, I think I made a mistake earlier: I was using the old training script. Last night I used the latest version, but it still fails with the same problem when I use:
>
> --mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
> --images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
> --train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
> --mimicit_ic_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json" \
> --images_ic_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
> --train_config_ic_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json" \
> --mimicit_vt_path="/tf/data/CGD/CGD_instructions.json,/tf/data/SD/SD_instructions.json" \
> --images_vt_path="/tf/data/CGD/CGD.json,/tf/data/SD/SD.json" \
>
> And I got the same problem QWQ. Is this a problem with my server? But I can run the same load in IPython just fine.
>
> Traceback (most recent call last):
>   File "/tf/Otter/pipeline/train/instruction_following.py", line 656, in <module>
>     main()
>   File "/tf/Otter/pipeline/train/instruction_following.py", line 523, in main
>     mimicit_loaders = get_data(args, image_processor, tokenizer, "mimicit")
>   File "/tf/Otter/pipeline/train/data.py", line 656, in get_data
>     return get_dataset_fn(dataset_type)(args, image_processor=image_processor, epoch=epoch, tokenizer=tokenizer)
>   File "/tf/Otter/pipeline/train/data.py", line 580, in get_mimicit_dataset
>     unified_dataset = MimicitDataset(args, all_mimicit_path, all_images_path, all_train_config_path, status_list=status)
>   File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 130, in __init__
>     self.images.update(orjson.loads(f.read()))
> orjson.JSONDecodeError: memory allocation failed: line 1 column 1 (char 0)

Hi, this error might not come from our code, as we can run it smoothly. Maybe the error comes from your CPU memory (RAM).
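If RAM is the suspect, a quick sanity check is to total up the on-disk size of all the JSON files a run will parse: `orjson` has to hold both the raw bytes and the parsed dict in memory at once, so peak usage is a multiple of this figure. A rough stdlib-only sketch (the helper name is made up):

```python
import os

def total_json_size_mb(paths):
    """Sum the on-disk size of the JSON files that will be parsed.
    Peak RAM while loading is at least this total, and typically
    2-3x it, since raw bytes and parsed objects coexist briefly."""
    return sum(os.path.getsize(p) for p in paths) / 1e6
```

Comparing that total against the machine's free memory (e.g. `free -m` on Linux) would show whether the "memory allocation failed" message is plausible for this config.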

Darren-greenhand commented 1 year ago

ok πŸ˜‚ Let me try.