Darren-greenhand opened 1 year ago
The issue could be that you cannot train single image-text datasets together with multi-image in-context datasets. The vision_x tensor should have shape (B, T, F, C, H, W), where T=1 means a single image and T=x means x in-context images. Datasets of the same type should be grouped together so that multi-batch training is possible.
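The constraint above can be sketched in a few lines (shapes assumed from the stack error later in this thread): torch.stack requires every tensor in a batch to have exactly the same shape, so samples with different T cannot share a batch.

```python
import torch

# vision_x samples are (T, F, C, H, W); collation stacks them into (B, T, F, C, H, W).
single = torch.zeros(1, 1, 3, 224, 224)      # T=1: one image
in_context = torch.zeros(3, 1, 3, 224, 224)  # T=3: three in-context images

# Same T stacks fine:
batch = torch.stack([single, single], dim=0)
print(batch.shape)  # torch.Size([2, 1, 1, 3, 224, 224])

# Mixed T fails -- this is why same-type datasets must be grouped together:
try:
    torch.stack([single, in_context], dim=0)
except RuntimeError as e:
    print(e)  # stack expects each tensor to be equal size ...
```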
Can you try loading CGD with the ic series args?
parser.add_argument(
"--mimicit_ic_path",
type=str,
default="",
help="Path to the new in-context image-text dataset. Should be in format /path/to/xx_instruction.json",
)
parser.add_argument(
"--images_ic_path",
type=str,
default="",
help="Path to the new in-context images dataset. Should be in format /path/to/xx.json",
)
parser.add_argument(
"--train_config_ic_path",
type=str,
default="",
help="Path to the new in-context training config dataset. Should be in format /path/to/xx_train.json",
)
Supposedly you could use ic (in-context) to load datasets that may contain multiple images as in-context examples. That could be an issue in CGD. But seemingly your error comes from the JSON files, not the loading procedure.
Hi! I tried the method and set the training args like:
--mimicit_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json,/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json,/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--mimicit_ic_path="/tf/data/CGD/CGD_instructions.json" \
--images_ic_path="/tf/data/CGD/CGD.json" \
--train_config_ic_path="/tf/data/CGD/CGD_train.json" \
However, it raises the same problem. I then replaced CGD with SD and found the same problem.
And when I set the batch size to 2, another problem appears, even with just the LA datasets:
--mimicit_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json,/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json,/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--batch_size 2 \
....
...
Original Traceback (most recent call last):
File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/tf/anaconda3/envs/otter/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 627, in collate
res_v1 = collate_fn(
File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 666, in collate_fn
batch["net_input"]["patch_images"] = torch.stack([sample["patch_images"] for sample in samples], dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [1, 1, 3, 224, 224] at entry 0 and [3, 1, 3, 224, 224] at entry 1
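As a hedged aside (not the fix used in this thread): one could make the collate step tolerate mixed T values by zero-padding each sample's patch_images along T to the batch maximum before stacking. The helper below is hypothetical; whether the model handles zero-padded frames correctly is a separate question.

```python
import torch

def pad_and_stack(patch_images_list):
    """Stack (T, F, C, H, W) tensors with differing T by zero-padding T.

    Hypothetical workaround for the RuntimeError above; not the repo's code.
    """
    max_t = max(t.shape[0] for t in patch_images_list)
    padded = []
    for t in patch_images_list:
        if t.shape[0] < max_t:
            # Append all-zero frames so every sample reaches max_t images.
            pad = t.new_zeros(max_t - t.shape[0], *t.shape[1:])
            t = torch.cat([t, pad], dim=0)
        padded.append(t)
    return torch.stack(padded, dim=0)  # (B, max_T, F, C, H, W)
```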
I think it is hard to run them together. Can I run the datasets one by one and still get a good result?
@ZhangYuanhan-AI
@Darren-greenhand
In mimicit_dataset.py, try rewriting:
elif cur_train_id.startswith("SD"):
to
elif cur_train_id.startswith("SD") or cur_train_id.startswith("CGD"):
and see whether it works.
Hi, it doesn't work and the error is the same. So:
I can't train with the CGD or SD dataset when training across several datasets.
I can't set the batch size above 1 when training only the 4 LA datasets.
I can't train the SN dataset which I have processed (I found a strange matching relationship in it; replacing only 00 may not help — I tried that and also tried replacing 00-06, but both failed).
Try this:
--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--mimicit_ic_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json" \
--images_ic_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_ic_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json" \
--mimicit_vt_path="/tf/data/CGD/CGD_instructions.json" \
--images_vt_path="/tf/data/CGD/CGD.json" \
The same problem QWQ.
I tried orjson.loads and images.update() in IPython and they work well.
May I see the training config you use to train Otter?
Can you possibly figure out which dataset is causing this error?
--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" --images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" --train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json"
Does this configuration work?
@ZhangYuanhan-AI Sorry for the late reply, my server crashed the other day. I tried three dataset groups (vanilla, ic, vt (SD+CGD)) and found that:
Work: vanilla, ic, and vt when trained alone; vanilla+ic+SD; vanilla+ic; vanilla+CGD/SD; ic+CGD/SD.
Fail: vanilla+ic+vt/CGD; vanilla+vt; ic+vt.
It is strange: SD and CGD can be trained with vanilla or ic, and SD can be trained with CGD, but vanilla/ic + vt fails.
vanilla:
--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
ic:
--mimicit_ic_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json" \
--images_ic_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_ic_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json" \
vt:
--mimicit_vt_path="/tf/data/SD/SD_instructions.json,/tf/data/CGD/CGD_instructions.json" \
--images_vt_path="/tf/data/SD/SD.json,/tf/data/CGD/CGD.json" \
Ok.
Have you rewritten:
elif cur_train_id.startswith("SD"):
to
elif cur_train_id.startswith("SD") or cur_train_id.startswith("CGD"):
?
@Darren-greenhand I think this would address the problem.
@Luodian @ZhangYuanhan-AI Hi, I'm sure I had rewritten it before my test.
Weird. We will test it tomorrow, stay tuned.
Thanks a lot for your great work and your help QWQ. As a greenhand I learned a lot!
Please try this branch: https://github.com/Luodian/Otter/tree/yhzhang/dev_otter_l
The code runs well in this branch, and we will merge it into main soon.
Hi, I think I made a mistake: I used the old training script when I tried the latest version last night. It still comes up with the same problem when I use:
--mimicit_path="/tf/data/LA/LACONV_instructions.json,/tf/data/LA/LADD_instructions.json" \
--images_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_path="/tf/data/LA/LACONV_train.json,/tf/data/LA/LADD_train.json" \
--mimicit_ic_path="/tf/data/LA/LACR_I2I_instructions.json,/tf/data/LA/LACR_T2T_instructions.json" \
--images_ic_path="/tf/data/LA/LA.json,/tf/data/LA/LA.json" \
--train_config_ic_path="/tf/data/LA/LACR_I2I_train.json,/tf/data/LA/LACR_T2T_train.json" \
--mimicit_vt_path="/tf/data/CGD/CGD_instructions.json,/tf/data/SD/SD_instructions.json" \
--images_vt_path="/tf/data/CGD/CGD.json,/tf/data/SD/SD.json" \
And I got the same problem QWQ. Is this a problem with my server? I can run the loading step well in IPython, though:
Traceback (most recent call last):
File "/tf/Otter/pipeline/train/instruction_following.py", line 656, in <module>
main()
File "/tf/Otter/pipeline/train/instruction_following.py", line 523, in main
mimicit_loaders = get_data(args, image_processor, tokenizer, "mimicit")
File "/tf/Otter/pipeline/train/data.py", line 656, in get_data
return get_dataset_fn(dataset_type)(args, image_processor=image_processor, epoch=epoch, tokenizer=tokenizer)
File "/tf/Otter/pipeline/train/data.py", line 580, in get_mimicit_dataset
unified_dataset = MimicitDataset(args, all_mimicit_path, all_images_path, all_train_config_path, status_list=status)
File "/tf/Otter/pipeline/mimicit_utils/mimicit_dataset.py", line 130, in __init__
self.images.update(orjson.loads(f.read()))
orjson.JSONDecodeError: memory allocation failed: line 1 column 1 (char 0)
Hi, this error might not come from our code, as we can run it smoothly. Maybe the error comes from your CPU memory.
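A `memory allocation failed` from orjson.loads usually means the process ran out of RAM while parsing, not that the file is corrupt. A small diagnostic sketch (the function name and path handling are my own, not the repo's) that reports the file size and falls back to the slower stdlib json parser:

```python
import json
import os

def check_images_json(path):
    """Report on-disk size and try parsing with the stdlib json module.

    Hypothetical diagnostic helper: parsing typically needs several times
    the file size in free RAM, so a success here with plenty of headroom
    suggests the orjson failure was memory pressure.
    """
    size_gb = os.path.getsize(path) / 1e9
    print(f"{path}: {size_gb:.2f} GB on disk")
    with open(path, "rb") as f:
        data = json.loads(f.read())  # stdlib fallback; slower, same result
    print(f"parsed OK: {len(data)} top-level keys")
    return data
```

If this also fails on the training server but works in IPython elsewhere, it points at per-process memory limits on the server rather than at the file itself.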
OK, let me try.
I successfully ran the training code on each dataset individually, and I want to train them all with this config:
It fails. I tried several times and found that once I add the CGD dataset, a strange error comes out, even though CGD can be trained on its own. The error stack:
I followed the code but I can't understand why this happened; the answers I googled said f.read() was empty, yet it works well when loaded alone?!