bytedance / E2STR

The official code for the CVPR 2024 paper: Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
Apache License 2.0

Some problems in reproducing the code #3

Closed. Z-yolo closed this issue 3 weeks ago.

Z-yolo commented 1 month ago

Your work is very inspiring, but I'm having some problems reproducing it. For example, the ViT model weights "ViT-B-4" in Flamingo.py: there doesn't seem to be a pretrained weight for this version at the moment!

prefixRAINSTARsuffix commented 1 month ago

Yes. You have to modify this part in the open-flamingo library by adding a few lines of code. Please refer to the source code of "create_model_and_transforms" in the open-flamingo library.
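
For orientation, the relevant call chain looks roughly like this (a paraphrased sketch based on open_flamingo's public factory code, not the E2STR sources): the vision encoder name is handed straight to open_clip, which is why an unregistered name such as "ViT-B-4" fails.

```python
# Paraphrased sketch of open_flamingo's create_model_and_transforms (heavily simplified):
# the vision tower is built by open_clip, so "ViT-B-4" must be resolvable as an
# open_clip model config.
import open_clip

def create_model_and_transforms(clip_vision_encoder_path,
                                clip_vision_encoder_pretrained,
                                *args, **kwargs):
    # Raises RuntimeError("Model config for ViT-B-4 not found.") unless a
    # matching config has been added to open_clip's model_configs.
    vision_encoder, _, image_processor = open_clip.create_model_and_transforms(
        clip_vision_encoder_path,
        pretrained=clip_vision_encoder_pretrained,
    )
    # ... the real function goes on to wire the vision encoder into the Flamingo model
    return vision_encoder, image_processor
```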

Z-yolo commented 1 month ago

Thank you for your reply. Do you mean that I need to find the relevant create_model_and_transforms code in the open-flamingo library and make some changes so that "ViT-B-4" can use the MAE pretrained weight? I ask because the configuration in your readme.md only mentions modifying the MAE PRETRAIN WEIGHT PATH section.

prefixRAINSTARsuffix commented 1 month ago

Yes. Simply add a configuration (layer_num, width, num_head, etc.) that supports the MAE weight.

Z-yolo commented 1 month ago

Thank you very much for every reply. I added the relevant configuration in Flamingo's library, in create_model_and_transforms ---> open_clip.create_model_and_transforms, e.g.:

    {
        "embed_dim": 768,
        "vision_cfg": {
            "image_size": [32, 128],
            "layers": 12,
            "width": 768,
            "patch_size": 4
        },
        "text_cfg": {
            "context_length": 77,
            "vocab_size": 49408,
            "width": 768,
            "heads": 12,
            "layers": 12
        }
    }

But after many attempts, I always get a RuntimeError saying that the ViT-B-4 model configuration could not be found before the configuration is even loaded for model creation, so I really can't add it!

Could you please provide the details of the modification in your free time?

prefixRAINSTARsuffix commented 1 month ago

You should add the configuration in open_clip/src/open_clip/model_configs. (https://github.com/mlfoundations/open_clip/tree/main/src/open_clip/model_configs). Also refer to this file (https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/factory.py).
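
For illustration, one hypothetical way to register such a config from Python (the field values mirror the config you posted above; the file name ViT-B-4.json follows open_clip's naming convention, and add_model_config may not exist in older installed versions; none of this is taken from the E2STR repo):

```python
# Hypothetical helper (not from the E2STR repo): write a "ViT-B-4" config into the
# *installed* open_clip package's model_configs directory, then check that open_clip
# can resolve the name. Field values mirror the config posted above.
import json
import os

import open_clip

vit_b_4_cfg = {
    "embed_dim": 768,
    "vision_cfg": {"image_size": [32, 128], "layers": 12, "width": 768, "patch_size": 4},
    "text_cfg": {"context_length": 77, "vocab_size": 49408, "width": 768, "heads": 12, "layers": 12},
}

# model_configs lives next to open_clip's factory.py; for an editable install this is
# src/open_clip/model_configs, otherwise the site-packages copy of the package.
config_dir = os.path.join(os.path.dirname(open_clip.__file__), "model_configs")
with open(os.path.join(config_dir, "ViT-B-4.json"), "w") as f:
    json.dump(vit_b_4_cfg, f, indent=2)

# Recent open_clip releases can re-scan configs at runtime; otherwise restart Python
# so the new file is picked up when the registry is rebuilt at import time.
if hasattr(open_clip, "add_model_config"):
    open_clip.add_model_config(config_dir)

print("ViT-B-4" in open_clip.list_models())  # True once the new config is registered
```

A common cause of the "config not found" error is editing a checked-out copy of open_clip while the imported package is a different install, so it is worth confirming which directory open_clip.__file__ points to.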

I have left the company, so I cannot provide the original file that I modified. But feel free to ask if you run into further problems.

Z-yolo commented 1 month ago

Hi author, sorry to bother you again, and thank you very much for your previous reply, which helped me run the training part of the code. However, for the evaluation section I need to prepare the "JSON FILE FOR CHARACTER-WISE POSITION INFORMATION" and the "In-Context Pool". You mention in the readme that the "JSON FILE FOR CHARACTER-WISE POSITION INFORMATION" can be set to None, and that the "In-Context Pool" (i.e., a json file) is built "by randomly sampling data from any target training set". Does that mean we can sample data ourselves as the in-context pool, following the JSON format you gave? Can this be used as an evaluation setup?

prefixRAINSTARsuffix commented 1 month ago

Yes. Please refer to our evaluation settings in Tables 1 and 2. For example, if testing on WordArt, you can simply sample 100 examples from the WordArt training set. You can also sample out-of-domain data, but performance may be lower (see Table 5 in the supplementary material).
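
As a concrete illustration of this sampling step (the file names below are placeholders, not the repo's actual paths; the output must follow the JSON format given in the readme), building an in-domain pool could look like:

```python
# Illustrative sketch only: build an in-context pool by randomly sampling 100 entries
# from a target training set (e.g. WordArt). This assumes the training annotations are
# a JSON list of entries; the output file must follow the in-context pool JSON format
# described in the E2STR readme.
import json
import random

random.seed(0)  # keep the sampled pool reproducible across runs

with open("wordart_train.json") as f:       # hypothetical path to WordArt training annotations
    train_entries = json.load(f)

pool = random.sample(train_entries, k=100)  # 100 in-domain examples, as in Tables 1 and 2

with open("wordart_incontext_pool.json", "w") as f:
    json.dump(pool, f, indent=2)
```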