TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

【TinyLLaVA-3.1B Inference】 #4

Closed: Luo-Z13 closed this issue 6 months ago

Luo-Z13 commented 6 months ago

Hello, how can I perform inference with TinyLLaVA-3.1B? Simply replacing the model_id in the tiny-llava-v1-hf script with TinyLLaVA-3.1B results in an error: 'You are using a model of type tiny_llava_phi to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.'

baichuanzhou commented 6 months ago

The tiny-llava-v1-hf is our legacy model; it is compatible with native Hugging Face, as the weights have been converted to the hf implementation. If you want to load the legacy model, check out our model card.

To use TinyLLaVA-3.1B, we have updated our readme file. The warnings can be ignored, as they do not affect performance (they are hf integration warnings).
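
For reference, a minimal sketch of loading TinyLLaVA-3.1B through the repo's own loader rather than the legacy hf script. The import path and the returned tuple follow the traceback later in this thread; the keyword names (model_path, model_base, model_name) are assumed to follow the LLaVA-style signature and may differ in your checkout.

# Minimal sketch: load TinyLLaVA-3.1B with the repo's loader (not the legacy hf script).
# Assumes a LLaVA-style load_pretrained_model(model_path, model_base, model_name) signature.
from tinyllava.model.builder import load_pretrained_model

model_path = "bczhou/TinyLLaVA-3.1B"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name="TinyLLaVA-3.1B",
)
model.eval()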

Luo-Z13 commented 6 months ago

Thank you very much! However, I encountered the following error during inference. Do I need to build the environment from the TinyLLaVA repo?

[2024-02-24 12:46:22,640] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type siglip_vision_model to instantiate a model of type clip_vision_model. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "/media/dell/data1/TinyLLaVABench/inference_tiny_llava.py", line 133, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(
  File "/media/dell/data1/TinyLLaVABench/tinyllava/model/builder.py", line 127, in load_pretrained_model
    model = TinyLlavaPhiForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
  File "/media/dell/data1/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3594, in from_pretrained
    no_split_modules = model._get_no_split_modules(device_map)
  File "/media/dell/data1/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1690, in _get_no_split_modules
    raise ValueError(
ValueError: TinyLlavaPhiForCausalLM does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.
baichuanzhou commented 6 months ago

Sorry for the delay. We have updated the instructions on how to install the relevant environment and packages here. ❤️
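
For anyone who hits the device_map ValueError shown above before updating: the message itself points at two fixes, either give the model class a _no_split_modules attribute, or avoid device_map='auto' entirely. A hedged workaround sketch for the latter (the class name comes from the traceback, but its exact import path is an assumption):

# Workaround sketch: load without device_map='auto', then move the model to the GPU manually.
# Passing device_map=None sidesteps the _no_split_modules requirement of auto dispatch.
import torch
from tinyllava.model import TinyLlavaPhiForCausalLM  # import path assumed

model = TinyLlavaPhiForCausalLM.from_pretrained(
    "bczhou/TinyLLaVA-3.1B",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map=None,  # do not use 'auto' here
)
model = model.to("cuda")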

Luo-Z13 commented 6 months ago

Thanks for your timely reply. However, I hit another error:

- This IS expected if you are initializing TinyLlavaPhiForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TinyLlavaPhiForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/media/dell/data1/TinyLLaVABench/predict_RS_img_background_generation_tiny_llava.py", line 133, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(
  File "/media/dell/data1/TinyLLaVABench/tinyllava/model/builder.py", line 145, in load_pretrained_model
    vision_tower.load_model()
  File "/media/dell/data1/TinyLLaVABench/tinyllava/model/multimodal_encoder/clip_encoder.py", line 25, in load_model
    self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
  File "/media/dell/data1/miniconda3/envs/glamm/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 206, in from_pretrained
    image_processor_dict, kwargs = cls.get_image_processor_dict(pretrained_model_name_or_path, **kwargs)
  File "/media/dell/data1/miniconda3/envs/glamm/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 335, in get_image_processor_dict
    resolved_image_processor_file = cached_file(
  File "/media/dell/data1/miniconda3/envs/glamm/lib/python3.10/site-packages/transformers/utils/hub.py", line 356, in cached_file
    raise EnvironmentError(
OSError: /media/dell/data1/pretrain_weights/SigLIP does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//media/dell/data1/pretrain_weights/SigLIP/main' for available files.

It seems the preprocessor_config.json is missing from https://huggingface.co/bczhou/TinyLLaVA-3.1B-SigLIP/tree/main.
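
Until the file is added, one hedged stop-gap is to copy the preprocessor_config.json from the base SigLIP checkpoint (google/siglip-so400m-patch14-384) into the local SigLIP weight folder; whether its values exactly match the fine-tuned encoder is an assumption.

# Stop-gap sketch: fetch preprocessor_config.json from the base SigLIP repo and place it
# next to the local fine-tuned SigLIP weights so the image processor can load.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="google/siglip-so400m-patch14-384",
    filename="preprocessor_config.json",
    local_dir="/media/dell/data1/pretrain_weights/SigLIP",  # local vision tower path from the error
)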

Luo-Z13 commented 6 months ago

I suggest posting a link to SigLIP in your repo, like https://github.com/BAAI-DCAI/Bunny/blob/main/README.md#support-models, to make it clearer. Thank you again for your promptness and patience.

baichuanzhou commented 6 months ago

It seems that the build_vision_tower function in tinyllava/model/multimodal_encoder/builder.py identifies the weights you provided as a CLIPVisionTower, which is causing the error. Try renaming the weights directory to ".../siglip", and it should be fixed.
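
For context, a hedged reconstruction of the kind of name-based dispatch described above; the real builder.py and class names may differ. The point is that the directory name, not the weights themselves, decides which tower class is built, and a case-sensitive check would miss a folder named ".../SigLIP".

# Hypothetical sketch of a name-based vision tower dispatch (not the exact repo code).
def pick_vision_tower_class(vision_tower_path: str) -> str:
    if "siglip" in vision_tower_path:  # case-sensitive: ".../SigLIP" would NOT match
        return "SiglipVisionTower"     # hypothetical class name
    return "CLIPVisionTower"

print(pick_vision_tower_class("/media/dell/data1/pretrain_weights/SigLIP"))  # CLIPVisionTower
print(pick_vision_tower_class("/media/dell/data1/pretrain_weights/siglip"))  # SiglipVisionTower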

Luo-Z13 commented 6 months ago

I rebuilt the environment under TinyLLaVA and changed the vision tower path from https://huggingface.co/bczhou/TinyLLaVA-3.1B-SigLIP to https://huggingface.co/google/siglip-so400m-patch14-384, and now inference works :smile:.

Luo-Z13 commented 6 months ago

Can I directly use https://huggingface.co/google/siglip-so400m-patch14-384? It seems the vision encoder was fine-tuned in the paper, yet I can still get results that appear correct. @baichuanzhou

baichuanzhou commented 6 months ago

We found that it was the builder function in tinyllava/model/multimodal_encoder/builder.py that caused your error, and we have fixed it. The vision model we uploaded was fine-tuned by us and differs from Google's version, so to reproduce the results in the paper you should use ours.
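
To combine both points (use the fine-tuned tower, and give it a path the builder recognizes), a hedged sketch: download the uploaded SigLIP weights into a local directory whose name contains "siglip" and point the vision tower path there. The mm_vision_tower config key mentioned below is assumed to follow the LLaVA convention and may be named differently here.

# Sketch: fetch the fine-tuned vision tower into a "siglip"-named folder so the builder
# dispatches correctly and the paper's fine-tuned weights are used.
from huggingface_hub import snapshot_download

local_tower = snapshot_download(
    repo_id="bczhou/TinyLLaVA-3.1B-SigLIP",
    local_dir="/media/dell/data1/pretrain_weights/siglip",  # note the lowercase name
)
# Then point the model's vision tower path (e.g. an LLaVA-style "mm_vision_tower" config
# field, if present) at local_tower instead of google/siglip-so400m-patch14-384.
print(local_tower)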

Luo-Z13 commented 6 months ago

Thank you, it's working properly now.

Blankit commented 6 months ago

I have the same problem (the missing preprocessor_config.json). How can I solve it?

Blankit commented 6 months ago

Regarding https://huggingface.co/bczhou/TinyLLaVA-3.1B-SigLIP: I tried the preprocessor_config.json from https://huggingface.co/bczhou/tiny-llava-v1-hf/blob/main/preprocessor_config.json, but the sizes do not match (see attached screenshot). @baichuanzhou could you help figure this out?

baichuanzhou commented 6 months ago

Which model type are you using? tiny-llava-v1-hf is our legacy model and cannot be loaded with the load_pretrained_model function; see its model card for how to run inference with it.
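
For the legacy model, a minimal sketch of native Hugging Face inference via the image-to-text pipeline; the prompt template and the example image URL below are assumptions, so check the model card for the canonical example.

# Sketch: native Hugging Face inference for the legacy bczhou/tiny-llava-v1-hf model.
from transformers import pipeline

pipe = pipeline("image-to-text", model="bczhou/tiny-llava-v1-hf")
prompt = "USER: <image>\nWhat is shown in this picture?\nASSISTANT:"  # assumed template
out = pipe(
    "https://llava-vl.github.io/static/images/view.jpg",  # example image
    prompt=prompt,
    generate_kwargs={"max_new_tokens": 128},
)
print(out[0]["generated_text"])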

Blankit commented 6 months ago

I ran inference following the Run Inference example, using the files from https://huggingface.co/bczhou/TinyLLaVA-3.1B/tree/main, and it reports the error shown in the attached screenshot.

baichuanzhou commented 6 months ago

Did you download the vision encoder? Please tell me how the weights are stored in your file system (e.g., their respective paths). Thanks.
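
In the meantime, a small hedged checklist based on the issues earlier in this thread (paths are placeholders): the language model folder and the vision tower folder are separate, the tower folder needs a preprocessor_config.json, and its name should contain "siglip" so the builder picks the right class.

# Sanity-check sketch for a local TinyLLaVA-3.1B setup (paths are placeholders).
import os

llm_dir = "/path/to/TinyLLaVA-3.1B"  # language model + projector weights
tower_dir = "/path/to/siglip"        # fine-tuned SigLIP vision tower (lowercase folder name)

for folder, required in [(llm_dir, ["config.json"]),
                         (tower_dir, ["config.json", "preprocessor_config.json"])]:
    for name in required:
        path = os.path.join(folder, name)
        print(("OK  " if os.path.isfile(path) else "MISS") + " " + path)

if "siglip" not in tower_dir:
    print("Warning: vision tower path lacks 'siglip'; the builder may treat it as CLIP.")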