AIDC-AI / Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B
Apache License 2.0

Is it possible to run inference with Ovis 1.6 on a single 4090 GPU? #22

Open Raven625 opened 1 week ago

Raven625 commented 1 week ago

Could anyone advise whether it is possible to run inference with Ovis 1.6 on a single 4090 GPU? After loading, the model consumes approximately 20 GB of VRAM. I attempted inference, but the demo exited due to insufficient memory. Are there any solutions to this issue?

leave-zym commented 1 week ago

Same question here. Is there a quantized way to run inference?
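
One option worth trying is 4-bit quantization via bitsandbytes. The sketch below uses the stock transformers BitsAndBytesConfig; whether Ovis' remote-code model class accepts it is an untested assumption, so treat this as a starting point rather than a confirmed recipe:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # NF4 4-bit quantization: weights are stored in 4 bits and dequantized
    # to bfloat16 for compute, roughly quartering weight memory.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # Untested assumption: the Ovis trust_remote_code model tolerates
    # quantization_config like a stock transformers model.
    model = AutoModelForCausalLM.from_pretrained(
        "AIDC-AI/Ovis1.6-Gemma2-9B",
        quantization_config=quant_config,
        multimodal_max_length=8192,
        trust_remote_code=True,
    )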

thunder95 commented 1 week ago

same issue

FennelFetish commented 1 week ago

You can offload some layers of the visual tokenizer (and, if needed, the LLM) to the CPU using a device map. I use this function to generate the device map:

    def makeDeviceMap(llmGpuLayers: int, visGpuLayers: int) -> dict:
        # Gemma2-9B has 42 decoder layers (0-41) and the SigLIP vision
        # encoder has 27 layers (0-26); clamp the requested GPU counts.
        llmGpuLayers = min(llmGpuLayers, 41)
        visGpuLayers = min(visGpuLayers, 26)

        deviceMap = dict()
        cpu = "cpu"
        cuda = 0

        # Embeddings, final norm, LM head and the visual token embedding
        # stay on the GPU.
        deviceMap["llm.model.embed_tokens"] = cuda
        deviceMap["llm.model.norm"] = cuda
        deviceMap["llm.lm_head.weight"] = cuda
        deviceMap["vte.weight"] = cuda

        # LLM: first and last layers are pinned to the GPU; the first
        # llmGpuLayers layers run on the GPU, the rest on the CPU.
        deviceMap["llm.model.layers.0"] = cuda
        for l in range(1, llmGpuLayers):
            deviceMap[f"llm.model.layers.{l}"] = cuda
        for l in range(llmGpuLayers, 41):
            deviceMap[f"llm.model.layers.{l}"] = cpu
        deviceMap["llm.model.layers.41"] = cuda

        # Visual tokenizer: same scheme. The catch-all entry places
        # everything not listed explicitly on the GPU.
        deviceMap["visual_tokenizer"] = cuda
        deviceMap["visual_tokenizer.backbone.vision_model.encoder.layers.0"] = cuda
        for l in range(1, visGpuLayers):
            deviceMap[f"visual_tokenizer.backbone.vision_model.encoder.layers.{l}"] = cuda
        for l in range(visGpuLayers, 26):
            deviceMap[f"visual_tokenizer.backbone.vision_model.encoder.layers.{l}"] = cpu
        deviceMap["visual_tokenizer.backbone.vision_model.encoder.layers.26"] = cuda

        # print("makeDeviceMap:")
        # for k, v in deviceMap.items():
        #     print(f"{k} -> {v}")

        return deviceMap

It works on my 4090 with arguments 41 and 6 (all LLM layers on the GPU, most of the vision encoder on the CPU):

        # assumes: import torch; from transformers import AutoModelForCausalLM
        self.model = AutoModelForCausalLM.from_pretrained(
            modelPath,
            torch_dtype=torch.bfloat16,
            multimodal_max_length=8192,
            #attn_implementation='flash_attention_2',  # optional, if flash-attn is installed
            device_map=self.makeDeviceMap(41, 6),
            trust_remote_code=True
        )
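
To sanity-check how the split came out, you can tally parameter bytes per device after loading. A generic sketch (the helper name summarizeDevices is mine, not from the repo):

    from collections import Counter
    import torch

    def summarizeDevices(model) -> None:
        # Tally parameter bytes per device to confirm the offload took effect.
        bytesPerDevice = Counter()
        for param in model.parameters():
            bytesPerDevice[str(param.device)] += param.numel() * param.element_size()
        for device, numBytes in sorted(bytesPerDevice.items()):
            print(f"{device}: {numBytes / 1024**3:.2f} GiB")
        if torch.cuda.is_available():
            print(f"cuda allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")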
nmandic78 commented 6 days ago

I ran their HF demo snippet (https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B) on a 3090 without issues. Ubuntu, ~500 MB VRAM in use before loading the model, ~21.7 GB during inference.

And it is very good!
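
For anyone comparing VRAM numbers like the ones above, the allocator's peak can be read from torch directly; a minimal sketch:

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run the demo snippet's generate() call here ...
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    # Note: nvidia-smi reports more (CUDA context plus cached blocks).
    print(f"peak VRAM allocated: {peak_gib:.2f} GiB")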