SiyuanHuang95 / ManipVQA

[IROS24 Oral] ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

Fine-tuning and inference of ManipVQA with fewer GPU resources #3

Closed hyang1974 closed 3 months ago

hyang1974 commented 3 months ago

Very attractive work! We are planning to investigate large vision-language models for industrial robotics use cases, and your work is a very good reference. However, I noticed that your fine-tuning scripts use Slurm and presumably run on a cluster with 8 A100s. Could the fine-tuning and inference be run with fewer GPU resources, e.g., a desktop machine with a single GPU, especially for inference?

By the way, I saw you are a Ph.D. candidate at SJTU; we are close by. We could discuss this offline if you have time. :)

SiyuanHuang95 commented 3 months ago
  1. Yes, the training is conducted on a Slurm cluster with 8 GPUs.
  2. You can reduce the memory requirements with techniques like LoRA, since we use full fine-tuning here. You can refer to https://github.com/Alpha-VLLM/LLaMA2-Accessory; we build on their codebase. See the sketch after this list.
  3. For inference, quantization techniques would also help.
  4. Yes, I am currently a Ph.D. student at SJTU, but I spend most of my time at Shanghai AI Lab, which is located in Xuhui, Shanghai.
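To illustrate the LoRA idea mentioned in point 2, here is a minimal sketch using the Hugging Face `peft` library. This is not the actual ManipVQA training code (which builds on LLaMA2-Accessory and full fine-tuning); the base model name, target modules, and hyperparameters are placeholders you would adapt to your own setup.

```python
# Illustrative LoRA setup (assumed, not the ManipVQA pipeline):
# wrap a base LLM so only small adapter matrices are trained,
# which greatly reduces optimizer/gradient memory.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

With this wrapping, the frozen base weights can stay in half precision while only the adapter parameters receive gradients, which is why a single consumer GPU can often handle fine-tuning a 7B-scale model.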
SiyuanHuang95 commented 3 months ago

Hi, we provide LoRA and quantization usage in another, similar project of ours, A3VLM; we also provide a 7B model there. Check it out if you need it!

SiyuanHuang95 commented 2 months ago

@hyang1974 We have just released a 7B version, which needs only 1 GPU for inference.
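For single-GPU inference, a common approach is 4-bit quantization via `bitsandbytes`. The sketch below is only an assumption-level example with the Hugging Face `transformers` API, not the actual ManipVQA/A3VLM loader; the model name and prompt are placeholders for the released 7B checkpoint.

```python
# Illustrative 4-bit quantized inference (assumed setup, not the official loader):
# a 7B model quantized to 4 bits typically fits on a single ~12-16 GB GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; swap in the released 7B checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the single available GPU
)

prompt = "Describe the graspable part of the hammer."  # placeholder query
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```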