Hello, and thank you for the excellent work!

In the paper it says:

> The first stage takes about 50 hours on a single 4x NVIDIA A100 machine (global batch size 128 with gradient accumulation). For the large-scale GUI data training, we use 112 NVIDIA H100 GPUs and finish the training in about 6 hours (global batch size 448).

Can you please clarify the inference-time hardware requirements? Is there any chance of running this on CPU?

Thanks again!

---

Overall, it's built on LLaVA with slight adaptations (mainly to input image processing), so it's definitely possible to run it on CPU (take Ollama as a reference). I remember a 4-bit quantized LLaVA running very smoothly on my laptop.
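To see why 4-bit quantization makes CPU inference plausible, here is a back-of-envelope sketch of the weight memory at different precisions. It assumes a LLaVA-1.5-7B-scale model (~7e9 parameters, an assumption — the thread does not state the model size) and counts weights only; activations and the KV cache add more on top:

```python
# Rough lower-bound memory estimate for a LLaVA-style model's weights.
# Assumes ~7e9 parameters (LLaVA-1.5-7B scale); real usage is higher
# because activations and the KV cache are not counted here.

def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """GiB needed just to hold the weights at the given precision."""
    return n_params * bits_per_param / 8 / 1024**3

n = 7e9
print(f"fp16:  {weight_memory_gib(n, 16):.1f} GiB")  # ~13.0 GiB
print(f"4-bit: {weight_memory_gib(n, 4):.1f} GiB")   # ~3.3 GiB
```

At ~3.3 GiB for 4-bit weights versus ~13 GiB for fp16, the quantized model fits comfortably in a typical laptop's RAM, which matches the experience reported above.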