maxruby opened this issue 1 month ago
Thanks for pointing that out!
I will take a look!
By the way, I calculated the memory requirements for molmo-7b, and I don't see why our 2x NVIDIA RTX A5000 GPUs (48 GB VRAM total) should not be sufficient for inference with molmo. Any comments on this?
In the fine print they say "When performing inference, expect to add up to an additional 20%".
Explanation: inference needs more than just the model weights themselves; all the token juggling (the KV cache and intermediate activations) needs some room too.
So it's probably a little bit too much, unfortunately.
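For reference, here is a rough back-of-the-envelope version of that estimate. This is only a sketch: the ~7e9 parameter count is taken from the "7b" in the model name, and the actual checkpoint (including the vision tower) may be somewhat larger.

```python
# Back-of-the-envelope VRAM estimate: weights at a given precision,
# plus the "up to an additional 20%" quoted for inference.
params = 7e9  # assumption: taken from the "7b" in the model name
bytes_per_param = {"fp32": 4, "bf16/fp16": 2, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    weights_gib = params * nbytes / 1024**3
    total_gib = weights_gib * 1.2  # +20% inference overhead
    print(f"{dtype:>9}: weights ~{weights_gib:4.1f} GiB, with overhead ~{total_gib:4.1f} GiB")
```

By this arithmetic the fp32 weights alone already exceed a single 24 GiB card, while bf16 (~13 GiB) plus the 20% overhead still fits on one A5000.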
I have tried on 3 PCs now and can't seem to reproduce your error.
Tomorrow I will have access to a CUDA 12.1 system, then I'll try again!
Again, thanks for answering and testing so quickly :)
For your information, our server has CUDA version 12.2, so it would be nice to check that the PyTorch cu121 build is not the issue here.
GPU | Name | Persistence-M | Bus-Id | Disp.A | Volatile Uncorr. ECC | Fan | Temp | Perf | Pwr:Usage/Cap | Memory-Usage | GPU-Util | Compute M. | MIG M. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NVIDIA RTX A5000 | On | 00000000:01:00.0 | Off | Off | 30% | 39C | P8 | 23W / 230W | 12MiB / 24564MiB | 0% | Default | N/A |
1 | NVIDIA RTX A5000 | On | 00000000:61:00.0 | Off | Off | 30% | 35C | P8 | 16W / 230W | 84MiB / 24564MiB | 0% | Default | N/A |
Having said that, I could agree with you that perhaps this is due to insufficient allocation or incorrect use of GPU VRAM when loading the tensors. The one thing I find confusing is that I am able to run inference without trouble on the same system via ollama with the following models:
Model
- arch: llama
- parameters: 70.6B
- quantization: Q4_0
- context length: 131072
- embedding length: 8192

qwen2.5:72b-instruct (47 GB VRAM)

Model
- arch: qwen2
- parameters: 72.7B
- quantization: Q4_K_M
- context length: 32768
- embedding length: 8192
I wonder, then, what GPUs and how much VRAM you are using to run the molmo-7b model here? That information might be useful for comparison, to figure out whether we are simply not meeting the minimum requirements.
Finally, is it also possible that the GPU VRAM workload is not being distributed across both GPUs when loading via PyTorch, due to some missing configuration?
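For example, I would have expected something along these lines to shard the model across both cards. This is only a sketch, assuming the standard transformers/accelerate loading path; the checkpoint id below is my guess, not necessarily the one your script uses.

```python
# Sketch: ask accelerate to shard the model across both A5000s instead of
# putting everything on cuda:0. The checkpoint id is an assumption; replace
# it with whatever the script actually loads.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "allenai/Molmo-7B-D-0924"  # assumed checkpoint

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,            # ~2 bytes per parameter
    device_map="auto",                     # spread layers over GPU 0 and GPU 1
    max_memory={0: "22GiB", 1: "22GiB"},   # leave headroom on each 24 GiB card
)
print(model.hf_device_map)  # shows which device each block ended up on
```

With `device_map="auto"` and a per-GPU `max_memory` cap, accelerate should place layers on both cards rather than trying to fit everything on GPU 0.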
Thanks for sharing your script to run the 4-bit quantized molmo-7b. Unfortunately, I am unable to run it on my server (Ubuntu 22.04 with 2x RTX A5000, 48 GB VRAM total); the error trace is below. I wonder whether you have any suggestions as to what I could do to get it running :) I followed your instructions in the README exactly; the only difference is that I have CUDA version 12.2 rather than 12.1 or 12.4, so I installed torch and torchvision with
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
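In case it helps, this is the kind of sanity check I can run to confirm the cu121 wheel actually sees both GPUs under the 12.2 driver (a minimal sketch; as far as I understand, a 12.2 driver is compatible with the 12.1 runtime bundled in the wheel):

```python
# Verify that the cu121 wheel sees both A5000s under the 12.2 driver.
import torch

print(torch.__version__, torch.version.cuda)  # expect something like "2.x.x 12.1"
print("cuda available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 1024**3:.1f} GiB")
```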
Thanks in advance for any tips.