When I tried to load the llava-qwen72B model, I ran into an out-of-memory error on an H800 GPU. It seems that this framework loads a full copy of the model onto each GPU. How can I shard the model across GPUs so that it doesn't run out of memory?
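For context, the usual fix is to shard the model's layers across GPUs rather than replicating the whole model on each one; Hugging Face Accelerate's `device_map="auto"` does this automatically when loading through `transformers`. Below is a minimal sketch of the underlying idea, greedily assigning consecutive layers to GPUs until each GPU's memory budget is filled. The function name and the layer/budget sizes are illustrative assumptions, not part of the framework in question:

```python
def plan_device_map(layer_sizes_gb, gpu_budget_gb):
    """Greedily assign consecutive layers to GPUs so that no GPU
    exceeds its memory budget. Returns {layer_index: gpu_index}."""
    device_map = {}
    gpu, used = 0, 0.0
    for i, size in enumerate(layer_sizes_gb):
        if size > gpu_budget_gb:
            raise ValueError(f"layer {i} ({size} GB) exceeds the per-GPU budget")
        if used + size > gpu_budget_gb:
            gpu += 1          # this GPU is full; move on to the next one
            used = 0.0
        device_map[i] = gpu
        used += size
    return device_map

# Illustrative numbers: an 80-layer model at ~1.8 GB per layer in fp16
# (~144 GB total) with ~70 GB usable per H800 needs at least 3 GPUs.
plan = plan_device_map([1.8] * 80, 70.0)
num_gpus = max(plan.values()) + 1  # -> 3
```

In practice, if the framework loads the checkpoint via `transformers`, passing `device_map="auto"` (and optionally `max_memory`) to `from_pretrained` performs this kind of partitioning for you instead of placing a full copy on every GPU.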