Closed — ninjasaid2k closed this issue 1 year ago.
@ninjasaid2k With simple acceleration (e.g., half precision and flash attention), Composer requires approximately 28GB of GPU memory for inference.
It would be possible to further reduce memory consumption with, e.g., uint8 quantization and TensorRT, making it runnable on consumer-grade GPUs.
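As a rough illustration of why lower precision helps, here is a back-of-envelope sketch of the memory needed for the weights alone. This is an assumption-laden estimate, not a measurement: activations, attention buffers, and framework overhead are extra, which is why the reported figure is ~28GB rather than the ~10GB the weights themselves would take in half precision.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory to hold just the model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

params = 5e9  # a 5B-parameter model

fp32 = weight_memory_gb(params, 4)  # full precision: 4 bytes/param
fp16 = weight_memory_gb(params, 2)  # half precision: 2 bytes/param
int8 = weight_memory_gb(params, 1)  # uint8 quantization: 1 byte/param

print(f"fp32 weights: {fp32:.0f} GB")  # 20 GB
print(f"fp16 weights: {fp16:.0f} GB")  # 10 GB
print(f"int8 weights: {int8:.0f} GB")  # 5 GB
```

The gap between the 10GB weight footprint at fp16 and the 28GB observed in practice is the runtime overhead (activations, attention workspaces, CUDA context), which quantization and an optimized runtime like TensorRT also help shrink.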
What would the hardware requirements be for the 5B-parameter model? Would it be possible to run it on consumer-grade hardware?