Open magicleo opened 1 year ago
For inference, I think you can use accelerate for that; see https://github.com/huggingface/accelerate/issues/769
@Muennighoff Thank you very much for your reply.
I tried code like below:

```python
model = SentenceTransformerSpecb(
    "bigscience/sgpt-bloom-7b1-msmarco",
    cache_folder="/mnt/storage/agtech/modelCache",
)
accelerator = Accelerator()
model = accelerator.prepare(model)
```
When `model = accelerator.prepare(model)` runs, I get CUDA out of memory; it still only uses the first GPU. Any suggestions?
I have 2 GPUs, each with 24 GB of memory. When I run the code below:

```python
model = SentenceTransformerSpecb(
    "bigscience/sgpt-bloom-7b1-msmarco",
    cache_folder="/mnt/storage/agtech/modelCache",
)
query_embeddings = model.encode(queries, is_query=True)
```

I get an OutOfMemoryError, and it only uses the first GPU. Can it load the model on two GPUs?

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 22.03 GiB total capacity; 21.27 GiB already allocated; 50.94 MiB free; 21.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
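As a side note, the allocator hint at the end of that error message can be applied by setting `PYTORCH_CUDA_ALLOC_CONF` before launching the script. This only mitigates fragmentation in PyTorch's caching allocator; it does not spread the model across the two GPUs, so by itself it is unlikely to fix an OOM from a ~7B-parameter model on one 24 GB card. A sketch (the script name is a placeholder, not from this thread):

```shell
# Mitigate allocator fragmentation, per the error message's suggestion.
# This does NOT shard the model across GPUs; it only changes how the
# caching allocator splits memory blocks.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# python your_encode_script.py  # placeholder for the actual script
```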