Closed netrookiecn closed 6 months ago
Try CPU offload if you have 1 TB of RAM.
I tried ZeRO-3 and I have 1.2 TB of memory, with batch_size set to 1, but it still fails. Do you have any suggestions?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.50 GiB (GPU 3; 79.35 GiB total capacity; 77.08 GiB already allocated; 415.19 MiB free; 77.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@BeiQingLu1113
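Since reserved memory (77.54 GiB) is close to allocated memory here, fragmentation may not be the main issue, but the workaround the error message suggests is easy to try. A minimal sketch, assuming the value 128 (tune it for your workload):

```python
import os

# Ask PyTorch's caching allocator to cap the size of split blocks,
# which can reduce fragmentation when reserved memory greatly exceeds
# allocated memory. Must be set before torch makes any CUDA allocation.
# max_split_size_mb:128 is an illustrative value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

Set this environment variable in the launching shell (or at the very top of the training script) so it is in effect before the first CUDA allocation.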
Solved by changing some parameters.
Could you please paste your parameters here? I have encountered the same problem. Thank you!
Use 1 TB of memory and DeepSpeed ZeRO-3 offload, with micro batch size = 1 and max length < 2048, then experiment from there.
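The settings above can be sketched as a DeepSpeed ZeRO-3 CPU-offload config. All values besides stage 3, micro batch size 1, and CPU offload are assumptions for illustration, not the poster's actual parameters:

```python
# Hypothetical DeepSpeed config dict, passed as ds_config to
# deepspeed.initialize(). Offloads both parameters and optimizer
# state to CPU RAM, which is what makes ~1 TB of host memory useful.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # micro batch size = 1
    "gradient_accumulation_steps": 8,      # assumed; raises effective batch
    "bf16": {"enabled": True},             # assumed precision
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
    },
}
```

Sequence length should also be capped (e.g. tokenizer max_length below 2048, as suggested above), since activation memory grows with sequence length even under ZeRO-3.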
How can full-parameter fine-tuning be done rather than LoRA? When using DeepSpeed ZeRO stage 3, out-of-memory errors still occur.