If the bucket size in the distributed optimizer is not set to fit the model, memory is wasted. The waste can be large yet invisible: for example, a misconfigured gpt-7b pays a 10GB memory penalty on each GPU. I think emitting a warning when bucket utilization is low would address this, and I have submitted my code as a reference.
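To illustrate the idea of a low-utilization warning, here is a minimal sketch. It is not the submitted code and does not use any real optimizer API. It assumes each bucket can be described by a pair of element counts (elements actually used by parameters, elements allocated), and the function names and the 0.5 threshold are hypothetical.

```python
import warnings


def bucket_utilization(used_numel: int, allocated_numel: int) -> float:
    """Fraction of an allocated bucket that actually holds parameters."""
    if allocated_numel <= 0:
        raise ValueError("allocated_numel must be positive")
    return used_numel / allocated_numel


def warn_on_low_utilization(buckets, threshold=0.5):
    """Warn for each (used, allocated) bucket whose utilization is below threshold.

    `buckets` is a hypothetical list of (used_numel, allocated_numel) pairs.
    Returns the indices of the buckets that triggered a warning.
    """
    flagged = []
    for index, (used, allocated) in enumerate(buckets):
        utilization = bucket_utilization(used, allocated)
        if utilization < threshold:
            wasted = allocated - used
            warnings.warn(
                f"bucket {index}: utilization {utilization:.1%} is below "
                f"{threshold:.0%}; {wasted} elements allocated but unused",
                RuntimeWarning,
            )
            flagged.append(index)
    return flagged
```

Calling `warn_on_low_utilization([(90, 100), (10, 100)])` would flag only the second bucket. In practice the threshold would probably need to be configurable, since an acceptable level of waste depends on the model and the bucket size.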