OpenLMLab / LOMO

LOMO: LOw-Memory Optimization

Serious conclusion: LOMO does not significantly reduce GPU memory usage! #72

Open misonsky opened 10 months ago

misonsky commented 10 months ago

Through comparative experiments, we found that what really reduces GPU memory is torch.set_default_dtype(torch.float16) and DeepSpeed. We ran our experiments with LLaMA-7B, using the configuration { "zero_optimization": { "stage": 0 }, "gradient_accumulation_steps": 1, "steps_per_print": 2000, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false } (formatted below) to disable DeepSpeed's ZeRO functionality. When we do not enable mixed precision, the output of the model is still fp16, which is obviously abnormal. After checking, we found that torch.set_default_dtype(torch.float16) is what plays the key role! When we remove both DeepSpeed and torch.set_default_dtype(torch.float16) and use the default configuration on the WiC dataset, training runs out of memory on an 80 GB A100. After adding torch.set_default_dtype(torch.float16) back, memory usage drops to about 35 GB. Under normal mixed-precision training, the author's LOMO still runs out of memory on the 80 GB A100!
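For clarity, here is the same DeepSpeed configuration formatted as a JSON file; ZeRO stage 0 disables ZeRO optimization, so DeepSpeed's partitioning plays no role in the comparison:

```json
{
  "zero_optimization": { "stage": 0 },
  "gradient_accumulation_steps": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
```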

misonsky commented 10 months ago

[screenshot: 7703e18434dd1b04b6dc05ea9ebb5d1]

This is the result of normal fine-tuning.

[screenshot: 7703e18434dd1b04b6dc05ea9ebb5d1]

This is the result of mixed-precision fine-tuning.

The above results use only the LOMO optimizer, without DeepSpeed and without torch.set_default_dtype(torch.float16).

misonsky commented 10 months ago

[screenshot: a4149f9842b6f978821c4588f165daa]

These are the mixed-precision results.

misonsky commented 10 months ago

[screenshot: a9ff8d4b449be169d58ec56bb9ca3b5]

This is the memory usage on the WiC dataset when we use only the LOMO optimizer, with a sequence length of 512 and a batch size of 1.

misonsky commented 10 months ago

[screenshot: 1704306608514]

This is the result after adding torch.set_default_dtype(torch.float16)!
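As a side note on what this single line does (a minimal sketch, independent of LOMO or DeepSpeed): after torch.set_default_dtype(torch.float16), any floating-point tensors and module parameters created afterwards default to fp16, so a model built under this setting occupies roughly half the memory of an fp32 one.

```python
import torch
import torch.nn as nn

layer_fp32 = nn.Linear(4096, 4096)       # created under the default dtype, float32
torch.set_default_dtype(torch.float16)   # change the global default
layer_fp16 = nn.Linear(4096, 4096)       # new parameters are now float16

print(layer_fp32.weight.dtype, layer_fp16.weight.dtype)
# torch.float32 torch.float16
print(layer_fp32.weight.element_size(), layer_fp16.weight.element_size())
# 4 2  -> per-element storage is halved, and so is the weight memory
```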

misonsky commented 10 months ago

The author calls it mixed-precision training, but it is not! How much memory LOMO can save needs to be verified strictly through experiments, rather than attributing the effect of lowering the precision and of DeepSpeed to LOMO! The problem with AdaLomo is similar.

misonsky commented 10 months ago

[screenshot: 403a1ce3849d4dbc14098879ae16523]

This is the normal weight dtype for a mixed-precision model.

KaiLv69 commented 10 months ago

Thank you for expressing your interest in our work. I appreciate the opportunity to address your queries and provide further insights.

  1. On Reducing Memory Usage with LOMO: To understand how LOMO achieves reduced memory usage, it's essential to first understand the memory allocation in a mixed-precision training context. Taking Adam as an example, the GPU memory typically holds a copy of fp16 weights for the forward and backward operations. Additionally, there are fp32 momentum, variance, and an fp32 weight copy (because updating the weights directly in fp16 is not precise enough). After backward, and before updating parameters, fp16 gradients are also stored in memory. Some training frameworks, like DeepSpeed, convert these gradients to fp32 for weight updates. Temporary variables like activation values are also stored in memory.

    In LOMO's approach, as mentioned in our paper, we eliminate the fp32 copies of momentum, variance, and the weight backup. To further minimize memory usage, during the backward pass we immediately update each parameter with its calculated gradient and then set that gradient to None, a technique referred to as fused backward (a simplified sketch is given at the end of this reply). This changes the gradient's memory requirement from O(n) to O(1). Hence, when training with LOMO, the primary memory consumption is due to fp16 weights and activation values. During the weight update in the fused backward, we convert each weight and its gradient to fp32 individually (instead of converting all of the model's parameters), improving the precision of the weight update. The memory cost of this conversion is O(1) instead of O(n).

    • Is LOMO Simply Reducing Precision to Decrease Memory Use? No. The Adam baseline we compared against also employs mixed-precision training, meaning both forward and backward calculations are conducted in fp16.

    • Why Does Using fp32 Training Increase Memory Usage? Firstly, using fp32 to train Large Language Models (LLMs) is not a commonly adopted method, because fp16 computation is significantly faster than fp32 and the loss in precision is minimal (discussed in Scaling Language Models: Methods, Analysis & Insights from Training Gopher). When training with LOMO, memory is primarily occupied by fp16 weights and activation values, so the increase in memory usage when switching the parameters to fp32 is expected.

    • Reasons for Out-of-Memory (OOM) in Your Experiments: Besides weights, when batch size and sequence length are large, memory consumption is predominantly driven by activation values. The memory used by activation values can be reduced through gradient checkpointing, although this is not the primary focus of LOMO.

    • Role of DeepSpeed in LOMO: In LOMO, DeepSpeed mainly facilitates parameter partitioning. Since LOMO has no optimizer states and its gradient memory usage is O(1), DeepSpeed's ZeRO stages 1 and 2 do not significantly affect LOMO's efficiency.

    • Have We Overclaimed LOMO's Memory Usage Reduction? No. Through practical testing, we have indeed been able to fine-tune the full LLaMA-7B on a single 24 GB GPU and LLaMA-65B on 8×24 GB GPUs. The code used for these tests is publicly available.

  2. Downstream Performance in the AdaLomo Paper: We trained LLaMA-1 on the Alpaca-GPT4 dataset and tested it across various benchmarks to validate AdaLomo's efficacy in instruction fine-tuning scenarios. The blog post you referenced trained and tested directly on the GSM8K dataset with LLaMA-2, which isn't directly comparable. You might consider comparing against the findings in "How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources", particularly the results in Table 3. The performance achieved there using AdamW+Alpaca-GPT4 is on par with ours (and not reported as higher). Therefore, there is no intentional understatement of baseline performance in our paper.

I hope these responses adequately address your concerns. Please review this reply carefully, and feel free to ask any further questions. However, I kindly request that you thoroughly understand my replies regarding these specific queries before reiterating them.
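For concreteness, here is a minimal sketch of the fused-backward idea described above, not the repository's actual implementation (which additionally handles loss scaling, gradient clipping, and ZeRO partitioning). It relies on PyTorch's per-parameter gradient hooks (register_post_accumulate_grad_hook, available since PyTorch 2.1): each parameter is updated as soon as its gradient is ready, and the gradient is freed immediately, so the full set of gradients is never held in memory at once.

```python
import torch
import torch.nn as nn

def attach_fused_sgd(model: nn.Module, lr: float = 1e-3):
    """Apply a plain SGD step per parameter inside the backward pass (PyTorch >= 2.1)."""
    def hook(param: torch.Tensor):
        with torch.no_grad():
            # Cast only this parameter's gradient to fp32 for the update (O(1) extra
            # memory), then write the result back in the parameter's own dtype.
            step = param.grad.to(torch.float32).mul_(lr)
            param.data.sub_(step.to(param.dtype))
            param.grad = None  # free the gradient immediately
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

# Usage: after attach_fused_sgd(model), a training step is just
#   loss = compute_loss(model, batch)   # hypothetical helper
#   loss.backward()                     # gradients are consumed and freed on the fly
# with no separate optimizer.step() call.
```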

misonsky commented 10 months ago

I don't question the advantages of fused updates, but fused updates alone fall far short of the effect the author claims, and that is the important point! Mixed-precision training only reduces dynamic memory usage, which also matters! If you run normal mixed-precision training with the Hugging Face trainer you will observe different results. What the author's approach actually does is quantization: although you set it to 16 bits, if you set the default to 8 bits the memory would drop even further. You did not explain any of this in the paper, which also matters.

misonsky commented 10 months ago

Mixed precision only reduces dynamic memory usage. Even as a comparative experiment, I think the author should clearly tell readers which component reduces memory usage. With LLaMA-7B, a batch size of 1, and a sequence length of 512, training already occupies about 63 GB of memory.
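A quick back-of-the-envelope check on the scale involved, assuming LLaMA-7B's roughly 6.7B parameters (whether the remaining memory is weights or activations is exactly the point in dispute):

```python
# Rough size of the weights alone for a ~6.7e9-parameter model.
params = 6.7e9
print(f"fp16 weights: {params * 2 / 2**30:.1f} GiB")  # ~12.5 GiB
print(f"fp32 weights: {params * 4 / 2**30:.1f} GiB")  # ~25.0 GiB
# The rest of the ~63 GB reported above would be gradients, activations,
# and temporary buffers.
```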

misonsky commented 10 months ago

The author should report how much memory can be reduced by the LOMO optimizer alone, rather than relying on the global 16-bit precision setting and DeepSpeed. This is very confusing: if LOMO itself can do it, why use the other techniques? The code is here for anyone to try, and the conclusion is pretty clear.

misonsky commented 10 months ago

I understand the author's idea very well, and the author does not need to explain it further. It is impossible to achieve the effect claimed in the paper by relying solely on the fused-update idea. Remove the global 16-bit setting, use the LOMO optimizer alone, run mixed-precision training with the Hugging Face Trainer, modify its training logic to apply the author's idea, and this is easy to verify!

misonsky commented 10 months ago

Or you should provide a clear comparative experiment showing which component is responsible for reducing memory.

misonsky commented 10 months ago

For example, you should compare deepspeed+fp16+... against deepspeed+fp16+...+LOMO, just as, if you build on BERT, you should report both the results of BERT alone and the results of BERT plus your addition.
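One way to run that comparison (a minimal sketch; train_one_step is a hypothetical helper that does one forward/backward/update with whichever optimizer is passed in): keep the model, data, precision setting, and DeepSpeed config fixed, swap only the optimizer, and record peak GPU memory.

```python
import torch

def peak_memory_gib(model, optimizer, batch, train_one_step):
    """Run one training step and report the peak GPU memory it needed."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    train_one_step(model, optimizer, batch)  # forward + backward (+ update)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30

# e.g. compare AdamW and LOMO under otherwise identical settings:
# print("adamw:", peak_memory_gib(model, adamw_optimizer, batch, train_one_step))
# print("lomo :", peak_memory_gib(model, lomo_optimizer, batch, train_one_step))
```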

misonsky commented 10 months ago

Thank you for your patient responses and the pleasant discussion; we just hope the results are rigorous rather than vague.

misonsky commented 10 months ago

https://github.com/OpenLMLab/LOMO/issues/47#issuecomment-1877255939

KaiLv69 commented 9 months ago

For example, you should compare deepspeed+fp16+... against deepspeed+fp16+...+LOMO, just as, if you build on BERT, you should report both the results of BERT alone and the results of BERT plus your addition.

yes, we compared deepspeed+fp16+adamw and deepspeed+fp16+lomo, didn't we? @misonsky

misonsky commented 9 months ago

I thought the author would listen and make corrections, but in fact it was quite the opposite.

[screenshots: 1704976915771, 1704976972693, 1704976995931, 1704977023061, 1704977051582]

Which experimental results support the author's conclusion? Did the author tell readers which component reduces memory? Is it torch.set_default_dtype(torch.float16), gradient checkpointing, or LOMO? Previous work has found that the computation graph (activations) occupies roughly 50% or more of the memory, but the author's conclusion is exactly the opposite.
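For reference, the gradient-checkpointing knob mentioned above is independent of the optimizer and can be toggled on a Hugging Face model like this (a minimal sketch; the checkpoint path is a placeholder). It trades extra recomputation during backward for much smaller activation memory, which is the part of the budget being debated here.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder path

model.gradient_checkpointing_enable()  # recompute activations during backward
model.config.use_cache = False         # the generation KV cache conflicts with checkpointing

# Re-run the same training step with and without this flag to see how much of
# the memory is activations rather than optimizer state.
```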