-
## Description
Currently we have been unable to reproduce the schedule-free AdamW results with JAX.
There seem to be differences between the optax implementation of schedule-free AdamW and the pyto…
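In case it helps with a side-by-side check, here is a minimal sketch of how the two implementations could be set up with matching hyperparameters. The `optax.contrib.schedule_free_adamw` entry point and its argument names are an assumption on my part (they may differ by optax version); the PyTorch side assumes the `schedulefree` package from facebookresearch/schedule_free.

```python
# Hedged sketch: schedule-free AdamW in optax vs. the PyTorch `schedulefree`
# reference, with matching hyperparameters. The optax.contrib entry point and
# its argument names are an assumption; check the installed optax version.
import jax.numpy as jnp
import optax

# JAX/optax side (assumed API)
tx = optax.contrib.schedule_free_adamw(
    learning_rate=1e-3,
    b1=0.9,
    b2=0.999,
    weight_decay=0.01,
)

params = {"w": jnp.ones((4,))}
opt_state = tx.init(params)
grads = {"w": jnp.full((4,), 0.1)}
updates, opt_state = tx.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
# Evaluation parameters would come from
# optax.contrib.schedule_free_eval_params(opt_state, params)  (assumed helper name).

# PyTorch reference side (facebookresearch/schedule_free), for comparison:
# import schedulefree
# opt = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3,
#                                      betas=(0.9, 0.999), weight_decay=0.01)
# opt.train()  # schedule-free optimizers must be switched between
# opt.eval()   # train/eval modes around training and evaluation
```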
-
This is not a bug report, just a report.
It is not an unsloth error; unsloth was not the cause. The error in the title occurred at the start of training, and accelerate seemed to be affecting it.
https://github.com/hu…
-
When I pass it to the trl library, I get this error:
is not a valid OptimizerNames, please select one of ['adamw_hf', 'adamw_torch', 'adamw_torch_fused', 'adamw_torch_xla', 'adamw_torch_npu_fused', 'adam…
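For what it's worth, the accepted names map to the `optim` field of the training arguments; a minimal sketch with trl's `SFTConfig` (assuming a recent trl version where `SFTConfig` inherits from `TrainingArguments`):

```python
# Hedged sketch: the optimizer is selected by name via the `optim` field of
# the HF training arguments, which trl configs inherit.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="out",
    optim="adamw_torch",   # must be one of the names listed in the error
    learning_rate=2e-5,
)
# trainer = SFTTrainer(model=model, args=config, train_dataset=dataset)
```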
-
## ❓ Questions and Help
It is my understanding that Adam should use more memory than SGD because it keeps track of more per-parameter state. However, when I look at my profiles between Adam and SGD optim…
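One way to check this directly is to sum the bytes held in the optimizer's state; a small sketch (the model here is a hypothetical stand-in) showing that Adam allocates `exp_avg`/`exp_avg_sq` buffers on the first step while plain SGD allocates none:

```python
# Hedged sketch: measure the memory held in optimizer state buffers.
# Adam stores exp_avg and exp_avg_sq per parameter (~2x the parameter memory
# on top of the weights); plain SGD without momentum stores no state.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # hypothetical stand-in model

def optimizer_state_bytes(optimizer):
    total = 0
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total

for opt_cls in (torch.optim.Adam, torch.optim.SGD):
    opt = opt_cls(model.parameters(), lr=1e-3)
    loss = model(torch.randn(8, 1024)).sum()
    loss.backward()
    opt.step()  # state buffers are allocated lazily, on the first step
    print(opt_cls.__name__, optimizer_state_bytes(opt), "bytes of state")
    opt.zero_grad()
```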
-
Hi, thank you for developing and maintaining this awesome library and ecosystem!
I'm not entirely sure, but could it be that the documentation for the `AdamW` optimizer is a bit misleading? If I und…
-
### Describe the bug
I'm trying to implement the recipe https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transducer but the WER and train loss are very high. After runn…
-
I tried the Prodigy optimizer, and it is exactly as you wrote: really slow convergence. I trained the model for 120 epochs and could easily have trained another 60. Now I want to try Ad…
-
### Feature request
Hi, thanks for the library! It would be great if the optimizers could be run on the CPU. For example, I would like to try adamw_8bit to full-finetune an 8B model on a 24GB GPU card (RTX40…
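In case it is a useful interim workaround (an assumption on my part, not a tested recipe for the 8B case), bitsandbytes already exposes a paged 8-bit AdamW whose optimizer state can be evicted to CPU memory via CUDA unified memory:

```python
# Hedged sketch: paged 8-bit AdamW keeps optimizer state in pages that can be
# evicted to CPU memory when GPU memory runs low, which is related to (but not
# the same as) running the optimizer itself on the CPU.
import bitsandbytes as bnb
import torch.nn as nn

model = nn.Linear(4096, 4096).cuda()  # stand-in for the real 8B model
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-5)

# The same optimizer can be selected through the HF Trainer with
# TrainingArguments(optim="paged_adamw_8bit", ...).
```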
-
Docs: https://pytorch.org/docs/2.4/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer
```diff
- optimizer = torch.optim.AdamW(model.parameters(), lr=args.lr)
+ optimizer…
```
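For completeness, the wrapped form described in the linked docs looks roughly like the sketch below; the single-process `gloo` setup is only there so the example runs standalone, real usage would be under `torchrun`.

```python
# Hedged sketch of wrapping AdamW in ZeroRedundancyOptimizer: the base
# optimizer class is passed in, and its state is sharded across the ranks
# of the process group.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer

# Single-process setup just so the sketch runs; real usage is multi-rank.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(16, 16)  # stand-in for the real model
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.AdamW,  # base optimizer whose state is sharded
    lr=1e-3,
)

dist.destroy_process_group()
```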
-
**Describe the bug**
The AdamW implementation (see [here](https://github.com/NVIDIA/apex/blob/a7de60e57f0534266841e1733262601ad76aaa74/csrc/multi_tensor_adam.cu#L333)) does not truly decouple the weight…
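For context (the report is cut off above), "decoupled" refers to the AdamW formulation of Loshchilov & Hutter, where the decay is applied to the weights directly instead of being folded into the gradient that feeds the Adam moments. A toy scalar sketch of the distinction (not the apex kernel itself):

```python
# Hedged toy sketch of coupled (Adam + L2) vs. decoupled (AdamW) weight decay.
import math

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01, decoupled=True):
    """Single scalar Adam/AdamW step illustrating where the decay enters."""
    if not decoupled:
        grad = grad + wd * p          # coupled: decay is folded into the gradient
                                      # and thus rescaled by 1/sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        p = p - lr * wd * p           # decoupled: decay acts on the weights
                                      # directly, bypassing the adaptive denominator
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1)
print(p)
```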