-
Refer to https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.html
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", …
-
I downloaded all the datasets according to the instructions, but when I run `sh scripts/train_scanrefer_mcln_sp.sh`, I encountered this error. It seems like the code you provided has some problems.
```…
-
Apparently there is no reason to use paged Adam instead of the 8-bit version. We should replace it.
Also, full fine-tuning on a single device should use paged Adam instead of AdamW, for better memory usage (a comparison sketch follows below).
F…
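For reference, a minimal sketch of the optimizer variants being compared, assuming the bitsandbytes implementations (`PagedAdamW`, `AdamW8bit`, `PagedAdamW8bit`); the actual recipe config may wire these up differently:
```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096, device="cuda")

# Paged 32-bit AdamW: optimizer state can spill to CPU under memory pressure,
# but each state tensor is still stored in 32-bit precision.
paged_adamw = bnb.optim.PagedAdamW(model.parameters(), lr=2e-5)

# 8-bit AdamW: quantizes the optimizer state, cutting its memory footprint
# roughly 4x relative to the 32-bit state above.
adamw_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)

# Paged 8-bit AdamW combines both: quantized state that can also be paged out.
paged_adamw_8bit = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-5)
```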
-
### 🐛 Describe the bug
I'd like to compile my optimizer but am hitting recompilation issues. I wrap my LR in a tensor, but it seems like beta1/beta2 may need similar treatment (based on type annota…
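For context, a minimal sketch of the tensor-LR workaround, assuming a recent PyTorch where `torch.optim.AdamW` accepts a tensor learning rate; the betas handling hinted at above is only noted in a comment:
```python
import torch

model = torch.nn.Linear(128, 128, device="cuda")

# Wrapping the LR in a 0-dim tensor lets it change without guarding on a new
# Python float, which would otherwise trigger a recompilation.
lr = torch.tensor(1e-3, device="cuda")
opt = torch.optim.AdamW(model.parameters(), lr=lr, capturable=True)

@torch.compile(fullgraph=False)
def opt_step():
    opt.step()

loss = model(torch.randn(4, 128, device="cuda")).sum()
loss.backward()
opt_step()

# Changing the LR in place keeps the compiled step valid.
lr.fill_(5e-4)
# beta1/beta2 are plain floats here; per the type annotations they may need
# the same tensor treatment to avoid recompiles when they change.
```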
-
Using Liger kernels and NEFTune, the system consumes 3 gigabytes of RAM with AdamW; meanwhile, with GrokAdamW, the system uses up the entire 12 gigabytes of RAM in a Google Colab environment and crashe…
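For context, a hedged sketch of how such a comparison might be configured with Hugging Face `TrainingArguments`; the exact script and values from the original report are not shown, so treat these as placeholders:
```python
from transformers import TrainingArguments

# Assumed configuration, not the reporter's exact script: Liger kernels and
# NEFTune stay the same between runs; only the optimizer choice changes.
args_adamw = TrainingArguments(
    output_dir="out-adamw",
    per_device_train_batch_size=1,
    use_liger_kernel=True,
    neftune_noise_alpha=5.0,
    optim="adamw_torch",   # baseline run, ~3 GB reported
)

args_grokadamw = TrainingArguments(
    output_dir="out-grokadamw",
    per_device_train_batch_size=1,
    use_liger_kernel=True,
    neftune_noise_alpha=5.0,
    optim="grokadamw",     # run that reportedly exhausts the 12 GB
)
```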
-
Hi Rui, I saw the AdamW optimizer in OpenFedLLM's paper, but I didn't find it in the code of the repo.
-
### 🐛 Describe the bug
512M parameters
Mostly vanilla LM transformer. FlashAttention 2.4.2, PyTorch 2.2.0. Uses both FA and FlashRotary.
Dtype: bf16
Nvidia A40, single GPU
Unfused: 85 TFLOPS
F…
-
### 🐛 Describe the bug
When training a large model on H100s, we are seeing an illegal memory access error when using AdamW `fused=True`. I suspect the root cause may be related to https://github.co…
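A minimal sketch of the setup in question, with assumed shapes and dtypes (not the original training code):
```python
import torch

# Fused AdamW runs the whole parameter update in a single CUDA kernel.
model = torch.nn.Linear(8192, 8192, device="cuda", dtype=torch.bfloat16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

out = model(torch.randn(16, 8192, device="cuda", dtype=torch.bfloat16))
out.sum().backward()
opt.step()

# Switching to the foreach (unfused) implementation is a common way to check
# whether the illegal memory access is specific to the fused kernel:
# opt = torch.optim.AdamW(model.parameters(), lr=1e-4, foreach=True)
```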
-
I noticed that when training RDM, we need to set args.cosine_lr=True to initialize the scheduler in engine_rdm.py. However, the instructions given in the README default to args.cosine_lr=False. I am …
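A hypothetical illustration of the pattern being described (not the actual `engine_rdm.py` code): when args.cosine_lr is left at False, the cosine schedule is never applied and the learning rate stays constant.
```python
import math

def adjust_learning_rate(optimizer, epoch, args):
    # Hypothetical helper mirroring the described behaviour, not the repo's code.
    if args.cosine_lr:
        lr = args.lr * 0.5 * (1.0 + math.cos(math.pi * epoch / args.epochs))
    else:
        lr = args.lr
    for group in optimizer.param_groups:
        group["lr"] = lr
    return lr
```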
-
I am installing for the RTX 6000 Ada. I wanted to optimize for that system to run FP8. I followed the [commands](https://azure.github.io/MS-AMP/docs/getting-started/installation/#install-from-source) to…
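For reference, a minimal sketch of FP8 training with MS-AMP once installation succeeds, following the usage shown in the MS-AMP docs; the model and optimizer here are placeholders, not the reporter's setup:
```python
import torch
import msamp

model = torch.nn.Linear(4096, 4096, device="cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# msamp.initialize wraps the model and optimizer so that weights, gradients,
# and optimizer states use the low-precision formats selected by opt_level.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()
```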