-
### Problem Description
On the Llama3 70B Proxy Model, training stalls and GPU core dumps. The core dumps are 41 GB per GPU, so I am unable to send them. It is probably easier for you all to reproduce this er…
-
### 🐛 Describe the bug
Invoking a compiled model under a FlopCounterMode context results in a slower compiled model.
If we run our benchmark _before_ the model is instrumented with FlopCounterMode, …
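As a minimal sketch of the reproduction shape (a toy `nn.Linear` stands in for the real model; sizes and iteration counts are illustrative, not from the report):
```python
import time
import torch
from torch.utils.flop_counter import FlopCounterMode

model = torch.nn.Linear(1024, 1024)
compiled = torch.compile(model)
x = torch.randn(64, 1024)

def bench(fn, iters=50):
    # Warm-up so compilation cost is excluded from the timing.
    for _ in range(3):
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

t_before = bench(compiled)            # benchmark before instrumenting
with FlopCounterMode(display=False):  # instrument once, as the report describes
    compiled(x)
t_after = bench(compiled)             # benchmark again after instrumenting
print(f"before: {t_before*1e3:.3f} ms, after: {t_after*1e3:.3f} ms")
```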
-
### Issue type
Performance
### Have you reproduced the bug with TensorFlow Nightly?
No
### Source
source
### TensorFlow version
tf 2.4.1
### Custom code
Yes
### OS platfo…
-
In modeling_qwen2_vl.py https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L343
The attention_mask is set for each frame; when it is not set, the f…
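For context, the code around that line builds a block-diagonal additive mask from `cu_seqlens` so that tokens attend only within their own frame. A minimal standalone sketch of that construction (shapes and the helper name are illustrative, not the file's exact code):
```python
import torch

def build_frame_attention_mask(cu_seqlens: torch.Tensor, seq_length: int,
                               dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Additive mask: 0 inside each [cu_seqlens[i-1], cu_seqlens[i]) block
    (the tokens of one frame), -inf everywhere else."""
    mask = torch.full((1, seq_length, seq_length), torch.finfo(dtype).min, dtype=dtype)
    for i in range(1, len(cu_seqlens)):
        start, end = int(cu_seqlens[i - 1]), int(cu_seqlens[i])
        mask[..., start:end, start:end] = 0
    return mask

# Two frames of 4 and 3 tokens: attention is confined to each 4x4 / 3x3 block.
print(build_frame_attention_mask(torch.tensor([0, 4, 7]), seq_length=7))
```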
-
Hi,
I have been running some tests, and your model reports a FLOP count of around 4 G.
The original paper and the Keras implementation report 3.8 G.
Any idea why the difference?
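For illustration, FLOP counters commonly disagree by a few percent because of convention: some report multiply-accumulates (MACs) as "FLOPs", and they differ on whether elementwise ops, normalization, and the classifier head are counted. A toy sketch with hypothetical layer sizes (not taken from this model):
```python
def conv_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulate count of a single k x k convolution layer."""
    return c_in * c_out * k * k * h_out * w_out

# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 56x56 output map.
macs = conv_macs(64, 128, 3, 56, 56)
print(f"MACs:  {macs:,}")      # some tools report this number as 'FLOPs'
print(f"FLOPs: {2 * macs:,}")  # others count the multiply and the add separately
```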
-
When I use `calculate_flops` to calculate the FLOPs of a local model (e.g. `openai/clip-vit-large-patch14-336` downloaded locally), the result is smaller than the FLOPs calculated manually (using the flops …
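For reference, a minimal sketch of the measurement side, assuming the `calflops` package's `calculate_flops` and only the CLIP vision tower, so that a single `pixel_values` tensor is the whole input (the manual count being compared against is not reproduced here):
```python
from calflops import calculate_flops
from transformers import CLIPVisionModel

# Load the locally downloaded checkpoint (the name stands in for a local path).
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

flops, macs, params = calculate_flops(
    model=model,
    input_shape=(1, 3, 336, 336),  # batch, channels, height, width
)
print(flops, macs, params)
```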
-
I have a trained PointRend model implemented in MMSeg, but when I use get_flops.py to calculate FLOPs, I get the following error.
![image](https://github.com/open-mmlab/mmsegmentation/assets/761…
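For reference, `get_flops.py` wraps mmcv's hook-based complexity counter; a minimal sketch of that path with a stand-in module, assuming the older `mmcv.cnn.get_model_complexity_info` entry point (the PointRend config itself is not reproduced here, and custom ops the hooks don't recognize are a common failure mode):
```python
import torch
from mmcv.cnn import get_model_complexity_info

# Any nn.Module stands in for the MMSeg model here.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 19, 1),  # 19 output classes, as in Cityscapes
)
flops, params = get_model_complexity_info(
    model, (3, 512, 512), as_strings=True, print_per_layer_stat=False
)
print(flops, params)
```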
-
### Problem Description
Even with `NVTE_USE_HIPBLASLT=1` and installing TE inside the container instead of through the `Dockerfile`, as suggested by https://github.com/ROCm/TransformerEngine/issues/…
-
I am trying to train Llama-7B on 8xH100-80GB (HBM3).
### Baseline
When running _without_ activation checkpointing and _without_ fp8, everything runs smoothly:
```yaml
distributed:
fsdp_type:…
-
### Problem Description
Llama3 8B FP8 OOMs at the same batch size as BF16. I need to decrease the batch size to `2` for it not to OOM. At batch size 2, TE FP8 is **21% slower** than torch compile B…
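For reference, the TE FP8 path under comparison looks roughly like this, assuming Transformer Engine's documented `fp8_autocast` API with a single illustrative layer (FP8 execution needs Hopper-class or newer hardware; layer sizes are made up):
```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID: E4M3 for forward tensors, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(2, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```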