NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance and lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0 · 1.61k stars · 256 forks
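The description above mentions FP8 precision. As a rough, library-independent illustration (not TransformerEngine's actual implementation), the following pure-Python sketch rounds a value to the E4M3 variant of FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits, maximum finite value 448) that FP8 training commonly uses:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value.

    Illustrative sketch only: subnormals are flushed toward the
    smallest normal binade, and out-of-range magnitudes saturate
    at the E4M3 maximum finite value of 448.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    # Exponent of the binade containing mag, clamped to E4M3's
    # normal exponent range [-6, 8] (exponent bias 7).
    exp = max(min(math.floor(math.log2(mag)), 8), -6)
    # 3 mantissa bits => 8 representable steps per binade.
    frac = round(mag / 2.0**exp * 8) / 8
    val = sign * frac * 2.0**exp
    # Saturate at the E4M3 maximum finite value.
    return sign * 448.0 if abs(val) > 448.0 else val

# 0.3 is not representable in E4M3; the nearest value is 0.3125.
print(quantize_e4m3(0.3))     # 0.3125
print(quantize_e4m3(1000.0))  # 448.0
```

In the library itself, FP8 is applied inside fused GPU kernels (in the PyTorch API, via the `fp8_autocast` context manager) rather than by per-value rounding like this.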
Issues
- #991 initialize_ub failed: transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported (liuhatry, closed 7 hours ago, 2 comments)
- #990 Calling backward(retain_graph=True) multiple times with TE Layer does not work (kshitij12345, opened 1 day ago, 0 comments)
- #989 Hang when training with MPI with --tp-comm-overlap turned on (lwmlyy, closed 10 hours ago, 1 comment)
- #988 [MoE][PyTorch] Fix size mismatch error in fp8 transpose (Victarry, opened 1 day ago, 0 comments)
- #987 Parallel build with limited resource (phu0ngng, opened 1 day ago, 1 comment)
- #986 [PyTorch] Fixing hang in `initialize_ub()` for multi-node runs after PR901 removal of MPI-dependence (denera, opened 2 days ago, 0 comments)
- #985 Training core dump in megatron-lm with tp-comm-overlap (XLzed, opened 2 days ago, 4 comments)
- #984 [PyTorch] Remove implicit padding and unpadding in `GroupedLinear` (yaox12, opened 2 days ago, 2 comments)
- #983 [Core] Fix bug when selecting tuned RMSNorm kernels (timmoon10, closed 2 days ago, 2 comments)
- #982 [PyTorch] How to restore fp8 amp training from checkpoint (alexdremov, opened 3 days ago, 0 comments)
- #981 Parallel build with limited resource (phu0ngng, closed 1 day ago, 1 comment)
- #980 add compare_update (webber26232, closed 3 days ago, 0 comments)
- #979 [pre-commit.ci] pre-commit suggestions (pre-commit-ci[bot], opened 3 days ago, 0 comments)
- #978 Building wheel error during installation (Drzhishi, opened 4 days ago, 1 comment)
- #977 [WIP] [PyTorch] Support dtype casting in fused adam (Wong4j, opened 4 days ago, 0 comments)
- #976 Get Stuck at Building Wheel (kingformatty, opened 1 week ago, 1 comment)
- #975 Update FE to 1.5.2 and miscellaneous fixes (cyanguwa, closed 4 days ago, 5 comments)
- #974 Add test for building without support for any DL frameworks (timmoon10, opened 1 week ago, 1 comment)
- #973 [PyTorch] Disable THD tests on architectures lower than sm90 (cyanguwa, closed 1 week ago, 1 comment)
- #972 no boost in performance with Ada GPU (saurabh-kataria, opened 1 week ago, 0 comments)
- #971 [PyTorch] Disable THD test on architectures lower than sm90 (cyanguwa, closed 1 week ago, 2 comments)
- #970 [PyTorch] Runtime lookup for CUDA Driver API calls in Userbuffers (denera, closed 2 days ago, 10 comments)
- #969 Script to run pre-commit hooks locally (ksivaman, closed 1 week ago, 0 comments)
- #968 [PyTorch] Fix invalid import in test for context parallelism (timmoon10, closed 1 week ago, 0 comments)
- #967 Replace functools cache with lru_cache (timmoon10, closed 1 week ago, 1 comment)
- #966 Does tp_overlap require the tensor-parallel size to equal the world size? (kuangdao, opened 1 week ago, 4 comments)
- #965 How to cast 16/32-bit to FP8? (mxjmtxrm, opened 1 week ago, 3 comments)
- #964 [JAX] Add experimental internal used THD(packed) fused attn API (zlsh80826, closed 2 days ago, 2 comments)
- #963 [Paddle] Fix forward and backward logic of te.Linear(parallel_mode='column') to adapt DiT of PaddleMIX (yumin066, opened 1 week ago, 4 comments)
- #962 nan loss when training in fp8 with rotary embedding (saurabh-kataria, opened 1 week ago, 2 comments)
- #961 Why is the result of context-parallel DotProductAttention influenced by the random seed? (LitPrice, opened 1 week ago, 0 comments)
- #960 [C/PyTorch] Add support for bottom-right-diagonal causal mask (cyanguwa, closed 2 days ago, 4 comments)
- #959 create_communicator_grouped2 may trigger uninitialized-value memory issue (random crash) when training for more iterations (anderson101866, opened 1 week ago, 1 comment)
- #958 TransformerEngine setup.py fails with Python 3.8 (skydoorkai, closed 1 week ago, 2 comments)
- #957 [Paddle][CUDAGraph] 175B GPT-3 Hybrid-Parallel Training with CUDAGraph (eee4017, closed 2 days ago, 6 comments)
- #956 [Paddle] Add deterministic option in DotProductAttention (Wong4j, opened 1 week ago, 9 comments)
- #955 AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada (saurabh-kataria, closed 1 week ago, 2 comments)
- #954 TransformerEngine build fails with Conda (TeddLi, closed 1 week ago, 4 comments)
- #953 NaN loss issues when switching to the Transformer Engine TransformerLayer from a PyTorch layer (jasonkrone, opened 1 week ago, 0 comments)
- #952 AttnFuncWithCP can use less memory (i4never, opened 2 weeks ago, 0 comments)
- #951 Lower memory usage during AttnFuncWithCP.forward (i4never, opened 2 weeks ago, 2 comments)
- #950 Pure bfloat16 vs. mixed precision bfloat16: what's recommended? (jasonkrone, closed 1 week ago, 1 comment)
- #949 Fix compilation bug with CUDA 12.1 (Edenzzzz, closed 1 week ago, 2 comments)
- #948 How to use FusedRMSNorm? (EthanChen1234, opened 2 weeks ago, 1 comment)
- #947 Why use two streams for context parallel (Edenzzzz, opened 2 weeks ago, 2 comments)
- #946 [TE/JAX] Prototype for New XLA Custom Calls with FFI (phu0ngng, opened 2 weeks ago, 0 comments)
- #945 [PyTorch] Add option to pass kwargs to CUDA graph module (timmoon10, opened 2 weeks ago, 1 comment)
- #944 Expose `rotary_base` as an arg instead of hardcoding (sudhakarsingh27, opened 2 weeks ago, 1 comment)
- #943 Update required CMake version to 3.25 (timmoon10, opened 2 weeks ago, 2 comments)
- #942 Improve JAX build tool (phu0ngng, closed 1 week ago, 2 comments)