NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance and lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0 · 1.61k stars · 256 forks
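The description above mentions FP8 precision. As a rough, library-independent illustration (not TransformerEngine's actual implementation), the following pure-Python sketch rounds a value to the E4M3 variant of FP8 (1 sign bit, 4 exponent bits, 3 mantissa bits, maximum finite value 448) that FP8 training commonly uses:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value.

    Illustrative sketch only: subnormals are flushed toward the
    smallest normal binade, and out-of-range magnitudes saturate
    at the E4M3 maximum finite value of 448.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    # Exponent of the binade containing mag, clamped to E4M3's
    # normal exponent range [-6, 8] (exponent bias 7).
    exp = max(min(math.floor(math.log2(mag)), 8), -6)
    # 3 mantissa bits => 8 representable steps per binade.
    frac = round(mag / 2.0**exp * 8) / 8
    val = sign * frac * 2.0**exp
    # Saturate at the E4M3 maximum finite value.
    return sign * 448.0 if abs(val) > 448.0 else val

# 0.3 is not representable in E4M3; the nearest value is 0.3125.
print(quantize_e4m3(0.3))     # 0.3125
print(quantize_e4m3(1000.0))  # 448.0
```

In the library itself, FP8 is applied inside fused GPU kernels (in the PyTorch API, via the `fp8_autocast` context manager) rather than by per-value rounding like this.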
Issues
- #991 initialize_ub failed: transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:208 in function create_communicator_grouped2: CUDA Error: operation not supported (liuhatry, closed 7 hours ago, 2 comments)
- #990 Calling backward(retain_graph=True) multiple times with TE Layer does not work (kshitij12345, opened 1 day ago, 0 comments)
- #989 Hang when training with MPI with --tp-comm-overlap turned on (lwmlyy, closed 10 hours ago, 1 comment)
- #988 [MoE][PyTorch] Fix size mismatch error in fp8 transpose (Victarry, opened 1 day ago, 0 comments)
- #987 Parallel build with limited resource (phu0ngng, opened 1 day ago, 1 comment)
- #986 [PyTorch] Fixing hang in `initialize_ub()` for multi-node runs after PR901 removal of MPI-dependence (denera, opened 2 days ago, 0 comments)
- #985 Training core dump in megatron-lm with tp-comm-overlap (XLzed, opened 2 days ago, 4 comments)
- #984 [PyTorch] Remove implicit padding and unpadding in `GroupedLinear` (yaox12, opened 2 days ago, 2 comments)
- #983 [Core] Fix bug when selecting tuned RMSNorm kernels (timmoon10, closed 2 days ago, 2 comments)
- #982 [PyTorch] How to restore fp8 amp training from checkpoint (alexdremov, opened 3 days ago, 0 comments)
- #981 Parallel build with limited resource (phu0ngng, closed 1 day ago, 1 comment)
- #980 add compare_update (webber26232, closed 3 days ago, 0 comments)
- #979 [pre-commit.ci] pre-commit suggestions (pre-commit-ci[bot], opened 3 days ago, 0 comments)
- #978 Building wheel error during installation (Drzhishi, opened 4 days ago, 1 comment)
- #977 [WIP] [PyTorch] Support dtype casting in fused adam (Wong4j, opened 4 days ago, 0 comments)
- #976 Get Stuck at Building Wheel (kingformatty, opened 1 week ago, 1 comment)
- #975 Update FE to 1.5.2 and miscellaneous fixes (cyanguwa, closed 4 days ago, 5 comments)
- #974 Add test for building without support for any DL frameworks (timmoon10, opened 1 week ago, 1 comment)
- #973 [PyTorch] Disable THD tests on architectures lower than sm90 (cyanguwa, closed 1 week ago, 1 comment)
- #972 no boost in performance with Ada GPU (saurabh-kataria, opened 1 week ago, 0 comments)
- #971 [PyTorch] Disable THD test on architectures lower than sm90 (cyanguwa, closed 1 week ago, 2 comments)
- #970 [PyTorch] Runtime lookup for CUDA Driver API calls in Userbuffers (denera, closed 2 days ago, 10 comments)
- #969 Script to run pre-commit hooks locally (ksivaman, closed 1 week ago, 0 comments)
- #968 [PyTorch] Fix invalid import in test for context parallelism (timmoon10, closed 1 week ago, 0 comments)
- #967 Replace functools cache with lru_cache (timmoon10, closed 1 week ago, 1 comment)
- #966 Does tp_overlap require the tensor-parallel size to equal the world size? (kuangdao, opened 1 week ago, 4 comments)
- #965 How to cast 16/32-bit to FP8? (mxjmtxrm, opened 1 week ago, 3 comments)
- #964 [JAX] Add experimental internal used THD(packed) fused attn API (zlsh80826, closed 2 days ago, 2 comments)
- #963 [Paddle] Fix forward and backward logic of te.Linear(parallel_mode='column') to adapt DiT of PaddleMIX (yumin066, opened 1 week ago, 4 comments)
- #962 nan loss when training in fp8 with rotary embedding (saurabh-kataria, opened 1 week ago, 2 comments)
- #961 Why is the result of context-parallel DotProductAttention influenced by the random seed? (LitPrice, opened 1 week ago, 0 comments)
- #960 [C/PyTorch] Add support for bottom-right-diagonal causal mask (cyanguwa, closed 2 days ago, 4 comments)
- #959 create_communicator_grouped2 may trigger uninitialized-value memory issue (random crash) when training for more iterations (anderson101866, opened 1 week ago, 1 comment)
- #958 TransformerEngine setup.py fails with Python 3.8 (skydoorkai, closed 1 week ago, 2 comments)
- #957 [Paddle][CUDAGraph] 175B GPT-3 Hybrid-Parallel Training with CUDAGraph (eee4017, closed 2 days ago, 6 comments)
- #956 [Paddle] Add deterministic option in DotProductAttention (Wong4j, opened 1 week ago, 9 comments)
- #955 AssertionError: CublasLt version 12.1.3.x or higher required for FP8 execution on Ada (saurabh-kataria, closed 1 week ago, 2 comments)
- #954 TransformerEngine build fails with Conda (TeddLi, closed 1 week ago, 4 comments)
- #953 NaN loss issues when switching to the Transformer Engine TransformerLayer from a PyTorch layer (jasonkrone, opened 1 week ago, 0 comments)
- #952 AttnFuncWithCP can use less memory (i4never, opened 2 weeks ago, 0 comments)
- #951 Lower memory usage during AttnFuncWithCP.forward (i4never, opened 2 weeks ago, 2 comments)
- #950 Pure bfloat16 vs. mixed precision bfloat16: what's recommended? (jasonkrone, closed 1 week ago, 1 comment)
- #949 Fix compilation bug with CUDA 12.1 (Edenzzzz, closed 1 week ago, 2 comments)
- #948 How to use FusedRMSNorm? (EthanChen1234, opened 2 weeks ago, 1 comment)
- #947 Why use two streams for context parallel (Edenzzzz, opened 2 weeks ago, 2 comments)
- #946 [TE/JAX] Prototype for New XLA Custom Calls with FFI (phu0ngng, opened 2 weeks ago, 0 comments)
- #945 [PyTorch] Add option to pass kwargs to CUDA graph module (timmoon10, opened 2 weeks ago, 1 comment)
- #944 Expose `rotary_base` as an arg instead of hardcoding (sudhakarsingh27, opened 2 weeks ago, 1 comment)
- #943 Update required CMake version to 3.25 (timmoon10, opened 2 weeks ago, 2 comments)
- #942 Improve JAX build tool (phu0ngng, closed 1 week ago, 2 comments)