NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0 · 1.99k stars · 331 forks
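Many of the issues below revolve around the FP8 execution path the description mentions. As a minimal sketch of that workflow (following the quickstart pattern in the linked user guide; assumes an FP8-capable Hopper or Ada GPU and a working Transformer Engine install, with illustrative layer sizes):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A Transformer Engine linear layer, a drop-in analogue of torch.nn.Linear.
model = te.Linear(768, 768, bias=True).cuda()
inp = torch.randn(16, 768, device="cuda")

# FP8 scaling recipe (delayed scaling, as described in the user guide).
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# GEMMs inside this context execute in FP8 on supported GPUs.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

out.sum().backward()
```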
Issues (newest first)
#1346 [Draft] Introduce NVSHMEM based communication API for pytorch · gdengk · opened 3 days ago · 0 comments
#1345 Fix cuda graph capture for grouped gemm · xrennvidia · opened 4 days ago · 1 comment
#1344 How to setup TP Overlap configs · TJ-Solergibert · opened 4 days ago · 0 comments
#1343 [PyTorch] Adding TP overlap support for `te.Linear` with `parallel_mode="column"` · denera · opened 5 days ago · 1 comment
#1342 [Core] Add function to convert container to string · timmoon10 · closed 4 days ago · 1 comment
#1341 [PyTorch] Bugfix for wgrad bulk overlap conflict when dgrad overlap is reduce-scatter · denera · opened 1 week ago · 2 comments
#1340 Update list of CI users · timmoon10 · opened 1 week ago · 1 comment
#1339 [Common] Moved framework agnostic THD kernels to common. · mgoldfarb-nvidia · closed 12 hours ago · 8 comments
#1338 Debug nightly docs · timmoon10 · opened 1 week ago · 1 comment
#1337 [C/JAX] Comm+GEMM Overlap API for TE/JAX · denera · opened 1 week ago · 0 comments
#1336 the max error of moe_permute/unpermute.grad could reach 3.6e+00 · NiuMa-1234 · opened 1 week ago · 1 comment
#1335 [PyTorch] Store module extra state in tensor · timmoon10 · opened 1 week ago · 1 comment
#1334 [PyTorch] Fix multiple calls to saved_tensors in CP attention · ksivaman · closed 1 week ago · 1 comment
#1333 Use `CMAKE_CURRENT_SOURCE_DIR` instead of `CMAKE_SOURCE_DIR` · kmaehashi · closed 1 week ago · 0 comments
#1332 [TP comm overlap unit test] `CUDA Error: misaligned address` error when testing with recent cublas (or pytorch container) · erhoo82 · opened 1 week ago · 3 comments
#1331 [JAX] WIP Added L0 Distributed Tests · phu0ngng · opened 1 week ago · 0 comments
#1330 [Dummy] Testing branch for #1326 · timmoon10 · closed 1 week ago · 0 comments
#1329 [PyTorch] Integration test for Megatron-LM · timmoon10 · closed 5 days ago · 2 comments
#1328 [PyTorch] Fix GQA error message · cyanguwa · closed 5 days ago · 1 comment
#1327 [COMMON/JAX] Support sliding window on THD format · zlsh80826 · opened 2 weeks ago · 2 comments
#1326 [PyTorch] Remove special handling for FP8 params in FP8 recipe infrastructure · timmoon10 · closed 1 week ago · 3 comments
#1325 Fix an int conversion error · jennifgcrl · closed 1 week ago · 1 comment
#1324 Build with uv instead of just pip · jennifgcrl · opened 2 weeks ago · 1 comment
#1323 TransformerEngine doesn't work with uv · jennifgcrl · opened 2 weeks ago · 2 comments
#1322 Convert non-kernel cuda files to cpp · ksivaman · closed 2 weeks ago · 2 comments
#1321 nemo llm pretrain raised Exception: No dot product attention support for the provided inputs · ycchenzheng · closed 2 weeks ago · 1 comment
#1320 [PyTorch] Fix ONNX export bug with operation-based API · timmoon10 · closed 1 week ago · 1 comment
#1319 [TE/JAX] XLA FFI calls for Softmax and FusedAttnBackward · huanghua1994 · closed 1 week ago · 5 comments
#1318 How can I use fp8_gemm to realize the function of "torch.mm()"? · duomicoding · opened 2 weeks ago · 0 comments
#1317 [bug] Failed to load pretrained model with huggingface transformers · kehuanfeng · opened 2 weeks ago · 2 comments
#1316 Update list of CI users · timmoon10 · closed 2 weeks ago · 0 comments
#1315 [C] Normalization Refactor + Adding CUDNN backend · phu0ngng · opened 2 weeks ago · 0 comments
#1314 [C] Separating cudnn common utils from fused_attn · phu0ngng · closed 2 weeks ago · 2 comments
#1313 [JAX] Added prepare phase for the FusedAttnForwardFFI · phu0ngng · closed 2 weeks ago · 2 comments
#1312 Linear does not support TP comm overlap for Column Parallel mode · parthmannan · opened 2 weeks ago · 0 comments
#1311 TP communication overlap: enable the overlap between GEMM chunk at Ho… · erhoo82 · opened 3 weeks ago · 1 comment
#1310 [TE/JAX] XLA FFI calls for three cast transpose functions · huanghua1994 · closed 2 weeks ago · 1 comment
#1309 about PyTorch 2.5 install te · klhhhhh · closed 2 weeks ago · 7 comments
#1308 Improving communication overlap for the case of multi kernel queue usage · youngeunkwon0405 · opened 3 weeks ago · 10 comments
#1307 [JAX] Collective GEMM custom op with `nvte_cublas_gemm` (no comm. overlap) · denera · opened 3 weeks ago · 2 comments
#1306 `intra_domain_ranks` is not defined in one of the execution paths. · erhoo82 · closed 3 weeks ago · 1 comment
#1305 [PyTorch] Missing intra-domain ranks list when initializing Userbuffers with data parallelism · denera · closed 3 weeks ago · 0 comments
#1304 [JAX] Fix for Disable FusedAttn with FFI by default · phu0ngng · closed 3 weeks ago · 1 comment
#1303 [QUESTION] Does TP overlap support variable sequence length? · wplf · closed 2 weeks ago · 5 comments
#1302 Update cudnn-frontend to 1.8.0 · cyanguwa · closed 3 weeks ago · 0 comments
#1301 [JAX] Add back the xla deterministic flag · zlsh80826 · closed 2 weeks ago · 4 comments
#1300 [PyTorch] Add heuristics for intializing FP8 params · timmoon10 · opened 3 weeks ago · 2 comments
#1299 Offloading example · sanandaraj5597 · opened 3 weeks ago · 0 comments
#1298 [TE/JAX] Disable FusedAttn with FFI by default · phu0ngng · closed 3 weeks ago · 1 comment
#1297 [PyTorch] Make FP8 MHA work with RoPE when CP is on · yaox12 · closed 3 weeks ago · 4 comments