NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0 · 1.99k stars · 331 forks
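Many of the issues below revolve around the FP8 execution path the description mentions. As a minimal sketch of that workflow (following the quickstart pattern in the linked user guide; assumes an FP8-capable Hopper or Ada GPU and a working Transformer Engine install, with illustrative layer sizes):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A Transformer Engine linear layer, a drop-in analogue of torch.nn.Linear.
model = te.Linear(768, 768, bias=True).cuda()
inp = torch.randn(16, 768, device="cuda")

# FP8 scaling recipe (delayed scaling, as described in the user guide).
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# GEMMs inside this context execute in FP8 on supported GPUs.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

out.sum().backward()
```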
Issues (newest first)
#1346 [Draft] Introduce NVSHMEM based communication API for pytorch · gdengk · opened 3 days ago · 0 comments
#1345 Fix cuda graph capture for grouped gemm · xrennvidia · opened 4 days ago · 1 comment
#1344 How to setup TP Overlap configs · TJ-Solergibert · opened 4 days ago · 0 comments
#1343 [PyTorch] Adding TP overlap support for `te.Linear` with `parallel_mode="column"` · denera · opened 5 days ago · 1 comment
#1342 [Core] Add function to convert container to string · timmoon10 · closed 4 days ago · 1 comment
#1341 [PyTorch] Bugfix for wgrad bulk overlap conflict when dgrad overlap is reduce-scatter · denera · opened 1 week ago · 2 comments
#1340 Update list of CI users · timmoon10 · opened 1 week ago · 1 comment
#1339 [Common] Moved framework agnostic THD kernels to common. · mgoldfarb-nvidia · closed 12 hours ago · 8 comments
#1338 Debug nightly docs · timmoon10 · opened 1 week ago · 1 comment
#1337 [C/JAX] Comm+GEMM Overlap API for TE/JAX · denera · opened 1 week ago · 0 comments
#1336 the max error of moe_permute/unpermute.grad could reach 3.6e+00 · NiuMa-1234 · opened 1 week ago · 1 comment
#1335 [PyTorch] Store module extra state in tensor · timmoon10 · opened 1 week ago · 1 comment
#1334 [PyTorch] Fix multiple calls to saved_tensors in CP attention · ksivaman · closed 1 week ago · 1 comment
#1333 Use `CMAKE_CURRENT_SOURCE_DIR` instead of `CMAKE_SOURCE_DIR` · kmaehashi · closed 1 week ago · 0 comments
#1332 [TP comm overlap unit test] `CUDA Error: misaligned address` error when testing with recent cublas (or pytorch container) · erhoo82 · opened 1 week ago · 3 comments
#1331 [JAX] WIP Added L0 Distributed Tests · phu0ngng · opened 1 week ago · 0 comments
#1330 [Dummy] Testing branch for #1326 · timmoon10 · closed 1 week ago · 0 comments
#1329 [PyTorch] Integration test for Megatron-LM · timmoon10 · closed 5 days ago · 2 comments
#1328 [PyTorch] Fix GQA error message · cyanguwa · closed 5 days ago · 1 comment
#1327 [COMMON/JAX] Support sliding window on THD format · zlsh80826 · opened 2 weeks ago · 2 comments
#1326 [PyTorch] Remove special handling for FP8 params in FP8 recipe infrastructure · timmoon10 · closed 1 week ago · 3 comments
#1325 Fix an int conversion error · jennifgcrl · closed 1 week ago · 1 comment
#1324 Build with uv instead of just pip · jennifgcrl · opened 2 weeks ago · 1 comment
#1323 TransformerEngine doesn't work with uv · jennifgcrl · opened 2 weeks ago · 2 comments
#1322 Convert non-kernel cuda files to cpp · ksivaman · closed 2 weeks ago · 2 comments
#1321 nemo llm pretrain raised Exception: No dot product attention support for the provided inputs · ycchenzheng · closed 2 weeks ago · 1 comment
#1320 [PyTorch] Fix ONNX export bug with operation-based API · timmoon10 · closed 1 week ago · 1 comment
#1319 [TE/JAX] XLA FFI calls for Softmax and FusedAttnBackward · huanghua1994 · closed 1 week ago · 5 comments
#1318 How can I use fp8_gemm to realize the function of "torch.mm()"? · duomicoding · opened 2 weeks ago · 0 comments
#1317 [bug] Failed to load pretrained model with huggingface transformers · kehuanfeng · opened 2 weeks ago · 2 comments
#1316 Update list of CI users · timmoon10 · closed 2 weeks ago · 0 comments
#1315 [C] Normalization Refactor + Adding CUDNN backend · phu0ngng · opened 2 weeks ago · 0 comments
#1314 [C] Separating cudnn common utils from fused_attn · phu0ngng · closed 2 weeks ago · 2 comments
#1313 [JAX] Added prepare phase for the FusedAttnForwardFFI · phu0ngng · closed 2 weeks ago · 2 comments
#1312 Linear does not support TP comm overlap for Column Parallel mode · parthmannan · opened 2 weeks ago · 0 comments
#1311 TP communication overlap: enable the overlap between GEMM chunk at Ho… · erhoo82 · opened 3 weeks ago · 1 comment
#1310 [TE/JAX] XLA FFI calls for three cast transpose functions · huanghua1994 · closed 2 weeks ago · 1 comment
#1309 about PyTorch 2.5 install te · klhhhhh · closed 2 weeks ago · 7 comments
#1308 Improving communication overlap for the case of multi kernel queue usage · youngeunkwon0405 · opened 3 weeks ago · 10 comments
#1307 [JAX] Collective GEMM custom op with `nvte_cublas_gemm` (no comm. overlap) · denera · opened 3 weeks ago · 2 comments
#1306 `intra_domain_ranks` is not defined in one of the execution paths. · erhoo82 · closed 3 weeks ago · 1 comment
#1305 [PyTorch] Missing intra-domain ranks list when initializing Userbuffers with data parallelism · denera · closed 3 weeks ago · 0 comments
#1304 [JAX] Fix for Disable FusedAttn with FFI by default · phu0ngng · closed 3 weeks ago · 1 comment
#1303 [QUESTION] Does TP overlap support variable sequence length? · wplf · closed 2 weeks ago · 5 comments
#1302 Update cudnn-frontend to 1.8.0 · cyanguwa · closed 3 weeks ago · 0 comments
#1301 [JAX] Add back the xla deterministic flag · zlsh80826 · closed 2 weeks ago · 4 comments
#1300 [PyTorch] Add heuristics for intializing FP8 params · timmoon10 · opened 3 weeks ago · 2 comments
#1299 Offloading example · sanandaraj5597 · opened 3 weeks ago · 0 comments
#1298 [TE/JAX] Disable FusedAttn with FFI by default · phu0ngng · closed 3 weeks ago · 1 comment
#1297 [PyTorch] Make FP8 MHA work with RoPE when CP is on · yaox12 · closed 3 weeks ago · 4 comments