**Open** · jasonkrone opened this issue 4 months ago
**Summary**
I'm hitting a NaN loss when I use the TE `TransformerLayer` in place of a PyTorch transformer layer I wrote.
**Details**
I'm using the `nvcr.io/nvidia/pytorch:24.04-py3` docker container. I train with PyTorch FSDP and bfloat16 mixed precision.
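For context, an FSDP bf16 mixed-precision setup along these lines is what's described above. This is a sketch only; the exact policy values are an assumption, since the issue does not include the training script:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_bf16_fsdp(model: nn.Module) -> FSDP:
    """Wrap a module in FSDP with a bf16 mixed-precision policy (illustrative)."""
    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,   # flattened parameters / compute in bf16
        reduce_dtype=torch.bfloat16,  # gradient reduction in bf16 (assumed here)
        buffer_dtype=torch.bfloat16,  # buffers (e.g. precomputed RoPE tables) in bf16
    )
    return FSDP(model, mixed_precision=bf16_policy)
```

If the instability turns out to be precision-related, keeping `reduce_dtype=torch.float32` is a common mitigation worth trying, at some communication cost.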
**Question**
Has the TransformerEngine team trained a model with the `TELlamaDecoderLayer` to ensure that everything works as expected? If so, could you share that example, as my use case is very similar?
**Code**
Here's the code I wrote to wrap the `TransformerLayer` so that it applies RoPE embeddings. This is the class I swapped into my model.
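(The snippet itself did not survive extraction. Below is a hedged reconstruction of what such a wrapper typically looks like, modeled on TE's public API; the class and attribute names are assumptions, not the reporter's actual code.)

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.attention import RotaryPositionEmbedding

class TETransformerLayerWithRoPE(torch.nn.Module):
    """Hypothetical wrapper: a TE TransformerLayer that applies RoPE.

    A sketch only -- the issue's actual wrapper code is not reproduced here.
    """

    def __init__(self, hidden_size: int, ffn_hidden_size: int,
                 num_attention_heads: int, max_seq_len: int, **te_kwargs):
        super().__init__()
        self.layer = te.TransformerLayer(
            hidden_size=hidden_size,
            ffn_hidden_size=ffn_hidden_size,
            num_attention_heads=num_attention_heads,
            **te_kwargs,
        )
        # Precompute the rotary frequency table once for the max sequence length.
        rope = RotaryPositionEmbedding(hidden_size // num_attention_heads)
        self.register_buffer("rope_freqs", rope(max_seq_len=max_seq_len),
                             persistent=False)

    def forward(self, hidden_states, attention_mask=None):
        # TE's TransformerLayer accepts the precomputed frequencies
        # via the `rotary_pos_emb` keyword argument.
        return self.layer(
            hidden_states,
            attention_mask=attention_mask,
            rotary_pos_emb=self.rope_freqs,
        )
```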
In addition, here are the kwargs I pass to the transformer layer.
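(The actual values were not captured here either; the following is an illustrative set of `TransformerLayer` constructor kwargs for a Llama-style model, an assumption rather than the reporter's config.)

```python
import torch

# Illustrative only: plausible kwargs for a Llama-7B-style model,
# not the reporter's actual values.
te_layer_kwargs = dict(
    layernorm_epsilon=1e-5,
    hidden_dropout=0.0,
    attention_dropout=0.0,
    normalization="RMSNorm",
    activation="swiglu",
    attn_input_format="bshd",       # batch-first hidden-state layout
    self_attn_mask_type="causal",
    num_gqa_groups=32,              # equals num_attention_heads when not using GQA
    params_dtype=torch.bfloat16,
)
```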
**Learning Curve**
See the attached learning curve, which shows the loss going NaN around step 350.

---

Hi @jasonkrone. Did you compare this loss curve with the one from your PyTorch implementation? The chart only shows one curve, I believe. The first step in troubleshooting would be to pass the same input to your Transformer layer implementation and to the TE implementation (with dropout set to 0 for an apples-to-apples comparison) and confirm that the outputs, both forward and backward, match. They will not match exactly due to numerical differences, but they should be very close.
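A minimal sketch of that parity check might look like the following. Names such as `te_layer` and `my_layer` are placeholders for the two layer instances, assumed to be constructed with matching shapes, shared weights, and dropout disabled:

```python
import torch

torch.manual_seed(0)

# Same random input to both implementations; dropout must be 0 in both
# so the comparison is apples-to-apples.
x = torch.randn(4, 128, 4096, device="cuda", dtype=torch.bfloat16,
                requires_grad=True)
x_ref = x.detach().clone().requires_grad_(True)

out_te = te_layer(x)       # TE TransformerLayer wrapper (placeholder name)
out_ref = my_layer(x_ref)  # hand-written PyTorch layer (placeholder name)

# Forward check: close but not bit-identical is the expectation.
torch.testing.assert_close(out_te, out_ref, rtol=1e-2, atol=1e-2)

# Backward check: feed the same upstream gradient and compare input grads.
grad = torch.randn_like(out_te)
out_te.backward(grad)
out_ref.backward(grad)
torch.testing.assert_close(x.grad, x_ref.grad, rtol=1e-2, atol=1e-2)
```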