Closed Cheny1m closed 9 months ago
I resolved this issue myself. I suspect it was caused by a series of compatibility problems between the system, Python, and various extension packages.
Initially I used PyTorch, Megatron-DeepSpeed and Transformer, primarily for MoE model training and inference. I originally installed APEX from source, which eventually led to the problem. After that I switched to the container installation method recommended by APEX, which avoided the issue.
You can find the original environment information in the problem description. Here is some additional information from the torch printout:
Here's a printout of my torch info after using docker:
docker pull nvcr.io/nvidia/pytorch:24.01-py3
docker run --gpus all --name xxx -itd -v /dev/shm:/dev/shm -v /xxx/your/Project/:/workspace nvcr.io/nvidia/pytorch:24.01-py3 /bin/bash
After that you can install your own dependencies and run the program!
**Describe the Bug**
When the program reaches `apex/normalization/fused_layer_norm.py`, I get the error `memory format option is only supported by strided tensors`.

**Minimal Steps/Code to Reproduce the Bug**
I then checked the properties of the incoming tensor:
![image](https://github.com/NVIDIA/apex/assets/65207305/4f04579d-031d-4829-897c-cf940625933a)
![image](https://github.com/NVIDIA/apex/assets/65207305/8d850f83-aa3d-4dca-9ea7-6806619fafd8)
All tensors are strided, so I don't understand why this error is reported.

**Expected Behavior**
This is the stack trace that appears when the error is reported:
How should I solve this problem?
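
For anyone who wants to repeat the tensor property check described above, here is a minimal sketch. The helper name `describe_tensor` is hypothetical and not part of APEX or PyTorch; it only reads standard tensor attributes (`layout`, `stride()`, `is_contiguous()`). The example tensor is a placeholder, not the actual input from my run.

```python
def describe_tensor(t):
    """Collect the properties relevant to a 'strided tensors' error.

    `describe_tensor` is a hypothetical helper for illustration only.
    """
    return {
        "shape": tuple(t.shape),
        "dtype": str(t.dtype),
        "device": str(t.device),
        "layout": str(t.layout),  # a dense tensor reports torch.strided
        "stride": tuple(t.stride()),
        "is_contiguous": bool(t.is_contiguous()),
    }


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # the demo is skipped when PyTorch is not installed

    if torch is not None:
        x = torch.randn(4, 8)  # placeholder tensor, not the real input
        for name, value in describe_tensor(x).items():
            print(f"{name}: {value}")
```

Calling it on the tensor right before it enters the fused layer norm shows whether the layout is really `torch.strided`. In my case every tensor passed this check, which is why I ended up suspecting the environment rather than the inputs.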
**Environment**
- OS: Linux Ubuntu 18.04.5 LTS
- GPU: 2 x NVIDIA V100
- Python: 3.9 (conda)
- CUDA: 11.8
- PyTorch: 2.2.0
- APEX: via C++/CUDA
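
Since the root cause appears to have been a version mismatch, it may help to print the same details from both the original environment and the container and compare them. The sketch below is only an illustration: `collect_env` is a hypothetical helper name, and it reports just a handful of fields.

```python
import platform
import sys


def collect_env():
    """Gather basic version info (hypothetical helper, for illustration)."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    try:
        import torch
    except Exception:  # PyTorch missing or failing to load
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds
    info["torch_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()

    try:
        import apex  # noqa: F401
        info["apex_importable"] = True
    except Exception:  # APEX missing or its build is broken
        info["apex_importable"] = False
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch also ships a more complete report via `python -m torch.utils.collect_env`, which is usually the better thing to paste into an issue.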