Open · samblouir opened this issue 1 year ago
I ran into the same problem using CUDA 11.3 and cuDNN 8.2.0, installing by following the documentation. When I run `python -m alpa.test_install`, the second test (pipeshard) gives exactly the same error, but I don't see anything related to bfloat16 in the test code.
Bumping this in the hope that someone has managed to resolve it. I'm hitting a similar problem in the tests with CUDA 11.8 (cuDNN 8).
Please describe the bug
Hi, using bfloat16, whether by initializing an embedding layer in bfloat16 or by casting a float32 array to bfloat16, causes a double-free exception and a crash. Sometimes it only prints that there was a segmentation fault or that a worker died, without a more verbose explanation; the exact output can depend on the method passed to alpa's parallelize function. This happens with both Shard Parallel and Pipeshard Parallel, although it used to work in an earlier version of Alpa.
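For reference, here is a minimal sketch of the kind of pattern that triggers the crash for me. This is not the attached test file; the shapes, names, and cluster setup below are illustrative only.

```python
# Illustrative sketch only (not the attached test file): a parallelized step
# that casts float32 inputs to bfloat16, which is the pattern that crashes.
import alpa
import jax.numpy as jnp

alpa.init(cluster="ray")  # assumes a Ray cluster started by the SLURM script

@alpa.parallelize  # default Shard Parallel; Pipeshard Parallel fails the same way for me
def forward(params, batch):
    x = batch["x"].astype(jnp.bfloat16)   # commenting out this cast avoids the crash
    w = params["w"].astype(jnp.bfloat16)
    return jnp.sum(x @ w)

params = {"w": jnp.ones((128, 128), dtype=jnp.float32)}
batch = {"x": jnp.ones((8, 128), dtype=jnp.float32)}
print(forward(params, batch))
```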
Please describe the expected behavior
The data is used as bfloat16, or cast from float32 to bfloat16, without issue.
System information and environment
To Reproduce
Steps to reproduce the behavior (I am launching this via a SLURM script):
Run this py file with Ray (it's a modified version of one of the included test files).
If you comment out line 52, it will work again.
Screenshots
Code snippet to reproduce the problem
Additional information
Please let me know if any more information would help diagnose the problem. This doesn't seem to happen when casting a float32 to float16.