NVIDIA / MinkowskiEngine

Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
https://nvidia.github.io/MinkowskiEngine

Does ME 0.5.4 support A100? #445

Closed wangfudong closed 2 years ago

wangfudong commented 2 years ago

Sorry to bother you again after reading some similar ME issues, such as issue #330, issue #350, and issue #52.

My problem: I built ME 0.5.4 in an Anaconda virtual environment with pytorch=1.7.1 and cudatoolkit=11.0 or 10.2 (with system CUDA 11.0 or 10.2, respectively).

System: Ubuntu 18.04, NVIDIA driver 450.80.2 (or 450.102.04), gcc 7.5.0.

With the environment above, ME 0.5.4 runs successfully (including ME.Conv, ME.BN, ME.ReLU, ME.interpolation, and loss.backward) on T4 and P40 GPUs, but fails on an A100 with the error 'cudaErrorNoKernelImageForDevice no kernel image is available for execution on the device'. The detailed error output:

{"@timestamp":"2022-02-22 00:14:15.936","@message":" sparse_tensor = ME.SparseTensor(code, coord_sparse)"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" File \"/opt/conda/envs/3dr3_cu113/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiSparseTensor.py\", line 275, in __init__"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" coordinates, features, coordinate_map_key = self.initialize_coordinates("}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" File \"/opt/conda/envs/3dr3_cu113/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiSparseTensor.py\", line 304, in initialize_coordinates"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" ) = self._manager.insert_and_map(coordinates, *coordinate_map_key.get_key())"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" File \"/opt/conda/envs/3dr3_cu113/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiCoordinateManager.py\", line 179, in insert_and_map"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":" return self._manager.insert_and_map(coordinates, tensor_stride, string_id)"}
{"@timestamp":"2022-02-22 00:14:15.936","@message":"RuntimeError: CUDA error encountered at: /tmp/pip-req-build-16c08htu/src/3rdparty/concurrent_unordered_map.cuh:595: 209 cudaErrorNoKernelImageForDevice no kernel image is available for execution on the device"}

At first, I guessed it might be caused by an incompatibility between PyTorch 1.7 and the compute capability of the A100. However, pytorch-1.7.1 + cuda-11.0 + driver-450.80.2 does support the A100 (I ran a simple network without ME and it passed successfully).

Have you tested ME on an A100, and does it work well? Thank you very much~
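The error above generally means that the compiled extension contains no code for the GPU's compute capability (8.0 for the A100). A minimal sketch of that compatibility rule follows, assuming arch names in the `sm_XY` / `compute_XY` form that `torch.cuda.get_arch_list()` returns. The helper name `has_kernel_image` and the example lists are hypothetical and are not taken from this thread. Note that `torch.cuda.get_arch_list()` describes the PyTorch build only, not the ME extension, which is why a plain PyTorch network can pass while ME fails.

```python
def has_kernel_image(arch_list, capability):
    """Return True if any entry in arch_list can run on a GPU of this capability.

    arch_list:  strings such as "sm_75" (binary code) or "compute_75" (PTX).
    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip entries that do not look like sm_XY / compute_XY
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        # Binary code runs only within the same major version, on an equal or newer minor.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    # Illustrative build that targets only pre-Ampere GPUs: it runs on a T4 (7.5)
    # but has no kernel image for an A100 (8.0).
    pre_ampere = ["sm_60", "sm_61", "sm_70", "sm_75"]
    print(has_kernel_image(pre_ampere, (7, 5)))
    print(has_kernel_image(pre_ampere, (8, 0)))
```

With PyTorch installed, the same check can be applied to the PyTorch build itself via `has_kernel_image(torch.cuda.get_arch_list(), torch.cuda.get_device_capability())`.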

wangfudong commented 2 years ago

The problem has been solved by using docker

RozDavid commented 1 year ago

As a note for others who find this issue like me but don't want to use docker for their training: we just have to add export TORCH_CUDA_ARCH_LIST="6.0 6.1 6.2 7.0 7.2 7.5 8.0 8.6" to our script before pip installing ME.
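The same idea can be sketched from Python for cases where the build is driven by a script rather than a shell. This is only an illustration under stated assumptions: the helper names `build_env` and `install_me` are hypothetical, the package spec `MinkowskiEngine` is an assumption (the thread does not show the exact pip command used), and `--no-cache-dir` is added on the assumption that a previously built wheel without A100 support might otherwise be reused.

```python
import os
import subprocess
import sys

# Arch list quoted in the comment above; 8.0 is the A100's compute capability.
ARCH_LIST = "6.0 6.1 6.2 7.0 7.2 7.5 8.0 8.6"


def build_env(arch_list=ARCH_LIST, base_env=None):
    """Return a copy of the environment with TORCH_CUDA_ARCH_LIST set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = arch_list
    return env


def install_me(package="MinkowskiEngine", dry_run=True):
    """Build the pip command; run it only when dry_run is False.

    The package spec is an assumption, so adjust it to match your own install.
    """
    cmd = [sys.executable, "-m", "pip", "install", "--no-cache-dir", package]
    if not dry_run:
        # The variable must be visible to the build process, so pass it via env.
        subprocess.run(cmd, env=build_env(), check=True)
    return cmd


if __name__ == "__main__":
    # Dry run by default: only show the command that would be executed.
    print(" ".join(install_me(dry_run=True)))
```

Two caveats on the arch list: compute capability 8.6 needs CUDA 11.1 or newer, so a build against the CUDA 11.0 toolkit may need the list to stop at 8.0, and CUDA 10.2 cannot target 8.0 at all, so the A100 case requires a CUDA 11.x toolkit.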

Xnhyacinth commented 1 month ago

by using

How did you solve it?