getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Segfault when trying to use float16 on old GPUs (K80) #253

Open jeanfeydy opened 2 years ago

jeanfeydy commented 2 years ago

Hi all,

I am currently setting up the Singularity/Docker files and trying to render the documentation on a standard instance. During development, I use a cheap p2 instance on AWS EC2, with an old K80 GPU. Interestingly, the KeOps tests fail with:

Singularity> pytest -v pykeops/pykeops/test/
================================================= test session starts =================================================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/conda/bin/python
cachedir: .pytest_cache
rootdir: /home/keops/pykeops
collected 50 items                                                                                                    

pykeops/pykeops/test/test_chunks.py::test_chunks PASSED                                                         [  2%]
pykeops/pykeops/test/test_chunks_ranges.py::test_chunk_ranges PASSED                                            [  4%]
pykeops/pykeops/test/test_complex.py::test_complex_fw PASSED                                                    [  6%]
pykeops/pykeops/test/test_complex_numpy.py::test_complex_numpy PASSED                                           [  8%]
pykeops/pykeops/test/test_contiguous_numpy.py::test_contiguous_numpy PASSED                                     [ 10%]
pykeops/pykeops/test/test_contiguous_torch.py::test_contiguous_torch PASSED                                     [ 12%]
pykeops/pykeops/test/test_finalchunks.py::test_finalchunk PASSED                                                [ 14%]
pykeops/pykeops/test/test_finalchunks_ranges.py::test_finalchunks_ranges PASSED                                 [ 16%]
pykeops/pykeops/test/test_float16.py::TestCase::test_float16_fw Fatal Python error: Segmentation fault

Thread 0x00007f8274dfd640 (most recent call first):
<no Python frame>

Current thread 0x00007f836aee5440 (most recent call first):
  File "/home/keops/keopscore/keopscore/binders/nvrtc/Gpu_link_compile.py", line 67 in generate_code
  File "/home/keops/keopscore/keopscore/binders/LinkCompile.py", line 101 in get_dll_and_params
  File "/home/keops/keopscore/keopscore/get_keops_dll.py", line 124 in get_keops_dll_impl
  File "/home/keops/keopscore/keopscore/utils/Cache.py", line 27 in __call__
  File "/home/keops/pykeops/pykeops/common/keops_io/LoadKeOps.py", line 126 in init
  File "/home/keops/pykeops/pykeops/common/keops_io/LoadKeOps.py", line 18 in __init__
  File "/home/keops/pykeops/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 15 in __init__
  File "/home/keops/keopscore/keopscore/utils/Cache.py", line 68 in __call__
  File "/home/keops/pykeops/pykeops/torch/generic/generic_red.py", line 78 in forward
  File "/home/keops/pykeops/pykeops/torch/generic/generic_red.py", line 624 in __call__
  File "/home/keops/pykeops/pykeops/common/lazy_tensor.py", line 937 in __call__
  File "/home/keops/pykeops/pykeops/common/lazy_tensor.py", line 755 in reduction
  File "/home/keops/pykeops/pykeops/common/lazy_tensor.py", line 1800 in sum
  File "/home/keops/pykeops/pykeops/test/test_float16.py", line 26 in fun
  File "/home/keops/pykeops/pykeops/test/test_float16.py", line 35 in test_float16_fw
  File "/opt/conda/lib/python3.8/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.8/site-packages/_pytest/python.py", line 1761 in runtest
  File "/opt/conda/lib/python3.8/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.8/site-packages/_pytest/runner.py", line 259 in <lambda>
  File "/opt/conda/lib/python3.8/site-packages/_pytest/runner.py", line 338 in from_call
  File "/opt/conda/lib/python3.8/site-packages/_pytest/runner.py", line 258 in call_runtest_hook
  File "/opt/conda/lib/python3.8/site-packages/_pytest/runner.py", line 219 in call_and_report
  File "/opt/conda/lib/python3.8/site-packages/_pytest/runner.py", line 130 in runtestprotocol
  File "/opt/conda/lib/python3.8/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.8/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.8/site-packages/_pytest/main.py", line 322 in _main
  File "/opt/conda/lib/python3.8/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/opt/conda/lib/python3.8/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/opt/conda/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/opt/conda/lib/python3.8/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/opt/conda/lib/python3.8/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/opt/conda/bin/pytest", line 8 in <module>
Segmentation fault (core dumped)

This is not surprising, since float16 capabilities were introduced after the release of the K80 GPU. Still, I wonder if we could detect this problem (e.g. using the compute capability) and fail more gracefully. What do you think?
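To make the idea concrete, here is a minimal sketch of such a check. It is not existing KeOps code: the helper names (`supports_float16`, `require_float16`, `check_current_device`) are hypothetical, and the 5.3 threshold is an assumption based on my understanding that native half-precision arithmetic in CUDA requires compute capability 5.3 or higher. The device query uses `torch.cuda.get_device_capability`, which returns a `(major, minor)` tuple.

```python
# Hypothetical helpers for illustration only; none of these names exist in KeOps.

# Assumption: native float16 arithmetic needs compute capability >= 5.3.
MIN_FLOAT16_CAPABILITY = (5, 3)


def supports_float16(capability):
    """Return True if a (major, minor) compute capability is at least 5.3."""
    return tuple(capability) >= MIN_FLOAT16_CAPABILITY


def require_float16(capability, device_name="GPU"):
    """Raise a readable error if the capability is below the float16 threshold."""
    if not supports_float16(capability):
        major, minor = capability
        raise RuntimeError(
            f"float16 is not supported on {device_name} "
            f"(compute capability {major}.{minor}, "
            f"need >= {MIN_FLOAT16_CAPABILITY[0]}.{MIN_FLOAT16_CAPABILITY[1]})."
        )


def check_current_device(device_index=0):
    """Query the device through PyTorch (imported lazily) and run the check."""
    import torch

    capability = torch.cuda.get_device_capability(device_index)
    name = torch.cuda.get_device_name(device_index)
    require_float16(capability, name)
```

On a K80 (compute capability 3.7), calling `require_float16((3, 7), "Tesla K80")` before generating the float16 code would raise a `RuntimeError` with an explicit message rather than a segmentation fault. Where exactly such a check should live in the compilation pipeline is left open here.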

Best regards, Jean