BlackHolePerturbationToolkit / FastEMRIWaveforms

Blazingly fast EMRI waveforms

UnitTest Failures #5

Closed cchapmanbird closed 3 years ago

cchapmanbird commented 3 years ago

Hi, I'm fairly sure I have installed everything properly, but some of the unit tests fail (python -m unittest discover).

The three failing tests are test_aak, test_detector_frame, and test_fast_and_slow, with the errors

1. NameError: name 'pyWaveform_gpu' is not defined
2 & 3. NameError: name 'neural_layer_wrap' is not defined

Can you recommend anything to avoid this? I think this only affects GPU use - I have properly installed CuPy and have made sure CUDA is in my $PATH.
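For context, these look like they come from the usual guarded-import pattern for optional GPU extensions. A minimal sketch of what I assume is going on (the module name below is a placeholder; only the two function names appear in my errors):

use_gpu = True

try:
    # hypothetical module name; the real import lives somewhere inside the few package
    from some_gpu_extension import neural_layer_wrap
except ImportError:
    # extension not built or not found: the name is simply never bound
    pass

if use_gpu:
    # raises NameError: name 'neural_layer_wrap' is not defined
    layer_func = neural_layer_wrap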

mikekatz04 commented 3 years ago

This is definitely an install issue. What is happening is that you are trying to run with use_gpu=True, but the GPU modules needed for those tests are not being found in your Python installation. The first thing I would check is the site-packages folder within your conda environment directory. Go to something like ....../anaconda3/envs/few_env/lib/python3.7/site-packages/few-1.2.1-py3.7-macosx-10.9-x86_64.egg/ and check whether you have the files pygpuAAK.cpython-37m-x86_64-linux-gnu.so and pymatmul.cpython-37m-x86_64-linux-gnu.so.

If those files are not there, then something is going wrong during the install and the installer cannot find your CUDA binaries. The environment variable CUDAHOME or CUDA_HOME must be set to the CUDA home directory, OR nvcc needs to be on your PATH. If those files are in the proper place, then the issue is most likely the search path your current Python distribution uses when you run the code. If it is still neither of those, we can look into it further.
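A quick way to check both of those things from Python (just a sketch; adjust the paths/names for your environment):

import glob
import os
import shutil
import site

# 1) are the compiled GPU extensions anywhere under site-packages?
for directory in site.getsitepackages():
    hits = glob.glob(os.path.join(directory, "**", "pygpuAAK*.so"), recursive=True)
    hits += glob.glob(os.path.join(directory, "**", "pymatmul*.so"), recursive=True)
    if hits:
        print(directory, "->", hits)

# 2) would setup.py be able to locate CUDA?
print("CUDAHOME:", os.environ.get("CUDAHOME"))
print("CUDA_HOME:", os.environ.get("CUDA_HOME"))
print("nvcc on PATH:", shutil.which("nvcc"))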

The relevant section of setup.py is below:

# (Excerpt from setup.py; pjoin is os.path.join and find_in_path is a helper defined elsewhere in the file.)
# First check if the CUDAHOME env variable is in use
if "CUDAHOME" in os.environ or "CUDA_HOME" in os.environ:
    try:
        home = os.environ["CUDAHOME"]
    except KeyError:
        home = os.environ["CUDA_HOME"]

    nvcc = pjoin(home, "bin", "nvcc")
else:
    # Otherwise, search the PATH for NVCC
    nvcc = find_in_path("nvcc", os.environ["PATH"])
    if nvcc is None:
        raise EnvironmentError(
            "The nvcc binary could not be "
            "located in your $PATH. Either add it to your path, "
            "or set $CUDAHOME"
        )
    home = os.path.dirname(os.path.dirname(nvcc))
cchapmanbird commented 3 years ago

Hi, thanks for getting back to me. You were right, I had not set CUDAHOME to the CUDA home directory (I had added it to PATH instead). Now when I install with setup.py I hit a different error, though: '/usr/local/cuda/bin/nvcc' failed with exit status 2. This might be some kind of version incompatibility, but I have freshly installed CUDA 11.2 and nvcc was installed with it, so I don't see why they would be incompatible. Do you have any suggestions? Thanks for the patience!

mikekatz04 commented 3 years ago

Is there any output associated with the failure? I have tested it up to CUDA 10.1, so I have not made the jump to CUDA 11.

cchapmanbird commented 3 years ago

Yes, it outputs 4 errors detected in the compilation of "src/matmul.cu". There is a lot before that (warnings about converting string literals), but I assume those are normal.

mikekatz04 commented 3 years ago

Can you post the 4 errors detected?

cchapmanbird commented 3 years ago

Above those warnings are the four errors, I think:

/usr/local/cuda/include/cuda_fp16.hpp(409): error: expected an identifier

/usr/local/cuda/include/cuda_fp16.hpp(411): error: expression must be an lvalue or a function designator

/usr/local/cuda/include/cuda_fp16.hpp(415): error: expression must have integral or unscoped enum type

/usr/local/cuda/include/cuda_fp16.hpp(438): error: expression must have integral or unscoped enum type

cchapmanbird commented 3 years ago

I've put the entire output on a pastebin: https://pastebin.com/pB31VFmd in case there is some more context further up that I've missed.

mikekatz04 commented 3 years ago

Okay. So, this is an error with the cuda files, rather than in the FEW files. See here for something similar: https://github.com/BVLC/caffe/issues/6011#issuecomment-583700749

What gcc version is being used? I will admit I am not super competent at solving these backend-related issues.

cchapmanbird commented 3 years ago

I have gcc installed in the conda environment (gcc_linux-64 9.3.0). gcc_impl_linux-64 9.3.0 is also installed, but I don't know if that's relevant here. I have no idea what goes on behind the scenes with cuda installations, so your guess is better than mine!

mikekatz04 commented 3 years ago

Try installing an older version of gcc into conda: conda install -c conda-forge gcc_linux-64=7.3.0
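Once that is installed, it is worth confirming that it is actually the gcc being picked up inside the environment. A rough check (note: if I remember right, the conda compiler packages install a prefixed binary rather than a plain gcc, so the exact name below may differ):

import shutil
import subprocess

# conda's compiler packages use a prefixed binary, so check both names
for name in ("gcc", "x86_64-conda_cos6-linux-gnu-gcc"):
    path = shutil.which(name)
    print(name, "->", path)
    if path is not None:
        # print its version to confirm it is the 7.x one
        subprocess.run([path, "--version"], check=True)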

cchapmanbird commented 3 years ago

Hi, no luck unfortunately - the same nvcc errors pop out. I tried uninstalling my system gcc/g++ compilers with apt and then reinstalling the older versions of gcc and g++ as you suggested, to make sure they were the ones being used in the installation, but the same error comes out.

mikekatz04 commented 3 years ago

Hmm. I am puzzled at the moment. I would suggest writing a very basic "hello world"-type CUDA script and seeing whether you hit the same issue when you compile and run it. This error is coming from a header file for fp16, which is half-precision, 16-bit floating point. There is no FP16 usage within FEW, so this has to be something with the compiler or a related issue.
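Something along these lines would do as a test (a rough sketch, not taken from FEW; it writes a trivial kernel to a temporary file and compiles it with the same nvcc, so the nvcc + host-compiler combination gets exercised):

import os
import subprocess
import tempfile

kernel_src = r"""
#include <cstdio>

__global__ void hello() { printf("hello from the GPU\n"); }

int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
"""

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.cu")
    exe = os.path.join(tmp, "hello")
    with open(src, "w") as f:
        f.write(kernel_src)
    # compile with whichever nvcc is on PATH, then run the resulting binary
    subprocess.run(["nvcc", src, "-o", exe], check=True)
    subprocess.run([exe], check=True)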

cchapmanbird commented 3 years ago

I wrote a very simple printf "hello world" CUDA script and it compiled and ran successfully. I also compiled some of the CUDA samples and was able to run them. This issue might not appear for something so simple, though, and I don't know how to write CUDA beyond the basics - do you have a sample script I can run that might identify the issue?

mikekatz04 commented 3 years ago

What type of GPU are you running on (P100, A100, V100, etc.)?

cchapmanbird commented 3 years ago

I am running on a Geforce GTX 1660 SUPER. It has a CUDA compute capability of 7.5 I believe, so it should work.
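(If it's useful, the compute capability can also be read straight from Python with CuPy, e.g.:)

import cupy as cp

# prints e.g. '75' for a compute-capability-7.5 card like the GTX 1660 SUPER
print(cp.cuda.Device(0).compute_capability)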

mikekatz04 commented 3 years ago

Looking back at the setup file, I just realized that I currently don't have that CUDA arch included for compilation. When debugging and working on code improvements I remove those lines so compilation is faster. Please try uncommenting this line (269) in setup.py: #'-gencode=arch=compute_75,code=sm_75',. Let me know if that works.

cchapmanbird commented 3 years ago

I'll give it a try tomorrow morning and let you know.

mikekatz04 commented 3 years ago

Thanks.

cchapmanbird commented 3 years ago

Hi there, no luck unfortunately - same error messages.

mikekatz04 commented 3 years ago

I am still not sure about this issue. I am trying to talk to people who would know better than I do, but no luck so far. Has there been any movement on your end?

cchapmanbird commented 3 years ago

Hi there, I haven't investigated the problem further on my end. It is very possible that it's something unusual about my particular setup, but that is hard to verify without doing a clean install of everything. I may give that a go in the end, though, since if that is the cause this problem could come up again somewhere else down the line.

mikekatz04 commented 3 years ago

If you find a solution let me know. When I find someone that knows more about this than I do, I will let you know what is determined.

cchapmanbird commented 3 years ago

Hi again, I am having another go at installing FEW with GPU capability. This time I'm installing it into my conda environment on a cluster (to which I do not have admin privileges). The cluster has gcc-8 and gcc-7 installed, with 'which gcc' pointing to gcc-8, and CUDA V10.0 (i.e. only supported by gcc-7). When I run the FEW setup.py, it tries to use gcc-8 even though the $CC environment variable points to gcc-7.

I tried installing the conda gcc/gxx 7.x packages and the same issue persists. Does the installation use the $CC variable to choose the right gcc version?

PS I can open this as a new issue if that makes more sense.
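For what it's worth, my current understanding (happy to be corrected) is that distutils does read $CC for the plain C/C++ compilation, but nvcc picks its host compiler from whatever gcc it finds on PATH unless it is passed -ccbin/--compiler-bindir explicitly, so exporting CC alone may not be enough. A quick sanity check along those lines:

import os
import shutil
import subprocess

print("CC  =", os.environ.get("CC"))
print("CXX =", os.environ.get("CXX"))
print("gcc on PATH  ->", shutil.which("gcc"))
print("nvcc on PATH ->", shutil.which("nvcc"))

# confirm which CUDA toolkit this nvcc belongs to
subprocess.run(["nvcc", "--version"], check=True)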

cchapmanbird commented 3 years ago

I installed nvcc_linux-64 with conda and updated setup.py, which eliminated the incompatibility issue from before. Now, even though the installation completes successfully (with the aforementioned GPU modules present in the FEW site-packages directory), the unit tests still fail for the same reasons. Do you have any suggestions? The FEW site-packages directory is in PYTHONPATH, so that isn't the issue either (a quick import check is sketched below the test output).

Copying in @JohnVeitch as he is assisting on this.

NB. The output of the unit tests is

======================================================================
ERROR: test_aak (few.tests.test_aak.AAKWaveformTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/tests/test_aak.py", line 67, in test_aak
    wave_gpu = Pn5AAKWaveform(inspiral_kwargs, sum_kwargs, use_gpu=True)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 968, in __init__
    self.create_waveform = AAKSummation(**sum_kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/summation/aakwave.py", line 80, in __init__
    self.waveform_generator = pyWaveform_gpu
NameError: name 'pyWaveform_gpu' is not defined

======================================================================
ERROR: test_detector_frame (few.tests.test_detector_wave.WaveformTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/tests/test_detector_wave.py", line 28, in test_detector_frame
    "FastSchwarzschildEccentricFlux", *waveform_kwargs
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 87, in __init__
    self.waveform_generator = waveform(*args, **kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 786, in __init__
    **kwargs
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 421, in __init__
    self.amplitude_generator = amplitude_module(**amplitude_kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/amplitude/romannet.py", line 144, in __init__
    self.neural_layer = neural_layer_wrap
NameError: name 'neural_layer_wrap' is not defined

======================================================================
ERROR: test_fast_and_slow (few.tests.test_few.WaveformTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/tests/test_few.py", line 59, in test_fast_and_slow
    use_gpu=gpu_available,
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 786, in __init__
    **kwargs
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 421, in __init__
    self.amplitude_generator = amplitude_module(**amplitude_kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/amplitude/romannet.py", line 144, in __init__
    self.neural_layer = neural_layer_wrap
NameError: name 'neural_layer_wrap' is not defined


Ran 7 tests in 3.886s

FAILED (errors=3)
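If it helps, I can also run something like this to see which compiled modules Python actually loads (and what the underlying error is if they fail to import):

import importlib

for name in ("pymatmul", "pygpuAAK"):
    try:
        module = importlib.import_module(name)
        print(name, "->", module.__file__)
    except ImportError as exc:
        # e.g. a missing or mismatched CUDA runtime library would show up here
        print(name, "failed to import:", exc)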

mikekatz04 commented 3 years ago

I will send you an email. Check for one.

mikekatz04 commented 3 years ago

I am going to mark this as resolved, since the gcc/nvcc issue was the source. Reopen it if we still have problems after that is fixed.