This is definitely an install issue. What is happening is that you are trying to run with use_gpu=True, but the GPU modules necessary for running those tests are not being found in your Python installation. The first thing I would check is the site-packages folder within your conda environment directory. Go to something like ....../anaconda3/envs/few_env/lib/python3.7/site-packages/few-1.2.1-py3.7-macosx-10.9-x86_64.egg/ and check whether you have the files pygpuAAK.cpython-37m-x86_64-linux-gnu.so and pymatmul.cpython-37m-x86_64-linux-gnu.so.
If those files are not there, then something is going wrong during the install: the installer cannot find your CUDA binaries. Either the environment variable CUDAHOME or CUDA_HOME must be set to the CUDA home directory, or nvcc needs to be on your PATH. If those files are in the proper place, then the issue is most likely the path that your current Python distribution is looking up when you run the code. If it is neither of those, we can look into it further.
The relevant section of setup.py is below:
# First check if the CUDAHOME env variable is in use
if "CUDAHOME" in os.environ or "CUDA_HOME" in os.environ:
    try:
        home = os.environ["CUDAHOME"]
    except KeyError:
        home = os.environ["CUDA_HOME"]
    nvcc = pjoin(home, "bin", "nvcc")
else:
    # Otherwise, search the PATH for NVCC
    nvcc = find_in_path("nvcc", os.environ["PATH"])
    if nvcc is None:
        raise EnvironmentError(
            "The nvcc binary could not be "
            "located in your $PATH. Either add it to your path, "
            "or set $CUDAHOME"
        )
    home = os.path.dirname(os.path.dirname(nvcc))
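If you want a quick way to test both of those conditions outside of setup.py, a small standalone check like this sketch mirrors the logic above (nothing FEW-specific in it):

# Standalone check mirroring the setup.py logic above: report whether
# CUDAHOME/CUDA_HOME is set and whether nvcc is visible on PATH.
import os
import shutil

home = os.environ.get("CUDAHOME") or os.environ.get("CUDA_HOME")
print("CUDA home:", home if home else "not set")
print("nvcc on PATH:", shutil.which("nvcc") or "not found")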
Hi, thanks for getting back to me. You were right, I had not set CUDAHOME to the CUDA home directory (I had added it to PATH instead). Now when I install with setup.py I hit a different error, though: '/usr/local/cuda/bin/nvcc' failed with exit status 2. This might be some kind of version incompatibility, but I have freshly installed CUDA 11.2 and nvcc was installed with it, so I don't see why they would be incompatible. Do you have any suggestions?
Thanks for the patience!
Is there any output associated with the failure? I have tested it up to CUDA 10.1, so I have not made the jump to CUDA 11.
Yes, it outputs 4 errors detected in the compilation of "src/matmul.cu". There is a lot of output before that (warnings about converting string literals), but I assume those are normal.
Can you post the 4 errors detected?
Above those warnings are the four errors, I think:
/usr/local/cuda/include/cuda_fp16.hpp(409): error: expected an identifier
/usr/local/cuda/include/cuda_fp16.hpp(411): error: expression must be an lvalue or a function designator
/usr/local/cuda/include/cuda_fp16.hpp(415): error: expression must have integral or unscoped enum type
/usr/local/cuda/include/cuda_fp16.hpp(438): error: expression must have integral or unscoped enum type
I've put the entire output on a pastebin: https://pastebin.com/pB31VFmd in case there is some more context further up that I've missed.
Okay. So this is an error in the CUDA files rather than in the FEW files. See here for something similar: https://github.com/BVLC/caffe/issues/6011#issuecomment-583700749
What gcc version is being used? I will admit I am not super competent at solving these backend-related issues.
I have gcc installed in the conda environment (gcc_linux-64 9.3.0). gcc_impl_linux-64 9.3.0 is also installed, but I don't know if that's relevant here. I have no idea what goes on behind the scenes with cuda installations, so your guess is better than mine!
Try installing an older version of gcc into conda: conda install -c conda-forge gcc_linux-64=7.3.0
Hi, no luck unfortunately; the same nvcc errors pop out. I also tried uninstalling my system gcc/g++ compilers with apt and reinstalling the older versions of gcc and g++ as you suggested, to make sure they were being used in the installation, but the same error appears.
Hmm. I am puzzled currently. I would suggest writing a very basic "hello world" type CUDA script, compiling and running it, and seeing if you hit the same issue. This error is coming from a header file for FP16, which is half-precision, 16-bit floating point. There is no FP16 usage within FEW, so this has to be something with the compiler or a related issue.
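If it helps, here is a minimal sketch of the kind of test I mean: it writes a trivial kernel to a temporary file and compiles it with nvcc from Python (file names are placeholders, and you could add an #include <cuda_fp16.h> line to the source to pull in the failing header directly):

# Sketch: compile a trivial CUDA kernel with nvcc to test the toolchain.
# Adding '#include <cuda_fp16.h>' to the source below would exercise the
# header that is failing in the FEW build. Names here are placeholders.
import os
import subprocess
import tempfile

source = """\
#include <cstdio>

__global__ void hello() { printf("hello from the GPU\\n"); }

int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
"""

with tempfile.TemporaryDirectory() as tmp:
    cu_file = os.path.join(tmp, "hello.cu")
    with open(cu_file, "w") as f:
        f.write(source)
    result = subprocess.run(
        ["nvcc", cu_file, "-o", os.path.join(tmp, "hello")],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    print(result.stderr)
    print("nvcc exit status:", result.returncode)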
I wrote a very simple printf "hello world" cuda script and it compiled and ran successfully. I also compiled some of the samples they provide and was able to run them. This issue might not appear for something so simple though, and I don't know how to write CUDA beyond the basics - do you have a sample script I can run that might identify the issue?
What type of GPU are you running on (P100, A100, V100, etc.)?
I am running on a GeForce GTX 1660 SUPER. It has a CUDA compute capability of 7.5, I believe, so it should work.
I just realized, looking back at the setup file, that I currently do not have that CUDA arch included for compilation. When debugging and working on code improvements I remove those flags so compilation is faster. Please try uncommenting this line (269) in setup.py: #'-gencode=arch=compute_75,code=sm_75'. Let me know if that works.
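For reference, once uncommented, that block looks roughly like the sketch below; the surrounding flags here are illustrative rather than copied verbatim from the file:

# Illustrative sketch of the nvcc arch flags in setup.py; the surrounding
# entries are examples. The key change is the uncommented sm_75 line,
# which generates code for Turing cards like the GTX 1660 SUPER.
nvcc_gencode_flags = [
    "-gencode=arch=compute_70,code=sm_70",
    "-gencode=arch=compute_75,code=sm_75",  # line 269, now uncommented
]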
I'll give it a try tomorrow morning and let you know.
Thanks.
Hi there, no luck unfortunately - same error messages.
I am still not sure about this issue. I am trying to talk to people who would know better than I do, but no luck so far. Has there been any movement on your end?
Hi there, I haven't investigated the problem further on my end. It is very possible it's something unusual about my setup in particular, but that is hard to verify without doing a clean install of everything. I may give that a go in the end, though, since if that is the cause this problem could occur again somewhere else down the line.
If you find a solution let me know. When I find someone that knows more about this than I do, I will let you know what is determined.
Hi again, I am having another go at installing FEW with GPU capability. This time, I'm installing it in my conda environment on a cluster (to which I do not have admin privileges). The cluster has gcc-8 and gcc-7 installed, with 'which gcc' pointing to gcc-8, and CUDA V10.0 (i.e. only supported by gcc-7). When I try to run the FEW setup.py, it uses gcc-8 even though the $CC environment variable points to gcc-7.
I tried installing the conda gcc/gxx v7.x and the same issue persists. Does the installation use the $CC variable to choose the right gcc version?
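One workaround I am considering, assuming setup.py will pass extra nvcc flags through, is forcing the host compiler explicitly via nvcc's standard -ccbin option, along these lines:

# Hypothetical tweak: force nvcc's host compiler with its standard -ccbin
# option. The gcc-7 path below is an example for my cluster, not FEW code.
import os

host_cc = os.environ.get("CC", "/usr/bin/gcc-7")
nvcc_flags = ["-ccbin", host_cc]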
PS I can open this as a new issue if that makes more sense.
I installed nvcc_linux-64 with conda and updated the setup.py, which eliminated the incompatibility issue from before. Now, even though the installation completes successfully (with the aforementioned GPU modules found in the FEW site-packages directory) the unit tests still fail for the same reasons. Do you have any suggestions? The FEW site-packages directory is in PYTHONPATH, so that isn't the issue either.
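For what it's worth, a quick check along these lines (just a sketch) shows where Python is resolving the GPU extensions from:

# Sketch: report where Python resolves the compiled GPU extension modules.
import importlib.util

for name in ["pymatmul", "pygpuAAK"]:
    spec = importlib.util.find_spec(name)
    print(name, "->", spec.origin if spec is not None else "not found")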
Copying in @JohnVeitch as he is assisting on this.
NB. The output of the unit tests is:

Traceback (most recent call last):
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/tests/test_aak.py", line 67, in test_aak
    wave_gpu = Pn5AAKWaveform(inspiral_kwargs, sum_kwargs, use_gpu=True)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 968, in __init__
    self.create_waveform = AAKSummation(**sum_kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/summation/aakwave.py", line 80, in __init__
    self.waveform_generator = pyWaveform_gpu
NameError: name 'pyWaveform_gpu' is not defined

Traceback (most recent call last):
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/tests/test_detector_wave.py", line 28, in test_detector_frame
    "FastSchwarzschildEccentricFlux", *waveform_kwargs
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 87, in __init__
    self.waveform_generator = waveform(*args, **kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 786, in __init__
    **kwargs
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 421, in __init__
    self.amplitude_generator = amplitude_module(**amplitude_kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/amplitude/romannet.py", line 144, in __init__
    self.neural_layer = neural_layer_wrap
NameError: name 'neural_layer_wrap' is not defined

Traceback (most recent call last):
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/tests/test_few.py", line 59, in test_fast_and_slow
    use_gpu=gpu_available,
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 786, in __init__
    **kwargs
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/waveform.py", line 421, in __init__
    self.amplitude_generator = amplitude_module(**amplitude_kwargs)
  File "/scratch/wiay/christian/FastEMRIWaveforms/few/amplitude/romannet.py", line 144, in __init__
    self.neural_layer = neural_layer_wrap
NameError: name 'neural_layer_wrap' is not defined

Ran 7 tests in 3.886s

FAILED (errors=3)
I will send you an email. Check for one.
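For context on the error itself: that NameError just means the compiled GPU extension was never imported, so the name was never bound. FEW guards those imports roughly like this simplified sketch (the actual code differs in detail):

# Simplified sketch of the guarded-import pattern behind this NameError.
# If the compiled CUDA extension is missing, the import fails silently and
# the GPU name is never bound; asking for use_gpu=True then raises NameError.
try:
    from pymatmul import neural_layer_wrap  # compiled GPU extension module
except (ImportError, ModuleNotFoundError):
    pass  # CPU-only install: fall through without binding the GPU name

class RomanAmplitude:  # hypothetical stand-in, not FEW's actual class
    def __init__(self, use_gpu=False):
        if use_gpu:
            self.neural_layer = neural_layer_wrap  # NameError if import failed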
I am going to mark this resolved due to the gcc-nvcc issue being the source. Reopen this if we still have issues after that is fixed.
Hi, I'm fairly sure I have installed properly, but some of the unit tests fail (python -m unittest discover). The three failed tests are test_aak, test_detector_frame, and test_fast_and_slow, with the errors NameError: name 'pyWaveform_gpu' is not defined (test 1) and NameError: name 'neural_layer_wrap' is not defined (tests 2 and 3), respectively. Can you recommend anything to avoid this? I think this only affects GPU use - I have properly installed CuPy and have made sure CUDA is in my $PATH.
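For reference, a quick CuPy check along these lines (a sketch, assuming a working CuPy install) should confirm the GPU itself is visible:

# Quick sanity check that CuPy can see and use the GPU.
import cupy as cp

print("CUDA devices visible to CuPy:", cp.cuda.runtime.getDeviceCount())
print("Trivial GPU reduction:", cp.arange(4).sum())  # executes on the device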