brian-team / brian2genn

Brian 2 frontend to the GeNN simulator
http://brian2genn.readthedocs.io/
GNU General Public License v2.0

GeNN errors #110

Closed yeyeleijun closed 4 years ago

yeyeleijun commented 4 years ago

Hi all, I want to use Brian2GeNN and GeNN to speed up my Brian2 script. However, I get some errors, and I have put the command line output in the attachment. I do use the latest versions of Brian (2.3.0.2) and Brian2GeNN (1.5) and have also set CUDA_PATH and added GeNN's "bin" directory to my PATH, as you can see from the command line output. My script runs fine without the GPU and gives the results I want, so I don't know why I get these errors. It seems that they may come from GeNN. I run my script with Python 3.8 and gcc 7.5 under Anaconda; the Linux machine is CentOS 7.5.1804 with CUDA 9.0. How should I fix these errors? Thank you very much. script.txt Command Line Output.txt

tnowotny commented 4 years ago

I believe these are problems arising in the new GeNN kernel merging procedure, @neworderofjamie. I have identified two problems:

  1. "tauAMPA" is a parameter in a neuron population and there is an identically named variable in a Synapses group. This leads to the following:
    struct MergedSynapseDynamicsGroup1
    {
        double* inSyn;
        scalar tauAMPA;
        unsigned int* rowLength;
        uint32_t* ind;
        unsigned int* synRemap;
        double* sAMPA;
        double* tauAMPA;
        double* w;
        unsigned int rowStride;
        unsigned int numSrcNeurons;
        unsigned int numTrgNeurons;
    };

    the "scalar" tauAMPA derives from the presynaptic neuron group, I think, and the "double *" tauAMPA from the synapses themselves ... this might need disambiguation ... also not quite clear to me why the presynaptic neuron parameter gets included here.

neworderofjamie commented 4 years ago

Presynaptic neuron parameters should definitely have a _pre suffix to disambiguate them - will investigate.

tnowotny commented 4 years ago

  2. There is also a straightforward duplication of support code in the same namespace ...
yeyeleijun commented 4 years ago

Oh yes, you are right. I didn't notice that before. The two definitions of "tauAMPA" did cause the ambiguity. Let me fix it in the simplest way: I replaced all the synaptic parameters (tauAMPA, tauNMDA, ...) with their actual values, which eliminates the ambiguity. I ran the revised script again. Sadly, I still get the same errors. By the way, could you please run my script on your machine, to see whether these errors are caused by my script, my machine, or both? Thank you very much.

Command Line Output.txt revised script.txt

neworderofjamie commented 4 years ago

We can both reproduce your errors and are going to investigate today.

tnowotny commented 4 years ago

@yeyeleijun , you have indeed avoided one of the bugs now, but there is a second one - we are investigating ...

neworderofjamie commented 4 years ago

So we have fixed one issue in GeNN itself in the disambiguate_neuron_params branch and another in Brian2GeNN in the summedVar_supportCode_fix branch.

If you could confirm that the fixes work that would be awesome!

mstimberg commented 4 years ago

Hi, happy to hear that you are getting to the bottom of this. I would be grateful for feedback from @yeyeleijun on whether it all works with the changes.

A more general comment (which as a side effect might make the bug disappear even without the changes): It seems to me you are using a very inefficient way to simulate the AMPA and GABA synapses. You can read more about this in section 2.2. of Brette et al. (2007), but briefly: Instead of

NeuronGroup(..., '''
        I_AMPA        = gAMPA_E*sAMPA_tot*(V - V_E) : amp
        sAMPA_tot : 1''')
Synapses(..., '''
        dsAMPA/dt     = -sAMPA/(2*ms) : 1 (clock-driven)
        sAMPA_tot_post = w * sAMPA : 1 (summed)
        w : 1
        ''', on_pre='sAMPA += 1')

you can use:

NeuronGroup(..., '''
            I_AMPA = gAMPA_E*sAMPA_tot*(V - V_E) : amp
            dsAMPA/dt = -sAMPA/(2*ms) : 1''')
Synapses(..., 'w : 1', on_pre='sAMPA_post += w')

This is much more efficient, since it does not have to perform operations for each synapse at each time step. Note that this is not just an approximation: the two formulations are mathematically equivalent!
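
For reference, here is a minimal, self-contained sketch of the recommended formulation. All constants (gAMPA_E, gL, C, the Poisson rate, the weight) are made up for illustration; only the structure, a postsynaptic ODE for sAMPA_tot plus on_pre='sAMPA_tot_post += w', is what the suggestion above is about.

from brian2 import (NeuronGroup, PoissonGroup, Synapses, StateMonitor,
                    run, ms, mV, nS, pF, Hz)

# Illustrative constants (not taken from the original script)
gAMPA_E = 1*nS
V_E = 0*mV
gL = 10*nS
V_L = -70*mV
C = 200*pF

inputs = PoissonGroup(100, rates=10*Hz)
neurons = NeuronGroup(1, '''
    dV/dt = (gL*(V_L - V) - I_AMPA)/C : volt
    I_AMPA = gAMPA_E*sAMPA_tot*(V - V_E) : amp
    dsAMPA_tot/dt = -sAMPA_tot/(2*ms) : 1
    ''', method='euler')
neurons.V = V_L

# One scalar weight per synapse; the decay of the summed synaptic variable
# now lives in the NeuronGroup, so nothing is integrated per synapse.
syn = Synapses(inputs, neurons, 'w : 1', on_pre='sAMPA_tot_post += w')
syn.connect()
syn.w = 0.1

mon = StateMonitor(neurons, ['V', 'sAMPA_tot'], record=True)
run(200*ms)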

yeyeleijun commented 4 years ago

Sorry for the late reply. Glad you found the two bugs behind these errors. However, on my machine, I still get the same errors :(. I am not sure whether I have applied the fixes you mentioned, because I am new to GitHub. If possible, could you please send me the updated packages of Brian2GeNN and GeNN?

mstimberg commented 4 years ago

On my machine, things work after using the two branches that @neworderofjamie mentioned.

How did you install Brian2GeNN and GeNN on your machine originally? I imagine you used pip for Brian2GeNN? If yes, you can use the following command to update it to the branch with the fix:

pip install https://github.com/brian-team/brian2genn/archive/summedVar_supportCode_fix.zip

For GeNN, did you download the release zip file and extract it? In that case, you can download the corrected version here: https://github.com/genn-team/genn/archive/disambiguate_neuron_params.zip Note that it extracts into a genn-disambiguate_neuron_params directory; you'll have to move/rename it so that it matches the old name (genn-4.3.0) or configure PATH to refer to the new name.

yeyeleijun commented 4 years ago

Yes, I did just as you said, @mstimberg. Sadly, the same error. It must be a problem with my machine. Actually, I kept getting this error when I used the GeNN package for the first time: PermissionError: [Errno 13] Permission denied: '/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/bin/genn-buildmodel.sh'

And I fixed it with "chmod 777 -R genn-4.3.0". Does this matter?

tnowotny commented 4 years ago

@yeyeleijun, what does your debug output look like now?

yeyeleijun commented 4 years ago

Like this, @tnowotny: Command Line Output.txt

yeyeleijun commented 4 years ago

I also have another question. Although I have added GeNN's "bin" directory to my PATH, I get this output when I use "which genn":

(py38) [yelj@gpu18 ~]$ which genn
/usr/bin/which: no genn in (/home1/yelj/anaconda3/envs/py38/bin:/home1/yelj/anaconda3/envs/py38/bin:/share/apps/freesurfer/bin:/share/apps/freesurfer/fsfast/bin:/share/apps/freesurfer/tktools:/share/apps/fsl/bin:/share/apps/freesurfer/mni/bin:/share/apps/fsl/bin:/share/apps/niftyreg/bin:/share/apps/workbench/workbench-v1.3.2/bin_rh_linux64:/share/apps/CUDA/cuda-9.0/bin:/share/apps/ANTs/bin:/share/apps/Qt5.9.1/5.9.1/gcc_64/bin:/share/apps/AFNI_18.3.03/linux_centos_7_64:/share/apps/R/3.6.0/bin:/share/apps/mrtrix3/bin:/share/apps/mricrogl_lx:/share/apps/common_tools/bin:/share/apps/MATLAB/R2018b/bin:/share/apps/freesurfer/bin:/share/apps/freesurfer/fsfast/bin:/share/apps/freesurfer/tktools:/share/apps/fsl/bin:/share/apps/freesurfer/mni/bin:/share/apps/fsl/bin:/share/apps/niftyreg/bin:/share/apps/workbench/workbench-v1.3.2/bin_rh_linux64:/share/apps/CUDA/cuda-9.0/bin:/share/apps/ANTs/bin:/share/apps/Qt5.9.1/5.9.1/gcc_64/bin:/share/apps/AFNI_18.3.03/linux_centos_7_64:/share/apps/R/3.6.0/bin:/share/apps/mrtrix3/bin:/share/apps/mricrogl_lx:/share/apps/common_tools/bin:/share/apps/MATLAB/R2018b/bin:/home1/yelj/anaconda3/condabin:/share/apps/freesurfer/bin:/share/apps/fsl-5.0.10/bin:/share/apps/Qt5.9.1/5.9.1/gcc_64/bin:/share/apps/AFNI_18.3.03/linux_centos_7_64:/share/apps/R/3.5.1/bin:/share/apps/mrtrix3/bin:/share/apps/mricrogl_lx:/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/opt/pbs/bin:/home1/yelj/.local/bin:/home1/yelj/bin:/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/bin:/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/bin:/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/bin:/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/bin:/opt/pbs/bin:/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/bin:/opt/pbs/bin:/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/bin)

Does this also matter? It seems that the script can find the GeNN path, judging from my debug output.

tnowotny commented 4 years ago

I think you are having a problem with your CUDA install:

/share/inspurStorage/home1/yelj/anaconda3/envs/py38/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.5.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
make: *** [/share/inspurStorage/home1/yelj/GeNNworkspace/generator] Error 1

This means linking to the CUDA library fails. Have you tested whether CUDA works?
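
For a quick sanity check independent of GeNN, something like the following sketch (standard library only) can be run from your Python environment. Note that the build error is about the linker finding libcuda.so at link time, while this only checks that the driver library is loadable and usable at run time.

# Try to load the CUDA driver library and initialise it.
# An OSError means the dynamic loader cannot find libcuda;
# a non-zero return value from cuInit means the driver refused to start.
import ctypes

cuda = None
for name in ("libcuda.so", "libcuda.so.1"):
    try:
        cuda = ctypes.CDLL(name)
        break
    except OSError:
        pass

if cuda is None:
    print("libcuda could not be loaded")
else:
    print("cuInit returned", cuda.cuInit(0))  # 0 == CUDA_SUCCESS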

tnowotny commented 4 years ago

Regarding your other question, there is no genn binary; what needs the bin path is mainly genn-buildmodel.sh
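
A quick, illustrative way to check this from the same Python environment (standard library only):

# There is no `genn` executable, so `which genn` is expected to fail;
# what matters is that genn-buildmodel.sh is found on the PATH.
import shutil
print(shutil.which("genn-buildmodel.sh"))  # full path if found, otherwise None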

yeyeleijun commented 4 years ago

Actually, I have used the CUDA on the cluster before (I think the cluster's CUDA should work, because many users use it). So I installed a new CUDA 10.0 in my home directory and still get the same error: cannot find -lcuda / collect2: error: ld returned 1 exit status.

That CUDA installation does have a "libcuda.so" file, in the "/share/inspurStorage/home1/yelj/CUDA/cuda10.0/lib64/stubs" directory.

By googling, I found a way to fix it:

cd /share/inspurStorage/home1/yelj/anaconda3/envs/py38/lib/gcc/x86_64-conda_cos6-linux-gnu/7.5.0
ln -s /share/inspurStorage/home1/yelj/CUDA/cuda10.0/lib64/stubs/libcuda.so

However, a "Segmentation fault" occurs in genn-buildmodel.sh. output.txt

neworderofjamie commented 4 years ago

On a cluster system, those linker errors often mean that you are not actually logged in to a compute node, i.e. you are on a login node.

yeyeleijun commented 4 years ago

Emm... I think I am on a GPU compute node, @neworderofjamie. For gpu20, I get:

(py38) [yelj@gpu20 ~]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.44  Sun Dec 8 03:38:56 UTC 2019
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)

For the head node or a regular node, I get:

(py38) [yelj@node02 ~]$ cat /proc/driver/nvidia/version
cat: /proc/driver/nvidia/version: No such file or directory

neworderofjamie commented 4 years ago

And can you run nvidia-smi?

yeyeleijun commented 4 years ago

(py38) [yelj@gpu20 ~]$ nvidia-smi
Wed Jun 17 15:46:08 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 37%   66C    P2   254W / 250W | 10750MiB / 11178MiB  |     86%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 23%   32C    P8     8W / 250W |    10MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:D8:00.0 Off |                  N/A |
| 23%   32C    P8     8W / 250W |    10MiB / 11178MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    266419      C   python                                      983MiB  |
|    0    266536      C   python                                     9759MiB  |
+-----------------------------------------------------------------------------+

tnowotny commented 4 years ago

Hum ... that is strange, I wonder whether you are linking to a CUDA library that is not compatible with the CUDA driver ... Instead of linking to your local CUDA install, can you set CUDA_PATH to point to the system installed CUDA directory?

yeyeleijun commented 4 years ago

Both the system CUDA and my local CUDA give the same "-lcuda collect2" error. I reinstalled my Anaconda, the Python 3.8 environment, Brian2, Brian2GeNN and GeNN, all of it. Now the debug output says I should update the GCC version. This is also a problem that has been bothering me. My conda gcc version is 7.5 and my system gcc version is 4.8.5. Which gcc will GeNN choose, the conda one or the system one? It seems that in my old Anaconda, GeNN chose the conda gcc, and in my new Anaconda, GeNN chooses the system gcc. output.txt

tnowotny commented 4 years ago

This now definitely looks like a too-old gcc, I am afraid ... does anyone else have ideas what to do?

mstimberg commented 4 years ago

If you install Brian2 via conda, then it will automatically install the gcc in the environment (via packages like gxx_linux-64). They will be used automatically because conda sets the environment variables CC, GCC, etc. I think it does not work directly after installation if you had already activated the environment before and are still in the same environment. Could you try conda deactivate and conda activate <yourenvname>? It should then use the compilers with names like x86_64-conda_cos6-linux-gnu-gcc, which are some 7.x version.
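
A quick, illustrative check of whether the conda compilers are active in the current environment:

# In an activated environment with the conda compiler packages installed,
# these should point at x86_64-conda_cos6-linux-gnu-* binaries rather than
# the system gcc; empty values mean the system compiler will be picked up.
import os
for name in ("CC", "CXX", "GCC", "GXX"):
    print(name, "=", os.environ.get(name))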

yeyeleijun commented 4 years ago

Which gcc do you use, the conda gcc or the system gcc? The "-lcuda collect2" error should come from the use of conda:

/share/inspurStorage/home1/yelj/anaconda3/envs/py38/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.5.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
make: *** [/share/inspurStorage/home1/yelj/GeNNworkspace/generator] Error 1

After trying "conda deactivate" and "conda activate", it still chooses the system gcc.

mstimberg commented 4 years ago

Did you try deleting the GeNNworkspace directory first? If your environment has the conda gcc installed, it should set the environment variables CC, etc. (did you check?). If that is the case, I do not see how it could still choose the system gcc.

yeyeleijun commented 4 years ago

Yes, you are right, @mstimberg. My environment didn't have the conda gcc installed. I installed Brian2 via conda before, but this time I installed it via pip; that is why I didn't have gcc in my environment. And again, I get the "-lcuda collect2" error after installing gcc :(. It must be a problem with CUDA.

tnowotny commented 4 years ago

after talking to @neworderofjamie I believe the key to success now will be to identify where libcuda is located and why it is not in the linker's default path (which it usually is). On a typical Linux machine libcuda can be found in

/usr/lib/x86_64-linux-gnu

or similar, but might differ on your cluster install. Once you know where the right libcuda is, you could make sure it's found by setting LD_LIBRARY_PATH ... but of course there is the worry that it should already have been in a discoverable location ...
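
One way to check from Python whether the linker's default search already covers libcuda (a sketch; on Linux this consults the ldconfig cache under the hood):

# Prints something like 'libcuda.so.1' if the dynamic linker can find the
# CUDA driver library in its default paths, or None if it cannot.
import ctypes.util
print(ctypes.util.find_library("cuda"))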

neworderofjamie commented 4 years ago

I think maybe the issue is that the conda-supplied gcc doesn't search that path

mstimberg commented 4 years ago

If you manage to find libcuda.so you could also directly symlink it into conda's lib directory (something like path-to-your-conda-env/lib64) – not the cleanest solution, but I did something like this once and it worked.

There are also options to install CUDA via conda directly: the conda-forge packages cudatoolkit and cudatoolkit-dev. But they did not quite work for me when I tried. If you want to try them out, make sure you do this in a new (e.g. cloned) environment; at least the cudatoolkit-dev package does not uninstall cleanly, and afterwards everything is even more of a mess.

I also recently learned that there is a nvcc_linux-64 package. This is supposed to do exactly what you want to do: configure the conda environment so that it works with a system-wide CUDA installation. I never tried it out, but I saw it mentioned as the currently recommended solution. Not sure it works in your case, though. You'll have to install it for the specific CUDA version you want to use (e.g. conda install nvcc_linux-64=10.2), but I think it is only provided for versions between 9.2 and 10.2...

yeyeleijun commented 4 years ago

I found libcuda.so in the CUDA directory '/share/inspurStorage/apps/CUDA/cuda-10.0/lib64/stubs/libcuda.so'. And I did what @mstimberg said and symlinked it into conda's lib directory '/share/inspurStorage/home1/yelj/anaconda3/envs/python38/x86_64-conda_cos6-linux-gnu/lib'. This does work and fixes the "-lcuda collect2" error! However, a new error occurs: many undefined references.... Is this coming from a missing g++ compiler? Do I need to install g++ in my Python environment although I already have gcc? I do have the "/share/inspurStorage/home1/yelj/anaconda3/envs/python38/x86_64-conda_cos6-linux-gnu/include/c++/7.5.0" directory. output.txt

neworderofjamie commented 4 years ago

So that version is a 'stub', not the actual library (I don't fully understand what they're for) - the version you want to symlink is (on my system) located at /usr/lib/x86_64-linux-gnu, definitely not the one in the conda environment.

mstimberg commented 4 years ago

Maybe a command like:

ldconfig -v 2>/dev/null | grep -v "^"$'\t'

could help you find the paths in which to look for libcuda.so (you can ignore anything with i386 or lib32 in it; CUDA is 64-bit only).

These stub library files were confusing me a lot, too. If I understand correctly, they are for situations where you want to build a binary but not run it. E.g. you build on a cluster's login node which does not have a GPU (and therefore no GPU drivers) and then run it on the compute node.

yeyeleijun commented 4 years ago

(python38) [yelj@gpu20 ~]$ locate libcuda.so
/usr/lib64/libcuda.so
/usr/lib64/libcuda.so.1
/usr/lib64/libcuda.so.440.44

This command shows that the machine does have libcuda.so, in "/usr/lib64"; there is no conda libcuda.so, at least I don't see one in the conda directory.

mstimberg commented 4 years ago

So what happens if you symlink /usr/lib64/libcuda.so into /share/inspurStorage/home1/yelj/anaconda3/envs/python38/x86_64-conda_cos6-linux-gnu/lib (or .../lib64)?
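
For reference, the equivalent of that symlink done from Python (paths copied from the messages above; purely illustrative):

# Link the system driver library into the conda toolchain's library
# directory so that `-lcuda` can be resolved at link time.
import os
os.symlink("/usr/lib64/libcuda.so",
           "/share/inspurStorage/home1/yelj/anaconda3/envs/python38/"
           "x86_64-conda_cos6-linux-gnu/lib/libcuda.so")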

yeyeleijun commented 4 years ago

The "-lcuda collect2" error disappears. However, a new error occurs: "undefined reference to ....." output.txt

mstimberg commented 4 years ago

Just to make sure: did you delete GeNNworkspace before recompiling?

yeyeleijun commented 4 years ago

Yes, still the same "undefined ..." errors.

mstimberg commented 4 years ago

Could you also try to delete everything in the /share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/lib/ directory -- the files in there get compiled during the build process as well (but only once). Changing compilers/libraries between that initial compilation and their use by Brian2GeNN could mess things up.

yeyeleijun commented 4 years ago

That will lead to the following error:

/share/inspurStorage/home1/yelj/anaconda3/envs/python38/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.5.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: cannot find -lgenn_cuda_backend
/share/inspurStorage/home1/yelj/anaconda3/envs/python38/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.5.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: cannot find -lgenn
collect2: error: ld returned 1 exit status

And now the "/share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/lib/" directory no longer exists, even though I ran my Python script again.

mstimberg commented 4 years ago

That's odd, did you delete GeNNworkspace first? Either way, you can run make in the genn directory to recreate the lib files. But this should be done automatically when you run the script (but maybe not if there's still some half-finished compilation around in GeNNworkspace).

yeyeleijun commented 4 years ago

Yes, I deleted GeNNworkspace and /share/inspurStorage/home1/yelj/Downloads/genn-4.3.0/lib/, then ran my Python script again. The lib directory still isn't recreated automatically; only running 'make' in the genn directory recreates the 'lib' files.

mstimberg commented 4 years ago

Oh I see, it's because you deleted the directory instead of deleting the files in the directory. Creating an empty lib directory would also have worked.

Either way, what about running your script now?

yeyeleijun commented 4 years ago

Yes, I did delete the lib directory. Deleting just the files in the lib directory works. Now the debug output still shows "undefined reference to .....".

mstimberg commented 4 years ago

Ok, I'm afraid I'm out of my depth here... I think the actual issue is the "relocation R_X86_64_32 against '.rodata' can not be used when making a PIE object; recompile with -fPIE" error at the beginning, which seems to indicate some mismatch between CFLAGS and LDFLAGS. As a random idea, I'd propose to delete GeNN's lib directory again and run:

CFLAGS="$CFLAGS -fPIE" make

to recreate it. But I hope @neworderofjamie can help...

neworderofjamie commented 4 years ago

Yeah, as @mstimberg says, this is a mismatch between the settings used to build your conda environment, gcc and CUDA. Personally I wouldn't use conda, and would instead install devtoolset to get a newer system compiler: https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/

yeyeleijun commented 4 years ago

I am not root on the cluster, so I can't install devtoolset-7. I really appreciate your help, @mstimberg @tnowotny @neworderofjamie. Maybe there is nothing more we can do about this; I will run my code on the CPU only :). Thanks again for your help!