FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU-accelerated agent-based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

[BugReport] python_native fails to compile #1192

Closed: society-research closed this issue 6 months ago

society-research commented 6 months ago

Dear FLAMEGPU2 devs, I ran into a pretty straightforward issue, so it's likely I did some setup wrong. If another way to get in touch is preferred over a bug report, please let me know!

How to reproduce:

  1. Build FLAMEGPU2, with FLAMEGPU_BUILD_PYTHON=ON, FLAMEGPU_VISUALISATION=ON, CMAKE_BUILD_TYPE=Release.

  2. Run

    (venv) ➜  build git:(master) ✗ python ../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py
    Traceback (most recent call last):
      File "/home/ubuntu/model-socix-py/third_party/FLAMEGPU2/build/../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 389, in <module>
        cudaSimulation.initialise(sys.argv)
      File "/home/ubuntu/model-socix-py/third_party/FLAMEGPU2/build/venv/lib/python3.10/site-packages/pyflamegpu/pyflamegpu.py", line 9089, in initialise
        return _pyflamegpu.Simulation_initialise(self, argc)
    pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (InvalidAgentFunc) /home/ubuntu/FLAMEGPU2-model-template-python/third_party/FLAMEGPU2/src/flamegpu/detail/JitifyCache.cu(422): Error compiling runtime agent function (or function condition) ('outputdata'): function had compilation errors (see std::cout), in JitifyCache::buildProgram().

System Information: Ubuntu 22.04.4 LTS (GNU/Linux 6.5.0-26-generic x86_64), NVIDIA RTX A4000.

(venv) ➜  build git:(master) ✗ nvidia-smi
Fri Mar 29 13:36:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               On  | 00000000:00:05.0 Off |                  Off |
| 41%   33C    P8              15W / 140W |      1MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
(venv) ➜  build git:(master) ✗ dpkg -l|grep cuda
ii  cuda-cccl-12-2                       12.2.140-1                              amd64        CUDA CCCL
ii  cuda-command-line-tools-12-2         12.2.2-1                                amd64        CUDA command-line tools
ii  cuda-compiler-12-2                   12.2.2-1                                amd64        CUDA compiler
ii  cuda-crt-12-2                        12.2.140-1                              amd64        CUDA crt
ii  cuda-cudart-12-2                     12.2.140-1                              amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-12-2                 12.2.140-1                              amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-12-2                  12.2.140-1                              amd64        CUDA cuobjdump
ii  cuda-cupti-12-2                      12.2.142-1                              amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-12-2                  12.2.142-1                              amd64        CUDA profiling tools interface.
ii  cuda-cuxxfilt-12-2                   12.2.140-1                              amd64        CUDA cuxxfilt
ii  cuda-documentation-12-2              12.2.140-1                              amd64        CUDA documentation
ii  cuda-driver-dev-12-2                 12.2.140-1                              amd64        CUDA Driver native dev stub library
ii  cuda-gdb-12-2                        12.2.140-1                              amd64        CUDA-GDB
ii  cuda-keyring                         1.1-1                                   all          GPG keyring for the CUDA repository
ii  cuda-libraries-12-2                  12.2.2-1                                amd64        CUDA Libraries 12.2 meta-package
ii  cuda-libraries-dev-12-2              12.2.2-1                                amd64        CUDA Libraries 12.2 development meta-package
ii  cuda-nsight-12-2                     12.2.144-1                              amd64        CUDA nsight
ii  cuda-nsight-compute-12-2             12.2.2-1                                amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-12-2             12.2.2-1                                amd64        NVIDIA Nsight Systems
ii  cuda-nvcc-12-2                       12.2.140-1                              amd64        CUDA nvcc
ii  cuda-nvdisasm-12-2                   12.2.140-1                              amd64        CUDA disassembler
ii  cuda-nvml-dev-12-2                   12.2.140-1                              amd64        NVML native dev links, headers
ii  cuda-nvprof-12-2                     12.2.142-1                              amd64        CUDA Profiler tools
ii  cuda-nvprune-12-2                    12.2.140-1                              amd64        CUDA nvprune
ii  cuda-nvrtc-12-2                      12.2.140-1                              amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-12-2                  12.2.140-1                              amd64        NVRTC native dev links, headers
ii  cuda-nvtx-12-2                       12.2.140-1                              amd64        NVIDIA Tools Extension
ii  cuda-nvvm-12-2                       12.2.140-1                              amd64        CUDA nvvm
ii  cuda-nvvp-12-2                       12.2.142-1                              amd64        CUDA Profiler tools
ii  cuda-opencl-12-2                     12.2.140-1                              amd64        CUDA OpenCL native Libraries
ii  cuda-opencl-dev-12-2                 12.2.140-1                              amd64        CUDA OpenCL native dev links, headers
ii  cuda-profiler-api-12-2               12.2.140-1                              amd64        CUDA Profiler API
ii  cuda-sanitizer-12-2                  12.2.140-1                              amd64        CUDA Sanitizer
ii  cuda-toolkit-12-2                    12.2.2-1                                amd64        CUDA Toolkit 12.2 meta-package
ii  cuda-toolkit-12-2-config-common      12.2.140-1                              all          Common config package for CUDA Toolkit 12.2.
ii  cuda-toolkit-12-config-common        12.4.99-1                               all          Common config package for CUDA Toolkit 12.
ii  cuda-toolkit-config-common           12.4.99-1                               all          Common config package for CUDA Toolkit.
ii  cuda-tools-12-2                      12.2.2-1                                amd64        CUDA Tools meta-package
ii  cuda-visual-tools-12-2               12.2.2-1                                amd64        CUDA visual tools
Robadob commented 6 months ago
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (InvalidAgentFunc) /home/ubuntu/FLAMEGPU2-model-template-python/third_party/FLAMEGPU2/src/flamegpu/detail/JitifyCache.cu(422): Error compiling runtime agent function (or function condition) (
'outputdata'): function had compilation errors (see std::cout), in JitifyCache::buildProgram().

This is an error trying to compile the agent function outputdata at runtime.

I'm assuming you've shared stderr; do you have stdout in a separate output file? (There is/was an issue where Google Colab would eat the runtime compilation error messages, but I'm not aware of it occurring outside of Colab.)

Likewise, are you able to share the agent function? This would help me identify what your compilation error could be.
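(For reference, a generic way to confirm where each stream is going, not specific to FLAME GPU; running the example with the same redirections would capture the Jitify compile log and the Python traceback into separate files:)

```shell
# Generic demonstration: stdout and stderr are separate streams, and each
# can be redirected to its own file. For the boids example the equivalent
# would be:  python boids_spatial3D.py > out.log 2> err.log
python3 -c 'import sys; print("to stdout"); print("to stderr", file=sys.stderr)' > out.log 2> err.log
cat out.log   # contains "to stdout"  (where the Jitify compile log should land)
cat err.log   # contains "to stderr"  (where the Python traceback lands)
```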

Robadob commented 6 months ago

Ah sorry, just spotted this is with one of the examples. Give me an hour to look into it.

Robadob commented 6 months ago

I've just built a clean copy of pyflamegpu from FLAMEGPU2's master branch.

When running boids_spatial3D.py, the runtime compilation of the model, which is failing for you, succeeds for me.

Are you able to share:

  • Which version of flamegpu you're using? E.g. are you pulling a release tag rather than HEAD of master?

  • The stdout that includes the runtime compilation error output by Jitify?

It did fail to run under a debug build, however.

(venv) C:\Users\Robadob\fgpu2\examples\python_native\boids_spatial3D_wrapped>python boids_spatial3D.py
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[14,0,0][0,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

Traceback (most recent call last):
  File "C:\Users\Robadob\fgpu2\examples\python_native\boids_spatial3D_wrapped\boids_spatial3D.py", line 430, in <module>
    cudaSimulation.simulate()
  File "C:\Users\Robadob\fgpu2\build\lib\Debug\python\venv\lib\site-packages\pyflamegpu\pyflamegpu.py", line 9255, in simulate
    return _pyflamegpu.CUDASimulation_simulate(self, *args)
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (DeviceError) Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[14,0,0][0,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

But I think I must have broken this with this PR: https://github.com/FLAMEGPU/FLAMEGPU2/pull/1160, so that's a separate issue. For the second time today, I'll defer to @ptheywood; I know he was looking at moving this check outside of device code (#1182).

society-research commented 6 months ago

Are you able to share:

  • Which version of flamegpu you're using? E.g. are you pulling a release tag rather than HEAD of master?

I'm using the master branch; I cloned the repository just the day before yesterday.

  • The stdout that includes the runtime compilation error output by Jitify?

I have no idea where stdout is going. I'm not redirecting anything in my shell. Is stdout redirected by default somewhere? If so, where?

I'd love to help! Could you let me know how to get the debug output that is present in your call to boids_spatial3D.py? Even with a CMAKE_BUILD_TYPE=Debug build I don't get the debug information that is printed to your terminal. Do you set any environment variable? How can I trace the compilation error to a line in the Python code?

Robadob commented 6 months ago

I've now built it with CUDA 12.2 on Linux (no visualisation though; I only have access to headless Linux boxes). I get the same post-runtime-compilation issue that I was getting on Windows (with visualisation) yesterday (already known, #1177, with a few suitable workarounds).

(py311) rob@mavericks:~/fgpu2/build/lib/Release/python/venv/bin$ source activate
(venv) (py311) rob@mavericks:~/fgpu2/build/lib/Release/python/venv/bin$ cd ../../../..
(venv) (py311) rob@mavericks:~/fgpu2/build/lib$ cd ../..
(venv) (py311) rob@mavericks:~/fgpu2$ cd examples/python_native/boids_spatial3D_wrapped/
(venv) (py311) rob@mavericks:~/fgpu2/examples/python_native/boids_spatial3D_wrapped$ python boids_spatial3D.py
Traceback (most recent call last):
  File "/home/rob/fgpu2/examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 430, in <module>
    cudaSimulation.simulate()
  File "/home/rob/fgpu2/build/lib/Release/python/venv/lib/python3.11/site-packages/pyflamegpu/pyflamegpu.py", line 9255, in simulate
    return _pyflamegpu.CUDASimulation_simulate(self, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (DeviceError) Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[39,0,0][608,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

Likewise, I ran the same example from build rather than its own directory, and got much the same output.

I have no idea where stdout is going. I'm not redirecting anything in my shell. Is stdout redirected by default somewhere? If so, where?

Runtime compilation errors should go to regular stdout by default. I wrongly assumed you might be running on HPC or similar that splits stdout/stderr into separate files. It's handled by a third-party lib, and I can't recall the reason it didn't/doesn't work properly on Google Colab.

I forced a compilation error in the same example, and this is how it appeared.

(venv) (py311) rob@mavericks:~/fgpu2/examples/python_native/boids_spatial3D_wrapped$ python boids_spatial3D.py
---------------------------------------------------
--- JIT compile log for inputdata_program ---
---------------------------------------------------
inputdata_impl.cu(37): error: too few arguments in function call
              auto separation = vec3Length((agent_x - message_x), (agent_y - message_y));
                                                                                       ^

1 error detected in the compilation of "inputdata_program".

---------------------------------------------------
Traceback (most recent call last):
  File "/home/rob/fgpu2/examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 388, in <module>
    cudaSimulation.initialise(sys.argv)
  File "/home/rob/fgpu2/build/lib/Release/python/venv/lib/python3.11/site-packages/pyflamegpu/pyflamegpu.py", line 9089, in initialise
    return _pyflamegpu.Simulation_initialise(self, argc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (InvalidAgentFunc) /home/rob/fgpu2/src/flamegpu/detail/JitifyCache.cu(422): Error compiling runtime agent function (or function condition) ('inputdata'): function had compilation errors (see std::cout), in JitifyCache::buildProgram().

Technically it could be a non-compilation exception being thrown by Jitify, hence no compilation log, but I'm not sure what that would be.

At src/flamegpu/detail/JitifyCache.cu:420-425 you will find the try/catch that is eating this exception.

    } catch (std::runtime_error const&) {
        // jitify does not have a method for getting compile logs so rely on JITIFY_PRINT_LOG defined in cmake
        THROW exception::InvalidAgentFunc("Error compiling runtime agent function (or function condition) ('%s'): function had compilation errors (see std::cout), "
            "in JitifyCache::buildProgram().",
            func_name.c_str());
    }

If you replace that with the statement below, recompile pyflamegpu, and run the example again, you may get more useful information out.

    } catch (std::runtime_error const& e) {
        printf("%s\n", e.what());  // print the underlying Jitify error before rethrowing
        throw;
    }

Given I can't reproduce it locally, it's difficult for me to suggest much else at this time (and I suspect similar of my colleagues).

ptheywood commented 6 months ago

It's handled by a 3rd party lib, and I can't recall the reason it didn't/doesn't work properly on Google collab.

The version of jupyter/ipykernel on Google Colab has a bug that consumes stderr. This was fixed in ipykernel in 2021, and I opened an issue with Colab about it in 2021 (https://github.com/googlecolab/colabtools/issues/2230), but Colab is still running ipykernel 5.5.6.

https://github.com/FLAMEGPU/FLAMEGPU2-tutorial-python/issues/10


I've also attempted to reproduce your issue under Linux with visualisation, but as @Robadob found, the example compiles successfully for me before hitting the known runtime issue with wrapped communication, with commit b0ec5f3e98c6336afec0662bd9d7a1248b28bccb (current master), nvcc 12.2.140, gcc 11.4.0.

cmake .. -DCMAKE_CUDA_ARCHITECTURES=86 -DFLAMEGPU_BUILD_PYTHON=ON -DFLAMEGPU_VISUALISATION=ON
cmake --build . --target pyflamegpu -j 8 
source lib/Release/python/venv/bin/activate
$ python3 ../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py 
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
instanced_default_Tcolor_Tpos_Tdir_Tscale-material_flat_Tcolor: Generic vertex attrib named: _normal2 was not found.
Traceback (most recent call last):
  File "/home/ptheywood/code/flamegpu/FLAMEGPU2/build-12-2-vis/../examples/python_native/boids_spatial3D_wrapped/boids_spatial3D.py", line 430, in <module>
    cudaSimulation.simulate()
  File "/home/ptheywood/code/flamegpu/FLAMEGPU2/build-12-2-vis/lib/Release/python/venv/lib/python3.10/site-packages/pyflamegpu/pyflamegpu.py", line 9255, in simulate
    return _pyflamegpu.CUDASimulation_simulate(self, *args)
pyflamegpu.pyflamegpu.FLAMEGPURuntimeException: (DeviceError) Device function 'inputdata' reported 40000 errors.
First error:
flamegpu/runtime/messaging/MessageSpatial3D/MessageSpatial3DDevice.cuh(545)[12,0,0][864,0,0]:
Spatial messaging radius (0.05) is not a factor of environment dimensions (1, 1, 1), this is unsupported for the wrapped iterator, MessageSpatial3D::In::wrap().

Unfortunately this does not help us narrow down the issue you are having.

Given you appear to have built pyflamegpu from source successfully, you must have completed the CUDA post-installation steps too (so LD_LIBRARY_PATH wouldn't be the issue).

You also don't appear to have multiple CUDA installations (at least not via apt/dpkg), which rules out my next suggestion of checking the value of the CUDA_HOME or CUDA_PATH environment variable at runtime.


Are you just running this in a bash terminal, or from within an editor's terminal or similar? Unless you are running via an old version of jupyter I'm not aware of any reasons why the stdout would not be getting printed.

society-research commented 6 months ago

Are you just running this in a bash terminal, or from within an editor's terminal or similar? Unless you are running via an old version of jupyter I'm not aware of any reasons why the stdout would not be getting printed.

Yes, I'm running in a plain zsh; nothing fancy around this shell.

Same issue post-runtime compilation that I was getting on Windows (with Visualisation) yesterday (already known https://github.com/FLAMEGPU/FLAMEGPU2/issues/1177, with a few suitable workarounds).

I had to switch to another GPU hosting service; now I'm getting the same issue as mentioned in #1177 with both Release and Debug builds, and can no longer reproduce the vanishing stdout.

So from my side this issue is closed, since I can no longer reproduce it. Thanks to both of you for your quick support! :pray:

What are the workarounds for that issue? Just disable SEATBELTS?

ptheywood commented 6 months ago

Yes, I'm running in a plain zsh nothing fancy around this shell.

It should be fine then as far as I'm aware.

So from my side this issue here is closed, since I can no longer reproduce it, thanks both of you for your quick support! 🙏

No problem, I'll close this issue for now but feel free to re-open it if you re-encounter the original problem.

What are the workarounds for that issue? Just disable SEATBELTS?

Yes, a build disabling seatbelts via FLAMEGPU_SEATBELTS=OFF should disable the radius factor check (but error messages will be less helpful in general, although model runtimes will improve due to fewer checks).
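(For example, mirroring the configure line quoted earlier in this thread; a sketch, so adjust the architecture and other options to your setup:)

```shell
# Reconfigure with seatbelts disabled, then rebuild pyflamegpu from the build dir.
cmake .. -DCMAKE_CUDA_ARCHITECTURES=86 -DFLAMEGPU_BUILD_PYTHON=ON \
         -DFLAMEGPU_VISUALISATION=ON -DFLAMEGPU_SEATBELTS=OFF
cmake --build . --target pyflamegpu -j 8
```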

Unfortunately, due to other commitments I'm not sure when I'll have time to fully resolve #1177 (via #1182).

Robadob commented 5 months ago

What are the workarounds for that issue? Just disable SEATBELTS?

In addition to disabling seatbelts (FLAMEGPU_SEATBELTS=OFF), you can also switch to using non-wrapped spatial messages.

Or simply comment out the check (as you are compiling it yourself); let me know if you would like the line numbers. Adjusting the environment size slightly would also probably work, and might be a sensible temporary patch on our part.
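(As a rough illustration of that last workaround, with hypothetical names; the real fix is tracked in #1177/#1182. The idea is to derive the environment size from the radius, so the dimensions are an exact integer multiple of it, rather than choosing both independently:)

```python
# Hedged sketch: size the environment as an exact integer multiple of the
# interaction radius, so the wrapped-messaging "radius is a factor of the
# environment dimensions" check is more likely to hold. Whether this passes
# in practice depends on floating-point details inside the check itself.
RADIUS = 0.05        # communication radius from the boids example
BINS_PER_AXIS = 20   # any integer number of message bins per axis
ENV_DIM = RADIUS * BINS_PER_AXIS  # environment width/height/depth

# The pyflamegpu Spatial3D message description would then be configured
# roughly like this (illustrative, not verified against the example):
#   message.setRadius(RADIUS)
#   message.setMin(0, 0, 0)
#   message.setMax(ENV_DIM, ENV_DIM, ENV_DIM)
print(ENV_DIM)
```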