chrxh / alien

ALIEN is a CUDA-powered artificial life simulation program.
https://alien-project.org
BSD 3-Clause "New" or "Revised" License

Simulator Keeps crashing #73

TheBarret closed this issue 9 months ago

TheBarret commented 10 months ago

The simulator crashes from time to time when I use the multiplier tool or when I hit start (play). There is no error message of any kind; the app exits immediately. It happens with an empty world as well as a full one, so I assume the cause is somewhere deeper in the engine.

What could this be?

chrxh commented 10 months ago

Hi, could you please start ALIEN with the command line argument -debug and then post the log.txt after the crash has occurred?

TheBarret commented 10 months ago

I will report back here as soon as I have sufficient debug output. I don't know what triggers the crash, so in the meantime I'll keep that flag on, and when it happens I can pass the log on to you.

chrxh commented 10 months ago

Alright! The simulation runs a bit slower (~15%) with this flag on.

TheBarret commented 10 months ago

2023-09-20 19-03-30: device 0 is set
2023-09-20 19-03-30: initialize simulation
2023-09-20 19-03-32: resize arrays
2023-09-20 19-03-32: cell array size: 300000
2023-09-20 19-03-32: particle array size: 300000
2023-09-20 19-03-32: auxiliary data size: 300000
2023-09-20 19-03-32: 710 MB GPU memory used
2023-09-20 19-03-46: resize arrays
2023-09-20 19-03-46: cell array size: 300000
2023-09-20 19-03-46: particle array size: 300000
2023-09-20 19-03-46: auxiliary data size: 900000
2023-09-20 19-03-46: 723 MB GPU memory used
2023-09-20 19-03-49: resize arrays
2023-09-20 19-03-49: cell array size: 300000
2023-09-20 19-03-49: particle array size: 300000
2023-09-20 19-03-49: auxiliary data size: 2700000
2023-09-20 19-03-49: 728 MB GPU memory used
2023-09-20 19-14-33: CUDA error. Location: Base.cuh:225 code=719(cudaErrorLaunchFailure) "cudaMemcpy(&result, source, sizeof(T), cudaMemcpyDeviceToHost)"
2023-09-20 19-14-33: CUDA error. Location: CudaSimulationFacade.cu:152 code=46(cudaErrorDevicesUnavailable) "cudaGraphicsMapResources(1, &cudaResourceImpl)"
2023-09-20 19-14-33: network: logout
2023-09-20 19-14-34: close simulation

chrxh commented 10 months ago

This log doesn't look like it comes from debug mode: the cudaErrorLaunchFailure error is not generated by cudaMemcpy itself but by an earlier kernel call. When debug mode is enabled, there is a synchronization point and an error check after each kernel call. Only then is the error location in the log useful.

To check whether ALIEN is in debug mode, start it via alien.exe -debug (on Windows); the loading screen should then show DEBUG.
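
For anyone triaging a similar log.txt, here is a rough Python sketch that checks a log for a `DEBUG mode` entry and lists the CUDA error entries with the file, line, and call they are attributed to. It is not part of ALIEN: the function name `summarize_log` is hypothetical, and the pattern is only inferred from the log lines pasted in this thread (including the `DEBUG mode` entry in the v4.3.0 log), so other versions may log differently.

```python
import re

# Pattern inferred from the "CUDA error" lines posted in this thread (an assumption,
# not a documented log format).
CUDA_ERROR = re.compile(
    r'CUDA error\. Location: (?P<file>[^:\s]+):(?P<line>\d+) '
    r'code=(?P<code>\d+)\((?P<name>\w+)\) "(?P<call>.*)"'
)


def summarize_log(text):
    """Return (debug_mode_seen, cuda_errors) for the contents of a log.txt."""
    debug_seen = False
    errors = []
    for raw in text.splitlines():
        # Entries look like "<date> <time>: <message>"; split at the first ": ".
        _, _, message = raw.partition(": ")
        if message.strip() == "DEBUG mode":
            debug_seen = True
        match = CUDA_ERROR.search(message)
        if match:
            errors.append(match.groupdict())
    return debug_seen, errors
```

Calling `summarize_log(open("log.txt").read())` returns the flag and a list of dictionaries. An error attributed to a call such as `cudaMemcpy(...)` in a log without the `DEBUG mode` entry would match the situation described above, where the reported location is not where the failure actually happened.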

TheBarret commented 10 months ago

Oh, I don't know what else to give you. I did use that -debug flag and this is all it gives:

2023-09-22 12-42-08: CUDA error. Location: SimulationKernelsLauncher.cu:86 code=719(cudaErrorLaunchFailure) "cudaGetLastError()"
2023-09-22 12-42-08: CUDA error. Location: CudaSimulationFacade.cu:152 code=46(cudaErrorDevicesUnavailable) "cudaGraphicsMapResources(1, &cudaResourceImpl)"

chrxh commented 10 months ago

Ok, that looks better. On the Discord server you wrote that it occurs after multiplying a structure during a running simulation? Can you please give more details? Can you give instructions to reproduce the bug?

TheBarret commented 10 months ago

The bug occurs 99% of the time when I add random spores: a few seconds into the run, as the sim starts and the spores begin to replicate, the app exits. This usually happens 3 times in a row and, interestingly, after that there are no more crashes; I can run your sim for hours (probably days).

It may also help to know my specs: I run an AMD Bulldozer (FX8350) with 20 GB RAM and a lightweight GPU, a GeForce 1050 Ti (4 GB).

chrxh commented 10 months ago

Ok, I'm still not able to reproduce the crash. Maybe I have not yet understood the precise steps. From the description I assume the following steps:

  1. Create a new sim (here I need the sim parameters)
  2. Add a spore (I need the genome code)
  3. Run simulation
  4. Multiply the spore in a running sim (multiplication factor is 100?). It then crashes non-deterministically after a few seconds.

Is that correct?

TheBarret commented 10 months ago

Could it be a memory issue? This graphics card does not have much of it, 4 GB at most. If I push my GPU a little too hard with Stable Diffusion, it too runs out of memory (SD 2.0XL, for instance, is a no-go).
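
One rough way to check this against a log.txt is to pull out the largest "MB GPU memory used" figure. The Python sketch below does that; the helper name `peak_gpu_memory_mb` is hypothetical, and what exactly ALIEN counts in this figure is an assumption (it may not include memory used by other applications on the same card).

```python
import re

# Matches entries such as "707 MB GPU memory used" as they appear in the logs in this thread.
GPU_MEMORY = re.compile(r"(\d+) MB GPU memory used")


def peak_gpu_memory_mb(text):
    """Return the largest 'MB GPU memory used' value found in the log, or None if absent."""
    values = [int(match.group(1)) for match in GPU_MEMORY.finditer(text)]
    return max(values) if values else None
```

For the DEBUG-mode log posted in this thread the largest reported value is 837 MB, which can be compared with the 4 GB on the card. That comparison alone does not rule a memory problem in or out.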

TheBarret commented 10 months ago

Yeah, it seems that when I use the multiplier tool, it gives me an error in a message box.

[image]

2023-09-25 13-00-18: DEBUG mode
2023-09-25 13-00-18: set windowed mode
2023-09-25 13-00-18: starting ALIEN v4.3.0
2023-09-25 13-00-19: network: login user 'TheBarret'
2023-09-25 13-00-19: network: get simulation list
2023-09-25 13-00-19: network: get user list
2023-09-25 13-00-19: network: get liked simulations
2023-09-25 13-00-24: 1 CUDA device found
2023-09-25 13-00-24: device 0: NVIDIA GeForce GTX 1050 Ti with compute capability 6.1
2023-09-25 13-00-24: device 0 is set
2023-09-25 13-00-24: initialize simulation
2023-09-25 13-00-26: resize arrays
2023-09-25 13-00-26: cell array size: 300000
2023-09-25 13-00-26: particle array size: 300000
2023-09-25 13-00-26: auxiliary data size: 300000
2023-09-25 13-00-26: 707 MB GPU memory used
2023-09-25 13-00-26: resize arrays
2023-09-25 13-00-26: cell array size: 300000
2023-09-25 13-00-26: particle array size: 300000
2023-09-25 13-00-26: auxiliary data size: 4520073
2023-09-25 13-00-26: 719 MB GPU memory used
2023-09-25 13-00-33: close simulation
2023-09-25 13-00-33: device 0 is set
2023-09-25 13-00-33: initialize simulation
2023-09-25 13-00-35: resize arrays
2023-09-25 13-00-35: cell array size: 300000
2023-09-25 13-00-35: particle array size: 300000
2023-09-25 13-00-35: auxiliary data size: 300000
2023-09-25 13-00-35: 707 MB GPU memory used
2023-09-25 13-04-45: resize arrays
2023-09-25 13-04-45: cell array size: 300000
2023-09-25 13-04-45: particle array size: 300000
2023-09-25 13-04-45: auxiliary data size: 900000
2023-09-25 13-04-45: 719 MB GPU memory used
2023-09-25 13-06-09: message dialog showing: 'Non-overlapping copies could not be created.'
2023-09-25 13-06-09: resize arrays
2023-09-25 13-06-09: cell array size: 300000
2023-09-25 13-06-09: particle array size: 300000
2023-09-25 13-06-09: auxiliary data size: 42000312
2023-09-25 13-06-09: 837 MB GPU memory used
2023-09-25 13-06-16: message dialog showing: 'Non-overlapping copies could not be created.'
2023-09-25 13-20-28: network: refresh login
2023-09-25 13-21-31: close simulation
2023-09-25 13-21-31: device 0 is set
2023-09-25 13-21-31: initialize simulation
2023-09-25 13-21-33: resize arrays
2023-09-25 13-21-33: cell array size: 300000
2023-09-25 13-21-33: particle array size: 300000
2023-09-25 13-21-33: auxiliary data size: 300000
2023-09-25 13-21-33: 707 MB GPU memory used
2023-09-25 13-22-28: resize arrays
2023-09-25 13-22-28: cell array size: 300000
2023-09-25 13-22-28: particle array size: 300000
2023-09-25 13-22-28: auxiliary data size: 900000
2023-09-25 13-22-28: 719 MB GPU memory used
2023-09-25 13-22-37: CUDA error. Location: SimulationKernelsLauncher.cu:99 code=719(cudaErrorLaunchFailure) "cudaGetLastError()"
2023-09-25 13-22-37: CUDA error. Location: CudaSimulationFacade.cu:153 code=46(cudaErrorDevicesUnavailable) "cudaGraphicsMapResources(1, &cudaResourceImpl)"
2023-09-25 13-22-37: network: logout
2023-09-25 13-22-37: close simulation

chrxh commented 10 months ago

Please only one bug per issue ;) I was able to reproduce the bug and it is fixed now in the latest commit.

TheBarret commented 10 months ago

Thank you very much for your effort and time.

TheBarret commented 9 months ago

It still occurs, and I finally captured what the logger did not append. I think these might be the details you were asking for:

Microsoft Windows [Version 10.0.19045.3448]
(c) Microsoft Corporation. All rights reserved.

d:\Apps\alien2\bin>alien
Not implemented error. File: D:\dev\alien\source\EngineGpuKernels\GenomeDecoder.cuh, Line: 221
Not implemented error. File: D:\dev\alien\source\EngineGpuKernels\GenomeDecoder.cuh, Line: 221
Not implemented error. File: D:\dev\alien\source\EngineGpuKernels\GenomeDecoder.cuh, Line: 221
...(gets repeated a lot of times)...

An uncaught exception occurred: CUDA error. Location: CudaSimulationFacade.cu:153 code=46(cudaErrorDevicesUnavailable) "cudaGraphicsMapResources(1, &cudaResourceImpl)"