chrxh / alien

ALIEN is a CUDA-powered artificial life simulation program.
https://alien-project.org
BSD 3-Clause "New" or "Revised" License

Rendering broken on Nvidia 10 series (1080ti & 1070ti) #21

Closed: jhugard closed this issue 2 years ago

jhugard commented 3 years ago

When first started, I can pan, zoom, edit, etc. But as soon as Run is hit, rendering breaks completely: the scroll bars indicate that zooming is working, but the display never updates.

In addition, once Run is clicked, the program hangs on exit (one CPU core is pegged at 100%), regardless of whether the simulation is running or paused.

I have updated to the current CUDA (11.3.1) and the current Nvidia driver (466.77) with no luck.

I am running a 1070ti. Other reports of 1070ti and 1080ti failures can be found on Hacker News here.

chrxh commented 3 years ago

Hi, this sounds like a CUDA kernel timeout issue, because the GUI thread still seems to be running. I had received reports that on the Nvidia 10 series the default CUDA thread block count might be too high. If this is the case for you, a simple solution would be, for example, to halve the number of blocks (under Settings -> General Settings -> Number of Blocks). Does this work for you?

Unfortunately, I haven't yet found a good heuristic for the default value.
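
A possible direction for such a heuristic, purely as an illustrative sketch with a made-up scaling factor (this is not what ALIEN currently does), would be to derive the default block count from the device properties reported by the CUDA runtime, e.g. a few blocks per streaming multiprocessor:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Sketch: scale the default block count with the number of streaming
    // multiprocessors instead of using a fixed value. The factor 16 is an
    // arbitrary placeholder, not a value taken from ALIEN.
    int suggestedNumBlocks(int device = 0)
    {
        cudaDeviceProp prop{};
        if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
            return 64;  // conservative fallback
        }
        return prop.multiProcessorCount * 16;
    }

    int main()
    {
        std::printf("suggested number of blocks: %d\n", suggestedNumBlocks());
        return 0;
    }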

jhugard commented 3 years ago

No luck. I cut threads and blocks to 32/64, as well as to 16/16. I also tried halving all other settings, but there was no change in symptoms.

chrxh commented 3 years ago

Hm... It seems to be difficult to get CUDA programs to work on different hardware. I tested it on a 1050 Ti without problems. Do you experience the same issue with the simple physics example in examples/simulations/collisions.sim? It uses only the physics part of the engine.

chrxh commented 3 years ago

Sorry, I actually meant obstacles.sim.

jhugard commented 3 years ago

The simulation runs until shortly after the 2nd vertical bar hits the 3rd one. I can exit without a hang until this happens.

So, obstacles.sim also fails.

domano commented 3 years ago

Having the same issue with a 1070 (not ti)

chrxh commented 3 years ago

Good to know that it's a more general problem. I've a wild guess as to what it might be. But it's hard for me to fix it if I can't reproduce the bug myself. I'll try to provide a special debug version and then come back to you.

UltraSabreman commented 3 years ago

I'm having the same issue with my 1080ti, including the same results with obstacles.sim.

scrps commented 3 years ago

I've actually managed to get obstacles.sim running on my 1060 6GB. I set the number of threads per block to 1000; after changing it around, it seems to fail when set to anything under 900. Hope that helps.

Edit: Realized I put 1080 instead of 1060.

chrxh commented 3 years ago

Thanks for all the info! I'll provide the automated tests that I used during development. This should give a more precise indication. I'll get back to you when I have them ready.

chrxh commented 3 years ago

The installer of the latest version 2.5.3 now provides ~300 compiled integration tests. Could you please run them (Tests.exe in the bin folder)? Alternatively, you can compile them yourself (it's the Tests project). Possibly 1-2 tests fail non-deterministically (I haven't analyzed them yet... this can be ignored for the moment). Thanks a lot!

Aurel-C commented 3 years ago

When I run the tests with a 1070ti, it gets stuck running CleanupGpuTests.testCleanupCells (no failure).

chrxh commented 3 years ago

Ok, there we have a timeout. Unfortunately, this info is of limited help. This test checks a kind of garbage collector: it runs a test simulation in which many cells are created and destroyed and finally verifies that there is no memory leak. There are many more fine-granular tests that will (unfortunately) be executed later, because the tests run in alphabetical order. It would be very helpful to know all test results. I don't know of any mechanism to terminate a CUDA kernel from host code (i.e. on the CPU). To achieve this, you have to set a timeout for CUDA kernels in the OS. This is explained, for example, here: https://forums.developer.nvidia.com/t/cuda-kernel-timeout/12160/5. It would be great if you could try this out.
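
As a side note (a minimal sketch, not part of the project): whether the OS watchdog timeout is active on a device can be queried from code via the CUDA device properties:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Sketch: query whether the OS watchdog (kernel execution timeout) is
    // enabled for device 0.
    int main()
    {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, 0);
        std::printf("kernel execution timeout enabled: %s\n",
                    prop.kernelExecTimeoutEnabled ? "yes" : "no");
        return 0;
    }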

If that doesn't work: a more time-consuming option would be to skip this test with a command-line parameter: Tests.exe --gtest_filter=-*CleanupGpuTests* But then, of course, we can run into the same problem with another coarse-granular test that hits the timeout. Nevertheless, it would give me valuable information.

hermanTenuki commented 3 years ago

All the same symptoms as above on a 1050Ti + i5 7400. In the tests, it also gets stuck on CleanupGpuTests.testCleanupCells. I tried Tests.exe --gtest_filter=-*CleanupGpuTests*, which got stuck on ClusterGpuTests.testFusionOfHorizontalClusters. Then Tests.exe --gtest_filter=-*CleanupGpuTests*:*ClusterGpuTests* got me stuck on ClusterGpuWithManyThreadsPerBlockTests.regressionTestManyRectangleClusters, and so on with ClusterGpuWithOneBlockTests.regressionTestFusionAndHeavyCollision (I didn't try to ignore more of them at this point). I hope this somewhat helps you :/

UltraSabreman commented 3 years ago

I've attempted to disable the timeout using multiple methods: the linked TdrLevel registry edit, going through Nsight Monitor, and other things I found while googling around, but nothing has worked. It still gets stuck/times out on CleanupGpuTests.testCleanupCells.

chrxh commented 3 years ago

@hermanTenuki: This may be helpful. Perhaps the problem is located in the fusion algorithm. Fusions can be prevented by setting the velocity threshold very high. So you could try the following: load the obstacles.sim simulation, open the dialog Settings -> Simulation parameters and set the simulation parameter cell -> fusion velocity to, say, 1000. Does it still crash?

chrxh commented 3 years ago

@hermanTenuki: Can you also please send me the list of tests which were successful? Thx!

scrps commented 3 years ago

> @hermanTenuki: Can you also please send me the list of tests which were successful? Thx!

I've also been working on that; sorry for the wall of text. This was tested on Win 10, GTX 1060 6GB with the latest drivers. Any test listed without an error means it hung (maybe add a 5000-6000 ms timeout to the test code?):

CleanupGpuTests:

    Failed on:
    testCleanupCells
    testCleanupCellPointers
    testCleanupTokens

ClusterGpuTests:

    Failed on:
    testFusionOfHorizontalClusters
    testFastRotatingCluster
    regressionTestManyOverlappingRectangleClusters
    regressionTestManyRectangleClusters_concentratedAtUniverseBoundary

ClusterGpuWithManyThreadsPerBlockTests:

    All failed

ClusterGpuWithOneBlockTests:

    Failed on:      
    regressionTestFusionAndHeavyCollision

CommunicatorGpuTests:

    All failed

ConstructorGpuTests:

    Failed on:
    testConstructFirstCellOnHorizontalCluster_standardParameters
    testConstructFirstCellOnHorizontalCluster_nonStandardParameters1
    testConstructFirstCellOnHorizontalCluster_nonStandardParameters2
    testConstructFirstCellOnHorizontalCluster_ignoreDistanceOnFirstConstructedCell1
    testConstructFirstCellOnHorizontalCluster_ignoreDistanceOnFirstConstructedCell2
    testConstructFirstCellOnHorizontalCluster_rightHandSide
    testConstructFirstCellOnHorizontalCluster_errorNoEnergy
    testConstructFirstCellOnHorizontalCluster_otherClusterRightObstacle_safe
    testConstructFirstCellOnHorizontalCluster_otherClusterLeftObstacle_safe
    testConstructFirstCellOnHorizontalCluster_ownClusterObstacle_backward
    testConstructFirstCellOnWedgeCluster_leftHandSide
    testConstructFirstCellOnTiangleCluster
    testConstructFirstCellOnHorizontalCluster_withEmptyToken
    testConstructFirstCellOnHorizontalCluster_withDuplicatedToken
    testConstructFirstCellOnHorizontalCluster_finishWithoutSeparation
    testConstructFirstCellOnHorizontalCluster_finishWithSeparation
    testConstructFirstCellOnHorizontalCluster_finishWithSeparation_otherClusterRightObstacle_safe
    testConstructFirstCellOnHorizontalCluster_finishWithSeparationAndReduction
    testConstructSecondCellOnHorizontalCluster_standardParameters
    testConstructSecondCellOnHorizontalCluster_maxDistance
    testConstructSecondCellOnHorizontalCluster_rightHandSide
    testConstructSecondCellOnHorizontalCluster_leftHandSide
    testConstructSecondCellOnHorizontalCluster_errorNoEnergy
    testConstructSecondCellOnHorizontalCluster_otherClusterRightObstacle_safe
    testConstructSecondCellOnHorizontalCluster_otherClusterLeftObstacle_safe
    testConstructSecondCellOnSelfTouchingCluster_ownClusterObstacle_safe
    testConstructSecondCellOnHorizontalCluster_ownClusterObstacle_backward
    testConstructSecondCellOnHorizontalCluster_withEmptyToken
    testConstructSecondCellOnHorizontalCluster_withEmptyToken_tokenAnreadyOnConstructionSite
    testConstructSecondCellOnHorizontalCluster_withDuplicatedToken
    testConstructSecondCellOnHorizontalCluster_withDuplicatedToken_suppressed
    testConstructSecondCellOnHorizontalCluster_finishWithoutSeparation
    testConstructSecondCellOnHorizontalCluster_finishWithSeparation_standardPosition
    testConstructSecondCellOnHorizontalCluster_finishWithSeparation_nonStandardPosition
    testConstructSecondCellOnHorizontalCluster_finishWithSeparationAndReduction
    testConstructSecondCellOnHorizontalCluster_finishWithTokenAndSeparationAndReduction
    testConstructSecondCellOnHorizontalCluster_finishWithTokenAndSeparationAndReduction_suppressed
    testConstructSecondCellOnHorizontalCluster_finishWithSeparation_otherClusterRightObstacle_safe
    testConstructThirdCellOnHorizontalCluster_standardParameters
    testConstructThirdCellOnLineCluster_nonStandardParameters
    testConstructThirdCellOnLineCluster_tokenOnConstructionSite
    testConstructThirdCellOnLineCluster_noMultipleConnections_limitMaxConnections
    testRotationOnlyOnHorizontalCluster
    testRotationOnlyOnHorizontalCluster_otherClusterObstacle_safe
    testRotationOnlyOnHorizontalCluster_ownClusterObstacle_safe
    testLargeCluster_limitSize
    testParallelConstructionFromDifferentSources
    testParallelConstructionFromDifferentConstructors_manyIsolatedClusters
    testParallelConstructionFromDifferentConstructors_touchingClusters
    testMultipleConnectedConstructionSites_errorConnection

ConstructorGpuWithHighBlockCountTests:

    All failed

GpuBenchmark:

    All failed
    testClusterAndParticleMovement caused tests.exe to crash with:
    Error: unknown file: error: C++ exception with description "CUDA error at CudaSimulation.cu:192 code=700(cudaErrorIllegalAddress) "cudaGetLastError()"" thrown in the test body.
    CUDA error at D:\temp\alien\source\EngineGpuKernels\Array.cuh:47 code=46(cudaErrorDevicesUnavailable) "cudaMemcpy(&data, _data, sizeof(T*), cudaMemcpyDeviceToHost)"

PropulsionGpuTests:

    Failed on:
    testParallelization1
    testParallelization2

ReplicatorGpuTests:

    All failed
    Error: unknown file: error: C++ exception with description "<unspecified file>(1): expected value" thrown in the test body.

ReplicatorGpuTestsWithManyThreads:

    All failed
    Error: unknown file: error: C++ exception with description "<unspecified file>(1): expected value" thrown in the test body.

ScannerGpuTests:

    Failed on:
    testScanRestart2

SensorGpuTests:

    Failed on:
    testSearchVicinity_success
    testSearchByAngle_success
    testSearchFromCenter_success
    testSearchTowardCenter_success

TokenEnergyGuidanceGpuTests:

    Failed on:
    testParallelization

TokenSpreadingGpuTests:

    Failed on:
    testMovementWithFittingBranchNumbers_manyLargeClusters
    testMovementWithEncounter
    testMovementDuringDecomposition
    testCreationAfterSecondFusion
    testMovementWithTooManyTokens
    testMassiveMovements
    testMovementOnDestroyedCell_closeCell
    regressionTestManyStickyRotatingTokenClusters
    testCellDecayDueToTokenUsage

WeaponGpuTests:

    Failed on:
    testStrike
    Error: D:\temp\alien\source\Tests\WeaponGpuTests.cpp(155): error: Expected equality of these values:
          -expectedEnergyLoss
            Which is: -11
          *result.energyDiffOfTarget1
            Which is: -10.5

chrxh commented 3 years ago

Thanks a lot for your work! Something fundamental doesn't seem to be working. I'm trying to figure out what it could be based on the test list. But I'm afraid that without the hardware it will be hard for me to identify the problem. I'll probably compose a test with detailed console output. I can't think of anything else right now :-/

Yes, I'd like to be able to automatically terminate a test after a timeout of x seconds. But it doesn't seem that easy to terminate a running CUDA kernel from CPU code. If you have any ideas, I would be very grateful!
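
One possible workaround, sketched below under the assumption that killing the whole test process is acceptable (the running kernel itself still cannot be aborted from the host), is a small driver that starts Tests.exe for a single test case via --gtest_filter and terminates the child process after a timeout, so the remaining tests can still be run:

    #include <windows.h>
    #include <iostream>
    #include <string>

    // Sketch: run one gtest case of Tests.exe in a child process and kill it
    // if it exceeds the timeout. The running kernel is not interrupted, but
    // the surrounding process is, so the test run can continue.
    bool runTestWithTimeout(const std::string& testName, DWORD timeoutMs)
    {
        std::string cmd = "Tests.exe --gtest_filter=" + testName;

        STARTUPINFOA si{};
        si.cb = sizeof(si);
        PROCESS_INFORMATION pi{};

        if (!CreateProcessA(nullptr, cmd.data(), nullptr, nullptr, FALSE, 0,
                            nullptr, nullptr, &si, &pi)) {
            std::cerr << "could not start: " << cmd << "\n";
            return false;
        }

        bool finished = WaitForSingleObject(pi.hProcess, timeoutMs) == WAIT_OBJECT_0;
        if (!finished) {
            std::cerr << testName << ": timeout, terminating child process\n";
            TerminateProcess(pi.hProcess, 1);
        }
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
        return finished;
    }

    int main()
    {
        // Hypothetical usage; the test names would come from
        // Tests.exe --gtest_list_tests.
        runTestWithTimeout("CleanupGpuTests.testCleanupCells", 6000);
        return 0;
    }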

Aurel-C commented 3 years ago

Setting fusion velocity to 1000 fixes collisions.sim and obstacles.sim but not default.sim.

chrxh commented 3 years ago

I've prepared a test with verbose console output. It's stored in alien-project.org/files/Debug.zip. It contains the compiled Test project. Please run it with Test.exe --gtest_filter=*testFusionOfHorizontalClusters. Only one GPU thread is used (at least I hope that the error can be reproduced with a single thread). Could you please tell me the output (the last line will do assuming it hangs)?

UltraSabreman commented 3 years ago

Ran as requested, output here: https://termbin.com/0lyi

I also ran the full set of tests while dumping the output to a file. I killed the process when the file reached half a gig. During the hang, the program seems to be bouncing around between lines 180, 182, 178, 170, and 168 of D:\temp\alien\source\EngineGpuKernels\SimulationKernels.cuh.

chrxh commented 3 years ago

Some of the other tests use thousands of GPU threads; that's why the console can't cope with the output anymore. I think we need to play some more ping-pong :) I've modified the above test (testFusionOfHorizontalClusters) to run with multiple GPU threads again (so it should now run into the timeout), but to produce output only at sync points: alien-project.org/files/Debug.zip. This should at least allow a somewhat more precise localization in the code.

UltraSabreman commented 3 years ago

Hangs on D:\temp\alien\source\EngineGpuKernels\SimulationKernels.cuh:157

Full output here if you need it: https://termbin.com/mu0x

chrxh commented 3 years ago

Next round: I now use only 1 thread block (but with more than one thread) and more output.

UltraSabreman commented 3 years ago

Hangs on D:\temp\alien\source\EngineGpuKernels\SimulationKernels.cuh:81. Output: https://termbin.com/rknp

chrxh commented 3 years ago

That helps! Thanks! Is there still a timeout? If not, can you please run all tests?

UltraSabreman commented 3 years ago

Major progress, I can run the tests now, however it errors out about 1/3 of the way through:

[ RUN      ] ClusterGpuTests.regressionTestManyOverlappingRectangleClusters
Qt has caught an exception thrown from an event handler. Throwing
exceptions from an event handler is not supported in Qt.
You must not let any exception whatsoever propagate through Qt code.
If that is not possible, in Qt 5 you must at least reimplement
QCoreApplication::notify() and catch all exceptions there.

D:\temp\alien\source\Tests\ClusterGpuTests.cpp(916): error: Expected: IntegrationTestHelper::runSimulation(300, _controller) doesn't throw an exception.
  Actual: it throws.
CUDA error at D:\temp\alien\source\EngineGpuKernels\Array.cuh:47 code=46(cudaErrorDevicesUnavailable) "cudaMemcpy(&data, _data, sizeof(T*), cudaMemcpyDeviceToHost)"

chrxh commented 3 years ago

This exception seems to occur during termination (maybe something different from the problem above). The actual problem has something to do with thread divergence. The behavior of some Geforce 10-series cards is apparently different from the 20-series and 30-series. I'll have to think about this in more detail. For testing, I've included alien.exe in the Debug.zip. Maybe you can check whether the standard example works now.

UltraSabreman commented 3 years ago

It works. I ran every simulation included in the examples folder. Only planet gaia and replicator - ecosystem crashed: Gaia after a minute or two of the simulation running, and ecosystem every time it's started. I sent out error reports on both crashes. Let me know if you need more.

chrxh commented 3 years ago

It'll take a little while. The background of the problem is as follows: in CUDA it's very costly to implement critical sections, and one should avoid them by organizing the data appropriately. Unfortunately, this isn't possible everywhere here due to the complexity of the simulation code. The mutex implementation doesn't seem to work on some cards of the 10-series, so a few places in the code have to be reorganized/rewritten.
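
To illustrate the kind of construct that is fragile here (a minimal sketch for illustration only, not ALIEN's actual mutex code): a naive spin lock can deadlock on pre-Volta GPUs such as the 10-series, because without independent thread scheduling the spinning threads of a warp can starve the thread that holds the lock. A commonly suggested workaround is to acquire, use and release the lock within a single loop iteration:

    // Illustration of the divergence pitfall; hypothetical code, not from ALIEN.

    // Naive pattern: can deadlock on Pascal (10-series) when several threads of
    // the same warp compete for the lock, because the winner waits at the
    // reconvergence point after the loop while the others spin forever.
    __device__ void badCriticalSection(int* lock, int* counter)
    {
        while (atomicCAS(lock, 0, 1) != 0) {
            // spin
        }
        ++(*counter);         // critical section
        atomicExch(lock, 0);  // release
    }

    // More robust pattern: a thread only leaves the loop after it has acquired
    // the lock, done its work and released it within the same iteration, so the
    // other threads of the warp get a chance to acquire it afterwards.
    __device__ void saferCriticalSection(int* lock, int* counter)
    {
        bool done = false;
        while (!done) {
            if (atomicCAS(lock, 0, 1) == 0) {
                ++(*counter);         // critical section
                atomicExch(lock, 0);  // release
                done = true;
            }
        }
    }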

chrxh commented 3 years ago

I'm in the process of rewriting and improving the entire simulation code for the next major version. In the new engine this blocking issue should not appear anymore. Could someone with a Geforce 10 card please test alien-project.org/files/alien-version3-preview.zip (built from branch features/version3) and run the simulation, which is loaded automatically after startup? This simulation is intended to show machines that create elastic ribbons, which in turn fold into curved structures. It would be nice to let me know if this works. Thanks in advance! Many functions are disabled (e.g. the editor does not work) and there is still a lot to do; this is only an intermediate result. You can zoom and apply forces to the particles with the mouse button pressed while the simulation is running (in "Action mode").

nodeSpace commented 3 years ago

I have a GTX 1060 3GB. After running alien-project.org/files/alien-version3-preview.zip (on Windows 10), the view window freezes after a second or so. When I try to step through it frame by frame, after roughly 150 steps the entire program freezes (not just the view window) and I get the "alien.exe is not responding" Windows popup.

chrxh commented 3 years ago

Thanks a lot for your help! I've added a correction and additional console log output. It would still help me a lot if you could run the program (alien-project.org/files/alien-version3-preview.zip) again. Does it still freeze? If so, could you please send me the last output from the console (it runs in the background and opens in addition to the main window)? Thanks a lot! I know this is a bit of a nuisance, but unfortunately I see no other way to get information about the problem, because I cannot reproduce it on my hardware.

nodeSpace commented 3 years ago

Just ran the new version, nope, no freeze this time!

chrxh commented 3 years ago

Great, thank you very much! The performance is greatly reduced in this version due to the massive console outputs. In the future, I'll probably need some further tests from time to time.

chrxh commented 2 years ago

I've been working a lot on the new major version lately, and it would be a great help for me to know if everything works on an Nvidia 10 card. There are many visible changes now. In particular, I've changed the GUI framework and now use Dear ImGui (...and thus got rid of many problems).

Could someone please give this a try? The source code is on the feature branch https://github.com/chrxh/alien/tree/features/version3. (A Windows binary/compiled version is available on https://alien-project.org/files/alien-version3-preview.zip.)

nodeSpace commented 2 years ago

Seems to work pretty well, no issues so far for me!

hermanTenuki commented 2 years ago

Now it works fine on 1050ti, well done

nodeSpace commented 2 years ago

I managed to crash the new version when I maxed out radiation strength, set minimum energy very low, and removed the spot areas. I think it's because it tried to use more memory than my GPU has, as these settings cause many cells to spawn simultaneously. (This is probably more of an adversarial case though; it ran fine before I set the extreme values.)

chrxh commented 2 years ago

Super, thanks for all the testing! This is still a work in progress; I haven't yet tried all extreme situations. There is now a heuristic that constantly predicts the near-future memory requirements of particles, cells, etc., and the arrays in the GPU are automatically adjusted to that prediction. Maybe it can't handle very sudden and large changes yet. Or, as you said, the GPU memory was simply no longer sufficient. EDIT: You can get more information if you enable the log window and switch to "verbose".
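
As a rough illustration of such a scheme (only a sketch with hypothetical names and factors, not ALIEN's actual implementation): track the recent growth of the entity counts and reallocate the device arrays with some headroom whenever the prediction approaches the current capacity.

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>

    // Sketch of a predictive resizing heuristic (hypothetical names and factors).
    template <typename T>
    struct DeviceArray
    {
        T* data = nullptr;
        size_t capacity = 0;

        // Grow the array if the predicted element count approaches the current
        // capacity. For simplicity this sketch does not preserve the contents.
        void ensureCapacity(size_t currentCount, size_t recentGrowthPerStep)
        {
            // Naive prediction: current count plus a few steps of recent growth.
            size_t predicted = currentCount + 10 * recentGrowthPerStep;
            if (predicted <= capacity) {
                return;
            }
            size_t newCapacity = std::max(predicted * 3 / 2, capacity * 2);
            cudaFree(data);
            cudaMalloc(&data, newCapacity * sizeof(T));
            capacity = newCapacity;
        }
    };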

If you are interested, there are more examples to play around with: https://github.com/chrxh/alien/tree/features/version3/examples/simulations There is currently no built-in editor like in the old version. The examples were constructed in the old one and then imported.

It will take a few more months until the new built-in editor is ready.

chrxh commented 2 years ago

I'm closing this issue because the problem is now solved.