Terasology / ModuleTestingEnvironment

3 stars 15 forks source link

MTE's own test suite is very slow #49

Closed keturn closed 2 years ago

keturn commented 3 years ago

This issue is not "Tests which use @ExtendWith(MTEExtension.class) are slow,"
it is "Running :modules:ModuleTestingEnvironment:test is slow."

On my workstation it takes about eight minutes, which is about as much as everything else (including engine!) put together. And it's not because MTE has a zillion tests!

Maybe in the process of fixing this, we discover what's been making MTE-using tests so slow to boot up in general.

keturn commented 3 years ago

While testing https://github.com/MovingBlocks/Terasology/pull/4611 , the test run failed with:

org.gradle.api.internal.tasks.testing.TestSuiteExecutionException: Could not complete execution for Gradle Test Executor 22.
    at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:63)
        […]
    at org.gradle.api.internal.tasks.testing.worker.TestWorker.stop(TestWorker.java:133)
        […]
    at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
    at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
    at org.terasology.engine.rendering.assets.texture.PNGTextureFormat.load(PNGTextureFormat.java:35)
    at org.terasology.engine.rendering.assets.texture.PNGTextureFormat.load(PNGTextureFormat.java:19)
    at org.terasology.assets.module.UnloadedAssetData$AssetSourceResolver.load(UnloadedAssetData.java:292)
    at org.terasology.assets.module.UnloadedAssetData.load(UnloadedAssetData.java:180)
    at org.terasology.assets.module.ModuleAssetDataProducer.getAssetData(ModuleAssetDataProducer.java:228)
    at org.terasology.assets.AssetType.lambda$reload$2(AssetType.java:359)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at org.terasology.assets.AssetType.reload(AssetType.java:357)
    at org.terasology.assets.AssetType.getNormalAsset(AssetType.java:386)
    at org.terasology.assets.AssetType.getAsset(AssetType.java:262)
    at org.terasology.assets.management.AssetManager.getAsset(AssetManager.java:223)
    at org.terasology.engine.rendering.assets.font.FontMaterialProducer.getAssetData(FontMaterialProducer.java:61)
jdrueckert commented 3 years ago

I wonder if we can formulate this issue less vague - "very slow" is subjective (which is not me saying, it's not very slow :wink: ) and makes it hard to assess when this issue would actually be resolved. Which durations do we expect? What amount of improvement would be sufficient? A.k.a what's the "Definition of Done" for this issue?

jdrueckert commented 3 years ago

Accelerate MTE test suite

keturn commented 3 years ago

Probably the real achievement here would be to understand what is taking so much time and why.

But if we want a range to aim for, I'd say 5 seconds per test would make it merely "really slow" instead of the "extraordinarily slow" it is now. That'd be about two minutes for the suite of 22 tests.

keturn commented 3 years ago

After saying that, it did run in under two minutes on my branch just now. (though not every test was under 5s)

Does that branch speed it up so much? Or is it faster because I only have iota modules, and it would be slower in an omega workspace? 🤔

Cervator commented 3 years ago

Or is it faster because I only have iota modules

We've definitely seen issues in the past caused/aggravated by the presence of other modules, even if they aren't activated for whatever is going on. But on the other hand it is slow in Jenkins which has the minimum amount of modules.

If the tests in the MTE itself are just more memory hungry, are we sure they'll take advantage of all available memory? I think the current build agents are configured to offer up to 6 GB, but I have no idea if Gradle is allowing that (especially if it launches in different contexts / peripheral ones like module test cases rather than a plain engine build)

skaldarnar commented 3 years ago

The devil is in the detail here: The tests are not running out of heap space, but out of native memory space. The eBay team wrote a SRE Case Study on this, and there a couple of other blog posts on troubleshooting it.

Caused by: java.lang.OutOfMemoryError: Direct buffer memory

It does look like we accumulate more and more used memory in the native space each time we start up a new engine for an MTE test, seemingly without cleaning up that memory in between. I can run a couple of tests locally just fine, but once crossing a specific threshold the test executor will crash eventually.


In a bit more detail: it all starts from StateLoading, which tries to push a screen, and then we do a bunch of asset (re-)loading. This happens each time we start an MTE test.

    ...
    at org.terasology.engine.core.modes.StateLoading.init(StateLoading.java:139)
    at org.terasology.engine.core.TerasologyEngine.switchState(TerasologyEngine.java:591)
    at org.terasology.engine.core.TerasologyEngine.processPendingState(TerasologyEngine.java:580)
    at org.terasology.engine.core.TerasologyEngine.tick(TerasologyEngine.java:484)
    at org.terasology.moduletestingenvironment.ModuleTestingEnvironment.createHost(ModuleTestingEnvironment.java:475)
    at org.terasology.moduletestingenvironment.ModuleTestingEnvironment.setup(ModuleTestingEnvironment.java:167)
    at org.terasology.moduletestingenvironment.MTEExtension.beforeAll(MTEExtension.java:93)
    ...

It does look like we have a native memory leak in there, which becomes apparent here because we start and stop the engine a bunch of times... Eventually, the test executor will crash with the above mentioned OOM.

Caused by: java.lang.OutOfMemoryError: Direct buffer memory
    at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
    at org.terasology.engine.rendering.assets.texture.AWTTextureFormat.convertToTextureData(AWTTextureFormat.java:78)
    at org.terasology.engine.rendering.assets.texture.AWTTextureFormat.load(AWTTextureFormat.java:50)
    at org.terasology.engine.rendering.assets.texture.AWTTextureFormat.load(AWTTextureFormat.java:25)
    at org.terasology.gestalt.assets.format.producer.UnloadedAssetData$AssetSourceResolver.load(UnloadedAssetData.java:311)
    at org.terasology.gestalt.assets.format.producer.UnloadedAssetData.load(UnloadedAssetData.java:181)
    at org.terasology.gestalt.assets.format.producer.AssetFileDataProducer.getAssetData(AssetFileDataProducer.java:227)
    at org.terasology.gestalt.assets.AssetType.lambda$reload$2(AssetType.java:360)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at org.terasology.gestalt.assets.AssetType.reload(AssetType.java:358)
    ...
keturn commented 3 years ago

It seems likely that some of the things causing problems for re-using the same process for multiple MTE tests might also have been causing problems for switching between games in the same Terasology run. There was some work on that in https://github.com/MovingBlocks/Terasology/pull/4829; we should review that and see if MTE is doing the appropriate dispose steps.

keturn commented 2 years ago

Follow-ups:

[closing because I can't transfer issues between @Terasology and @MovingBlocks]