babl-ws / babl

Low-latency WebSocket Server
https://babl.ws
Apache License 2.0

Unsafe operation causes InternalError on ARM (RPI 4) #53

Closed eliquinox closed 3 years ago

eliquinox commented 3 years ago

Running WS server on the following system:

Raspberry Pi 4 Model B 8GB
OS: Ubuntu Server 20.04 (64 bit)

Exception in thread "main" java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at com.aitusoftware.babl.monitoring.MappedSessionContainerStatistics.<init>(MappedSessionContainerStatistics.java:52)
    at com.aitusoftware.babl.websocket.SessionContainer.<init>(SessionContainer.java:125)
    at com.aitusoftware.babl.websocket.BablServer.initialiseServerInstance(BablServer.java:252)
    at com.aitusoftware.babl.websocket.BablServer.launch(BablServer.java:140)

The same program works perfectly on my Dell XPS laptop, running Ubuntu 18.04 with the following CPU:

eliquinox@eliquinox-XPS-15-9560:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:            9
CPU MHz:             800.022
CPU max MHz:         3800.0000
CPU min MHz:         800.0000
BogoMIPS:            5599.85
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7

I am willing to test any suggestions.

eliquinox commented 3 years ago

Additional information.

Running ./gradlew test on the above RPI spec produces the following failures:

WebSocketSessionPollModeAcceptanceTest > shouldHandleMultipleSessions() FAILED
    com.google.common.truth.AssertionErrorWithFacts at WebSocketSessionPollModeAcceptanceTest.java:114

MultipleWebSocketSessionDetachedSessionContainerAcceptanceTest > shouldHandleMultipleSessions() FAILED
    java.lang.InternalError at MultipleWebSocketSessionDetachedSessionContainerAcceptanceTest.java:82

SingleWebSocketSessionDirectSessionContainerAcceptanceTest > shouldEchoMediumAndLargePayload() FAILED
    java.lang.InternalError at SingleWebSocketSessionDirectSessionContainerAcceptanceTest.java:96

SingleWebSocketSessionDirectSessionContainerAcceptanceTest > shouldSendCloseResponseMessage() FAILED
    java.lang.InternalError at SingleWebSocketSessionDirectSessionContainerAcceptanceTest.java:96

SingleWebSocketSessionDirectSessionContainerAcceptanceTest > validationFailureShouldCauseDisconnect() FAILED
    java.lang.InternalError at SingleWebSocketSessionDirectSessionContainerAcceptanceTest.java:96

SingleWebSocketSessionDirectSessionContainerAcceptanceTest > shouldHandleSingleClient() FAILED
    java.lang.InternalError at SingleWebSocketSessionDirectSessionContainerAcceptanceTest.java:96

SingleWebSocketSessionDirectSessionContainerAcceptanceTest > shouldPropagateUpgradeRequestHeaders() FAILED
    java.lang.InternalError at SingleWebSocketSessionDirectSessionContainerAcceptanceTest.java:96

SingleWebSocketSessionDirectSessionContainerAcceptanceTest > shouldRespondToPings() FAILED
    java.lang.InternalError at SingleWebSocketSessionDirectSessionContainerAcceptanceTest.java:96

SingleWebSocketSessionDirectSessionContainerAcceptanceTest > shouldHandleUpgradeFailure() FAILED
    java.lang.InternalError at SingleWebSocketSessionDirectSessionContainerAcceptanceTest.java:96

MultipleWebSocketSessionBroadcastAcceptanceTest > shouldBroadcastOnMultipleTopics() FAILED
    java.lang.InternalError at MultipleWebSocketSessionBroadcastAcceptanceTest.java:91

MultipleWebSocketSessionDirectSessionContainerAcceptanceTest > shouldHandleMultipleSessions() FAILED
    java.lang.InternalError at MultipleWebSocketSessionDirectSessionContainerAcceptanceTest.java:68

BackPressureDirectSessionContainerAcceptanceTest > shouldHandleSingleClient() FAILED
    java.lang.InternalError at BackPressureDirectSessionContainerAcceptanceTest.java:64

DockerComposeIntegrationTest > shouldExposeMonitoringData() FAILED
    java.io.IOException at DockerComposeIntegrationTest.java:74
        Caused by: java.io.IOException at DockerComposeIntegrationTest.java:74

SessionStatisticsFileTest > insertionAndRemoval() FAILED
    java.lang.InternalError at SessionStatisticsFileTest.java:180

SessionStatisticsFileTest > shouldRemoveEntryInMiddleOfFile() FAILED
    java.lang.InternalError at SessionStatisticsFileTest.java:162

SessionStatisticsFileTest > shouldRemoveEntryAtEndOfFile() FAILED
    java.lang.InternalError at SessionStatisticsFileTest.java:180

SessionStatisticsFileTest > shouldContainMaximumEntryCount() FAILED
    java.lang.InternalError at SessionStatisticsFileTest.java:100

SessionStatisticsFileTest > shouldRemoveEntryAtStartOfFile() FAILED
    java.lang.InternalError at SessionStatisticsFileTest.java:162

SessionStatisticsFileTest > shouldAddNewEntries() FAILED
    java.lang.InternalError at SessionStatisticsFileTest.java:72

Let me know if you need specific traces for any of the above failures.

epickrram commented 3 years ago

Hi, thank you for the report. ARM is not a target platform at the moment, and we don't have an RPI to reproduce on. If you would like to investigate, I would suggest looking at any hs_err_pid files output during the test run. Please raise a PR if you identify any changes that fix the issue you are seeing.

epickrram commented 3 years ago

@eliquinox the only other thing that comes to mind is that it could be an alignment issue. There are some resources suggesting that ARM has different alignment requirements for certain instructions. You could try modifying the offsets in the various Mapped* files to all be 64 bits wide, for instance (or 128 bits, depending on architecture [1]). You'll need to trace where all those alignments start from. For instance, the alignment here:

https://github.com/babl-ws/babl/blob/master/src/main/java/com/aitusoftware/babl/monitoring/MappedSessionContainerStatistics.java#L25

is offset by 12 bytes here:

https://github.com/babl-ws/babl/blob/master/src/main/java/com/aitusoftware/babl/monitoring/ServerMarkFile.java#L30

You may also have issues if the cache-line length is not 64 bytes.

[1] https://community.arm.com/developer/ip-products/processors/f/cortex-m-forum/7154/alignment-in-arm
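
To illustrate the alignment arithmetic discussed above, here is a minimal sketch in plain Java. The class name, method names, and the 64-byte cache-line constant are assumptions made for illustration and are not taken from the babl source; the only value borrowed from the thread is the 12-byte offset.

```java
// Illustrative sketch only; names here are hypothetical and not part of babl.
final class AlignmentSketch {
    // Assumed cache-line length; as noted above, it may not be 64 bytes on every CPU.
    static final int ASSUMED_CACHE_LINE_BYTES = 64;

    /** Rounds value up to the next multiple of alignment (which must be a power of two). */
    static int align(final int value, final int alignment) {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0) {
            throw new IllegalArgumentException(
                "alignment must be a positive power of two: " + alignment);
        }
        return (value + (alignment - 1)) & -alignment;
    }

    /** Returns true if offset is a multiple of alignment (a power of two). */
    static boolean isAligned(final long offset, final int alignment) {
        return (offset & (alignment - 1)) == 0;
    }

    public static void main(final String[] args) {
        final int headerLength = 12; // the 12-byte offset mentioned above

        System.out.println(isAligned(headerLength, Long.BYTES));           // prints false
        System.out.println(align(headerLength, Long.BYTES));               // prints 16
        System.out.println(align(headerLength, ASSUMED_CACHE_LINE_BYTES)); // prints 64
    }
}
```

Under these assumptions, a region that begins 12 bytes into a mapped file is not 8-byte aligned, so padding the start to 16 (or to a full cache line) would be one way to make 64-bit fields inside it land on aligned addresses.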

eliquinox commented 3 years ago

Appreciate the input, @epickrram. I have been experimenting with aligning buffer offsets and sizes to 32 bits:

https://github.com/eliquinox/babl

I have managed to fix a number of alignment issues using the Aeron source as inspiration. I am currently stuck on maskingKeyBytes in FrameDecoder. Frankly, I do not know what could be wrong with accessing elements of a 4-byte array on ARM, but an InternalError is thrown at the line where the array is instantiated.

epickrram commented 3 years ago

Firstly, thank you for taking the time to try to improve the project, and it's good to hear that you are making progress.

You say that an InternalError is thrown at the line of the array's instantiation, but that would imply that the processor doesn't allow valid Java programs to execute (i.e. declaring and initialising a 4-byte array).

I think there's a clue in the error message, "recent unsafe memory access operation": the unsafe access is probably somewhere 'close by' in the temporal sense.

Looking at the code, the most recent native-memory access before the FrameDecoder is instantiated starts here:

https://github.com/babl-ws/babl/blob/master/src/main/java/com/aitusoftware/babl/websocket/SessionFactory.java#L66

and ends up performing a write here:

https://github.com/babl-ws/babl/blob/master/src/main/java/com/aitusoftware/babl/monitoring/MappedSessionStatistics.java#L269

Can you double-check that alignments here are correct?
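
One possible way to double-check is to list every field's absolute offset (the base offset of the record within the mapped file plus the field's offset within the record) and flag any that is not a multiple of the field's width. The sketch below does this in plain Java; the field names, offsets, and widths are hypothetical and are not the real layout of MappedSessionStatistics.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the layout below is hypothetical, not babl's actual layout.
final class LayoutCheckSketch {
    /**
     * Returns "name@absoluteOffset" for every field whose absolute offset
     * is not a multiple of its width in bytes.
     */
    static List<String> misalignedFields(
        final long baseOffset,
        final String[] names,
        final int[] relativeOffsets,
        final int[] widths) {
        final List<String> misaligned = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            final long absoluteOffset = baseOffset + relativeOffsets[i];
            if (absoluteOffset % widths[i] != 0) {
                misaligned.add(names[i] + "@" + absoluteOffset);
            }
        }
        return misaligned;
    }

    public static void main(final String[] args) {
        // Hypothetical record of three 8-byte counters.
        final String[] names = {"sessionId", "bytesRead", "bytesWritten"};
        final int[] relativeOffsets = {0, 8, 16};
        final int[] widths = {Long.BYTES, Long.BYTES, Long.BYTES};

        // A record starting 12 bytes into the file misaligns every 8-byte field.
        System.out.println(misalignedFields(12, names, relativeOffsets, widths));
        // prints [sessionId@12, bytesRead@20, bytesWritten@28]

        // Starting the record at 16 bytes leaves nothing misaligned.
        System.out.println(misalignedFields(16, names, relativeOffsets, widths));
        // prints []
    }
}
```

A check like this only validates the arithmetic of the layout. It does not prove that misalignment is the cause of the InternalError seen on ARM.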

babl-ws commented 3 years ago

Hi @eliquinox any progress on this? Anything else I can help with? I'd be happy to merge a PR to support ARM if you are able to get it to work.

eliquinox commented 3 years ago

Hi @epickrram. I did not make any significant further progress on this issue. I have switched my target architecture to x86 and no longer see the problems described. I do not think I will have the bandwidth to work on this in the near future, so unless you plan to, you can go ahead and close this issue. You may want to specify the CPU requirements somewhere in the README.md accordingly.

epickrram commented 3 years ago

Understood. Thanks for the initial investigation work. I may pick it up again in future.