AntelopeIO / leap

C++ implementation of the Antelope protocol
Other
116 stars 70 forks source link

Get error: x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform #1216

Closed matthewdarwin closed 1 year ago

matthewdarwin commented 1 year ago

Occasionally nodeos will issue error message "x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform". Seems to happen on next restart after kill -9 of nodeos.

Nodeos issues this message after already loading state (in heap mode) and then exits. Our systemd unit will restart nodeos and then it runs fine.

What's going on here? Why doesn't it check this before loading all the state? Wasting cycles loading RAM and then unloading RAM. What is this test doing?

spoonincode commented 1 year ago

Nodeos issues this message after already loading state (in heap mode) and then exits

This message is probably a red herring then as it's only a informational warning and doesn't cause nodeos to exit per se. When you see this error what kind of exit code does nodeos exit with?

matthewdarwin commented 1 year ago

Full log files have been sent privately.

image

May 25 18:16:15 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:15.366 nodeos    http_plugin.cpp:378           add_async_handler    ] add api url: /v1/net/connect
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:15.366 nodeos    http_plugin.cpp:378           add_async_handler    ] add api url: /v1/net/connections
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:15.366 nodeos    http_plugin.cpp:378           add_async_handler    ] add api url: /v1/net/disconnect
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:15.366 nodeos    http_plugin.cpp:378           add_async_handler    ] add api url: /v1/net/status
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:15.366 nodeos    producer_plugin.cpp:1222      plugin_startup       ] producer plugin:  plugin_startup() begin
May 25 18:16:19 waxtest-ship11 nodeos[429697]: warn  2023-05-25T18:16:19.003 read-0    executor.cpp:142              executor             ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
May 25 18:16:19 waxtest-ship11 nodeos[429697]: warn  2023-05-25T18:16:19.079 read-1    executor.cpp:142              executor             ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:19.157 nodeos    producer_plugin.cpp:1316      plugin_shutdown      ] exit shutdown
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:19.157 nodeos    net_plugin.cpp:3908           plugin_shutdown      ] shutdown..
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:19.157 nodeos    net_plugin.cpp:2722           plugin_shutdown      ] close 0 connections
May 25 18:16:19 waxtest-ship11 nodeos[429697]: warn  2023-05-25T18:16:19.158 read-2    executor.cpp:142              executor             ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:16:19.195 nodeos    net_plugin.cpp:3912           plugin_shutdown      ] exit shutdown
May 25 18:16:19 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, this could take a moment...
May 25 18:16:20 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, 3% complete...
May 25 18:16:49 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, 89% complete...
May 25 18:16:49 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, flushing buffers...
May 25 18:17:14 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, complete.
May 25 18:17:16 waxtest-ship11 nodeos[429697]: info  2023-05-25T18:17:16.282 nodeos    http_plugin.cpp:368           plugin_shutdown      ] exit shutdown
May 25 18:17:16 waxtest-ship11 nodeos[429697]: error 2023-05-25T18:17:16.284 nodeos    main.cpp:161                  main                 ] 3170000 producer_exception: Producer exception
May 25 18:17:16 waxtest-ship11 nodeos[429697]: read-only threads failed to start. num_threads_started: 0, time_slept_ms: 1000ms
May 25 18:17:16 waxtest-ship11 nodeos[429697]:     {"n":0,"t":1000}
May 25 18:17:16 waxtest-ship11 nodeos[429697]:     nodeos  producer_plugin.cpp:1283 plugin_startup
May 25 18:17:16 waxtest-ship11 nodeos[429697]:     {}
May 25 18:17:16 waxtest-ship11 nodeos[429697]:     nodeos  producer_plugin.cpp:1296 plugin_startup
May 25 18:17:16 waxtest-ship11 systemd[1]: nodeos.service: Main process exited, code=exited, status=254/n/a
spoonincode commented 1 year ago

There is a lot to talk about here.

The warning https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/libraries/chain/webassembly/runtimes/eos-vm-oc/executor.cpp#L142-L144 was added late in the 2.0.0 development cycle as a paranoid sniff test against something else using the GS segment selector interfering with EOS VM OC. At the time, we were only aware of https://clang.llvm.org/docs/ShadowCallStack.html And that is expressly tested for here https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/libraries/chain/webassembly/runtimes/eos-vm-oc/executor.cpp#L25-L29 besides, that feature has been removed as of clang 9.0 anyways.

This sniff test assumes that upon start of the application (and now, start of a new thread), the GS segment selector is initialized to 0x00. I am not aware of this being documented behavior anywhere, rather at the time of implementation it was simply empirically observed.

When looking at Linux source, we can find, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/process_64.c#n506 this initially looks like load_gs_index(0); but looking closer there is more to it since,


    if (static_cpu_has(X86_BUG_NULL_SEG)) {
        /* Loading zero below won't clear the base. */
        loadsegment(fs, __USER_DS);
        load_gs_index(__USER_DS);
    }

It turns out some older AMD CPUs ignore setting FS or GS to 0x00; AMD documents in their AMD64 spec that

When a null selector is loaded into FS or GS, the contents of the corresponding hidden descriptor register are not altered

and the kernel is looking for that sort of behavior with the X86_BUG_NULL_SEG "bug" discovered via https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/common.c#n1692 (and for which apparently there is a new CPUID flag NullSelect­ClearsBase that indicates 0x00 is not ignored) https://en.wikipedia.org/wiki/CPUID#EAX=80000021h:_Extended_Feature_Identification_2

The TLDR is mostly that this "x86_64 GS register is not set as expected" warning should likely just be removed because there isn't really a guarantee of its behavior anyways. And it's not the source of failure.


As for the log, first off,

error 2023-05-25T18:17:16.284 nodeos    main.cpp:161                  main                 ] 3170000 producer_exception: Producer exception
read-only threads failed to start. num_threads_started: 0, time_slept_ms: 1000ms
    {"n":0,"t":1000}

This is the true warning. For whatever reason, the 3 ROtrx threads didn't make enough progress in 1 second, which is considered an error.

The log entries have some peculiar timing. Notice in particular

info  2023-05-25T18:16:15.366 nodeos    producer_plugin.cpp:1222      plugin_startup       ] producer plugin:  plugin_startup() begin
warn  2023-05-25T18:16:19.003 read-0    executor.cpp:142              executor             ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform

This is nearly 4 seconds between producer_plugin::plugin_startup() and eosvmoc::executor() -- an incredible amount of time.

We also see

info  2023-05-25T18:16:19.157 nodeos    producer_plugin.cpp:1316      plugin_shutdown      ] exit shutdown
info  2023-05-25T18:16:19.157 nodeos    net_plugin.cpp:3908           plugin_shutdown      ] shutdown..
info  2023-05-25T18:16:19.157 nodeos    net_plugin.cpp:2722           plugin_shutdown      ] close 0 connections
warn  2023-05-25T18:16:19.158 read-2    executor.cpp:142              executor             ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform

This suggests the ROtrx threads are still being started (and running!) from producer_plugin after producer_plugin's plugin_shutdown has completed. But https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/plugins/producer_plugin/producer_plugin.cpp#L1375-L1378 But.. keep in mind that the plugin_shutdown() is probably coming from https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/plugins/producer_plugin/producer_plugin.cpp#L1363-L1364 I still don't have a good explanation for the mix up in ordering

bhazzard commented 1 year ago

We believe that this is likely a red herring that will be addressed by #1223