Closed matthewdarwin closed 1 year ago
Nodeos issues this message after already loading state (in heap mode) and then exits
This message is probably a red herring then as it's only a informational warning and doesn't cause nodeos to exit per se. When you see this error what kind of exit code does nodeos exit with?
Full log files have been sent privately.
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:15.366 nodeos http_plugin.cpp:378 add_async_handler ] add api url: /v1/net/connect
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:15.366 nodeos http_plugin.cpp:378 add_async_handler ] add api url: /v1/net/connections
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:15.366 nodeos http_plugin.cpp:378 add_async_handler ] add api url: /v1/net/disconnect
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:15.366 nodeos http_plugin.cpp:378 add_async_handler ] add api url: /v1/net/status
May 25 18:16:15 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:15.366 nodeos producer_plugin.cpp:1222 plugin_startup ] producer plugin: plugin_startup() begin
May 25 18:16:19 waxtest-ship11 nodeos[429697]: warn 2023-05-25T18:16:19.003 read-0 executor.cpp:142 executor ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
May 25 18:16:19 waxtest-ship11 nodeos[429697]: warn 2023-05-25T18:16:19.079 read-1 executor.cpp:142 executor ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:19.157 nodeos producer_plugin.cpp:1316 plugin_shutdown ] exit shutdown
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:19.157 nodeos net_plugin.cpp:3908 plugin_shutdown ] shutdown..
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:19.157 nodeos net_plugin.cpp:2722 plugin_shutdown ] close 0 connections
May 25 18:16:19 waxtest-ship11 nodeos[429697]: warn 2023-05-25T18:16:19.158 read-2 executor.cpp:142 executor ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
May 25 18:16:19 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:16:19.195 nodeos net_plugin.cpp:3912 plugin_shutdown ] exit shutdown
May 25 18:16:19 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, this could take a moment...
May 25 18:16:20 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, 3% complete...
May 25 18:16:49 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, 89% complete...
May 25 18:16:49 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, flushing buffers...
May 25 18:17:14 waxtest-ship11 nodeos[429697]: CHAINBASE: Writing "state" database file, complete.
May 25 18:17:16 waxtest-ship11 nodeos[429697]: info 2023-05-25T18:17:16.282 nodeos http_plugin.cpp:368 plugin_shutdown ] exit shutdown
May 25 18:17:16 waxtest-ship11 nodeos[429697]: error 2023-05-25T18:17:16.284 nodeos main.cpp:161 main ] 3170000 producer_exception: Producer exception
May 25 18:17:16 waxtest-ship11 nodeos[429697]: read-only threads failed to start. num_threads_started: 0, time_slept_ms: 1000ms
May 25 18:17:16 waxtest-ship11 nodeos[429697]: {"n":0,"t":1000}
May 25 18:17:16 waxtest-ship11 nodeos[429697]: nodeos producer_plugin.cpp:1283 plugin_startup
May 25 18:17:16 waxtest-ship11 nodeos[429697]: {}
May 25 18:17:16 waxtest-ship11 nodeos[429697]: nodeos producer_plugin.cpp:1296 plugin_startup
May 25 18:17:16 waxtest-ship11 systemd[1]: nodeos.service: Main process exited, code=exited, status=254/n/a
There is a lot to talk about here.
The warning https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/libraries/chain/webassembly/runtimes/eos-vm-oc/executor.cpp#L142-L144 was added late in the 2.0.0 development cycle as a paranoid sniff test against something else using the GS segment selector interfering with EOS VM OC. At the time, we were only aware of https://clang.llvm.org/docs/ShadowCallStack.html And that is expressly tested for here https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/libraries/chain/webassembly/runtimes/eos-vm-oc/executor.cpp#L25-L29 besides, that feature has been removed as of clang 9.0 anyways.
This sniff test assumes that upon start of the application (and now, start of a new thread), the GS segment selector is initialized to 0x00. I am not aware of this being documented behavior anywhere, rather at the time of implementation it was simply empirically observed.
When looking at Linux source, we can find,
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/process_64.c#n506
this initially looks like load_gs_index(0);
but looking closer there is more to it since,
if (static_cpu_has(X86_BUG_NULL_SEG)) {
/* Loading zero below won't clear the base. */
loadsegment(fs, __USER_DS);
load_gs_index(__USER_DS);
}
It turns out some older AMD CPUs ignore setting FS or GS to 0x00; AMD documents in their AMD64 spec that
When a null selector is loaded into FS or GS, the contents of the corresponding hidden descriptor register are not altered
and the kernel is looking for that sort of behavior with the X86_BUG_NULL_SEG
"bug" discovered via
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/common.c#n1692
(and for which apparently there is a new CPUID flag NullSelectClearsBase
that indicates 0x00 is not ignored)
https://en.wikipedia.org/wiki/CPUID#EAX=80000021h:_Extended_Feature_Identification_2
The TLDR is mostly that this "x86_64 GS register is not set as expected" warning should likely just be removed because there isn't really a guarantee of its behavior anyways. And it's not the source of failure.
As for the log, first off,
error 2023-05-25T18:17:16.284 nodeos main.cpp:161 main ] 3170000 producer_exception: Producer exception
read-only threads failed to start. num_threads_started: 0, time_slept_ms: 1000ms
{"n":0,"t":1000}
This is the true warning. For whatever reason, the 3 ROtrx threads didn't make enough progress in 1 second, which is considered an error.
The log entries have some peculiar timing. Notice in particular
info 2023-05-25T18:16:15.366 nodeos producer_plugin.cpp:1222 plugin_startup ] producer plugin: plugin_startup() begin
warn 2023-05-25T18:16:19.003 read-0 executor.cpp:142 executor ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
This is nearly 4 seconds between producer_plugin::plugin_startup()
and eosvmoc::executor()
-- an incredible amount of time.
We also see
info 2023-05-25T18:16:19.157 nodeos producer_plugin.cpp:1316 plugin_shutdown ] exit shutdown
info 2023-05-25T18:16:19.157 nodeos net_plugin.cpp:3908 plugin_shutdown ] shutdown..
info 2023-05-25T18:16:19.157 nodeos net_plugin.cpp:2722 plugin_shutdown ] close 0 connections
warn 2023-05-25T18:16:19.158 read-2 executor.cpp:142 executor ] x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform
This suggests the ROtrx threads are still being started (and running!) from producer_plugin after producer_plugin's plugin_shutdown has completed. But
https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/plugins/producer_plugin/producer_plugin.cpp#L1375-L1378
But.. keep in mind that the plugin_shutdown()
is probably coming from
https://github.com/AntelopeIO/leap/blob/f53faff7a95568e4c68e7b6d4ecfd9feafa27a99/plugins/producer_plugin/producer_plugin.cpp#L1363-L1364
I still don't have a good explanation for the mix up in ordering
We believe that this is likely a red herring that will be addressed by #1223
Occasionally nodeos will issue error message "x86_64 GS register is not set as expected. EOS VM OC may not run correctly on this platform". Seems to happen on next restart after kill -9 of nodeos.
Nodeos issues this message after already loading state (in heap mode) and then exits. Our systemd unit will restart nodeos and then it runs fine.
What's going on here? Why doesn't it check this before loading all the state? Wasting cycles loading RAM and then unloading RAM. What is this test doing?