eclipse-zenoh / zenoh

zenoh unifies data in motion, data in-use, data at rest and computations. It carefully blends traditional pub/sub with geo-distributed storages, queries and computations, while retaining a level of time and space efficiency that is well beyond any of the mainstream stacks.
https://zenoh.io

Core dump when running the 2nd instance of zenohd on the same subnet #68

Closed heyong4725 closed 3 months ago

heyong4725 commented 3 years ago

ubuntu@ecs-zenoh-yhe-01:~/zenoh/target/release$ ./zenohd --version
The zenoh router v0.5.0-beta.5-224-g87cf763-modified built with rustc 1.51.0-nightly (2987785df 2020-12-28)

Start the first zenohd instance with RUST_LOG=debug flag, no issue:

[2021-02-19T15:14:33Z DEBUG zenoh_router::routing::pubsub] Register subscription /@/router/3C24E4AEED654EC48C521E191957EB19/plugin/storages/backend/ for face 0
[2021-02-19T15:14:33Z DEBUG zenoh_router::routing::pubsub] Register router subscription /@/router/3C24E4AEED654EC48C521E191957EB19/plugin/storages/backend/ (router: 3C24E4AEED654EC48C521E191957EB19)
[2021-02-19T15:14:33Z DEBUG zenoh_router::routing::pubsub] Register peer subscription /@/router/3C24E4AEED654EC48C521E191957EB19/plugin/storages/backend/* (peer: 3C24E4AEED654EC48C521E191957EB19)
[2021-02-19T15:14:33Z DEBUG zenoh_router::routing::router] New face 2
[2021-02-19T15:14:33Z INFO tide::server] Server listening on http://0.0.0.0:8000

Then start the second zenohd instance with the RUST_LOG=debug flag on a different machine on the same subnet; it core dumps:

ubuntu@ecs-zenoh-yhe-01:~/zenoh/target/release$ RUST_LOG=debug ./zenohd
[2021-02-19T15:15:00Z DEBUG zenohd] zenohd v0.5.0-beta.5-224-g87cf763-modified built with rustc 1.51.0-nightly (2987785df 2020-12-28)
[2021-02-19T15:15:00Z DEBUG zenoh_router::plugins] Plugins to load: []
[2021-02-19T15:15:00Z DEBUG zenoh_util::libloader] Search for libraries libzplugin*.so to load in ["/usr/local/lib", "/usr/lib", "/home/ubuntu/.zenoh/lib", "/home/ubuntu/zenoh/target/release", "/home/ubuntu/zenoh/target/release"]
[2021-02-19T15:15:00Z DEBUG zenoh_util::lib_loader] Do not load plugin storages from "/home/ubuntu/zenoh/target/release/libzplugin_storages.so" : already loaded.
[2021-02-19T15:15:00Z DEBUG zenoh_util::lib_loader] Do not load plugin rest from "/home/ubuntu/zenoh/target/release/libzplugin_rest.so" : already loaded.
[2021-02-19T15:15:00Z DEBUG zenoh_util::lib_loader] Do not load plugin storages from "/home/ubuntu/zenoh/target/release/libzplugin_storages.so" : already loaded.
[2021-02-19T15:15:00Z DEBUG zenoh_router::plugins] Plugin storages loaded from /usr/lib/libzplugin_storages.so
[2021-02-19T15:15:00Z DEBUG zenoh_router::plugins] Plugin rest loaded from /home/ubuntu/zenoh/target/release/libzplugin_rest.so
[2021-02-19T15:15:00Z DEBUG zenohd] Config: {"multicast_scouting": "true", "peer": "", "listener": "tcp/0.0.0.0:7447", "mode": "router", "add_timestamp": "true"}
[2021-02-19T15:15:00Z INFO zenoh_router::runtime] Using PID: E354454D56274CC4B6EB457750C2D651
[2021-02-19T15:15:00Z DEBUG zenoh_router::routing::network] [Routers network] Add node (self) E354454D56274CC4B6EB457750C2D651
[2021-02-19T15:15:00Z DEBUG zenoh_router::routing::network] [Peers network] Add node (self) E354454D56274CC4B6EB457750C2D651
[2021-02-19T15:15:00Z DEBUG zenoh_router::runtime::orchestrator] Listener tcp/0.0.0.0:7447 added
[2021-02-19T15:15:00Z INFO zenoh_router::runtime::orchestrator] zenohd can be reached on tcp/10.1.101.216:7447
[2021-02-19T15:15:00Z INFO zenoh_router::runtime::orchestrator] zenohd can be reached on tcp/172.17.0.1:7447
[2021-02-19T15:15:00Z DEBUG zenoh_router::runtime::orchestrator] UDP port bound to 224.0.0.224:7447
[2021-02-19T15:15:00Z DEBUG zenoh_router::runtime::orchestrator] Joined multicast group 224.0.0.224
[2021-02-19T15:15:00Z INFO zenoh_router::runtime::orchestrator] zenohd listening scout messages on 224.0.0.224:7447
[2021-02-19T15:15:00Z DEBUG zenoh_router::runtime::orchestrator] UDP port bound to 10.1.101.216:44347
[2021-02-19T15:15:00Z DEBUG zenoh_router::plugins] Start plugin storages
[2021-02-19T15:15:00Z DEBUG zenoh_router::runtime::orchestrator] Waiting for UDP datagram...
[2021-02-19T15:15:00Z DEBUG zenoh_router::plugins] Start plugin rest
thread 'async-std/runtime' panicked at 'range end index 94126439026992 out of range for slice of length 16', zenoh-protocol/src/core/mod.rs:184:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)
ubuntu@ecs-zenoh-yhe-01:~/zenoh/target/release$

heyong4725 commented 3 years ago

I don't know why GitHub struck out those lines... they are the core dump traces.

heyong4725 commented 3 years ago

With the latest build (02/19/2021, after cargo clean, cargo update, cargo build --release), the 2nd zenohd instance still core dumped:

ubuntu@ecs-zenoh-yhe-01:~/eclipse-zenoh/zenoh/target/release$ git fetch origin master
From https://github.com/eclipse-zenoh/zenoh

Mallets commented 3 years ago

@heyong4725 do you still have this problem with the latest version on master?

heyong4725 commented 3 years ago

@Mallets I haven't had a chance to try it yet. I will try once I have the environment set up.

OlivierHecart commented 3 years ago

This is probably due to incompatible storages and/or rest plugins. Maybe @JEnoch or @gabrik can provide more information, as the former investigated those problems and the latter ran into them.

gabrik commented 3 years ago

Hi @heyong4725, I did indeed have a very similar issue; the cause was an old plugin.

I see this line in your log: "Plugin storages loaded from /usr/lib/libzplugin_storages.so". Maybe that plugin is an old one and causes the crash.

Can you try to remove that file and restart zenoh?
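To spot stale copies like this one before removing anything, here is a minimal sketch that scans a list of directories for files matching the libzplugin*.so pattern shown in the log and reports every plugin name found in more than one place. The function name find_duplicate_plugins is hypothetical; this is not part of zenoh and does not reproduce how zenohd itself searches for plugins.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not a zenoh API): for each plugin file name,
/// returns every location where a copy was found. Only names that
/// appear in more than one directory are kept.
fn find_duplicate_plugins(search_dirs: &[&Path]) -> BTreeMap<String, Vec<PathBuf>> {
    let mut found: BTreeMap<String, Vec<PathBuf>> = BTreeMap::new();
    for dir in search_dirs {
        // Directories that do not exist or cannot be read are skipped.
        let entries = match fs::read_dir(dir) {
            Ok(entries) => entries,
            Err(_) => continue,
        };
        for entry in entries.flatten() {
            let name = entry.file_name().to_string_lossy().into_owned();
            // Same pattern as in the log: libzplugin*.so
            if name.starts_with("libzplugin") && name.ends_with(".so") {
                let path = entry.path();
                let copies = found.entry(name).or_default();
                // A directory may be listed twice in the search path
                // (as in the log), so avoid counting the same file twice.
                if !copies.contains(&path) {
                    copies.push(path);
                }
            }
        }
    }
    found.retain(|_, copies| copies.len() > 1);
    found
}
```

Calling it with the directories from the "Search for libraries" log line would list libzplugin_storages.so under both /usr/lib and the target/release directory, which is the situation described above.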

heyong4725 commented 3 years ago

Yes, indeed. This is due to an old plugin. After I removed /usr/lib/libzplugin_storages.so, there was no more core dump.

Is there any way to avoid the core dump?

heyong4725 commented 3 years ago

"Is there any way to avoid the core dump?": what I mean here is whether the software can detect this kind of error and, with fail-safe/fail-operational capability, gracefully give a warning and continue, without a core dump.

JEnoch commented 3 years ago

@heyong4725: the root of the problem is that Rust doesn't have a stable ABI, and probably won't for a while. The implication is that there is no guarantee that a Rust type compiled in zenohd has the same memory layout when compiled in a plugin/backend. If the layouts differ, zenohd might exchange incompatible data with the plugins/backends, leading to unpredictable behaviour, including core dumps.
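As a small illustration of the layout point, here is a sketch comparing a struct with the default Rust representation to the same struct marked #[repr(C)]. The type names are made up for the example. The expected size of 12 bytes for the #[repr(C)] struct assumes a common target where u32 is 4-byte aligned; no size is claimed for the default representation, since the compiler is free to reorder its fields.

```rust
use std::mem::{align_of, size_of};

/// Default (Rust) representation: the compiler may reorder fields and
/// gives no guarantee that the layout is the same in another build.
#[allow(dead_code)]
struct DefaultRepr {
    a: u8,
    b: u32,
    c: u16,
}

/// C representation: fields are laid out in declaration order,
/// following the platform's C padding and alignment rules.
#[allow(dead_code)]
#[repr(C)]
struct CRepr {
    a: u8,
    b: u32,
    c: u16,
}

/// Returns (size, alignment) for DefaultRepr, then for CRepr.
fn layouts() -> [(usize, usize); 2] {
    [
        (size_of::<DefaultRepr>(), align_of::<DefaultRepr>()),
        (size_of::<CRepr>(), align_of::<CRepr>()),
    ]
}
```

Only the second entry is something a host and a separately compiled plugin could both rely on. Standard library types such as String and Vec do not have a guaranteed layout either, so marking your own structs #[repr(C)] is not enough when those types cross the plugin boundary.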

I don't think it's feasible to detect and recover from such incompatible memory representations of types at runtime.

Instead, we tried to ensure that types compiled in both zenohd and the plugins/backends have the same memory representation, by:

  1. forcing the Rust toolchain to be the same when building zenohd and the plugins/backends (see the rust-toolchain files). But that was not enough, probably because some dependencies used by both may be at different versions.
  2. ensuring that zenoh and the plugins/backends don't use different versions of a dependency (see the committed Cargo.lock files that list the dependencies to be used). And that seems to work so far...

But the result is that the plugins/backends must have the exact same version as zenohd (we'll make sure each release uses the same toolchain and dependencies for all of them). Still, I just saw this comment, which makes me think that might not be enough:

ABI and even layout can change between any two compiler invocations even if they are 100% identical

We probably need to investigate a more sustainable solution. I had a glance at abi_stable, but it seems to bring a lot of constraints, including a redefinition of the std types (RString, RVec, RSlice...).

heyong4725 commented 3 years ago

@JEnoch, thanks for the detailed analysis. I like your thinking on a more sustainable solution.

This kind of ABI/interface type incompatibility must be a common issue. I am wondering if zenoh needs to use some kind of intermediate representation for this, similar to message passing.

In the WebAssembly ecosystem, especially the WASI subgroup, there is an effort on this called "Interface Types". Below are a few links for you to evaluate/investigate :)
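To make the "intermediate representation" idea concrete, here is a minimal sketch in which the two sides exchange a flat byte buffer with an explicitly defined layout instead of sharing Rust types. The wire format and the names encode_sample and decode_sample are invented for this example and are not zenoh's protocol. The decoder uses bounds-checked slicing, so a malformed buffer yields None rather than an out-of-range slice panic.

```rust
use std::convert::TryInto;

/// Encodes a key and a value with an explicit, compiler-independent layout
/// (hypothetical format): [key length: u32 LE][key bytes, UTF-8][value: u64 LE].
/// Keys longer than u32::MAX bytes are not supported by this sketch.
fn encode_sample(key: &str, value: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + key.len() + 8);
    buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
    buf.extend_from_slice(key.as_bytes());
    buf.extend_from_slice(&value.to_le_bytes());
    buf
}

/// Decodes a buffer produced by `encode_sample`.
/// Returns None if the buffer is truncated, has trailing bytes,
/// or does not contain valid UTF-8 in the key.
fn decode_sample(buf: &[u8]) -> Option<(String, u64)> {
    let len_bytes: [u8; 4] = buf.get(..4)?.try_into().ok()?;
    let key_len = u32::from_le_bytes(len_bytes) as usize;
    let key_end = 4usize.checked_add(key_len)?;
    let key = std::str::from_utf8(buf.get(4..key_end)?).ok()?;
    let value_end = key_end.checked_add(8)?;
    let value_bytes: [u8; 8] = buf.get(key_end..value_end)?.try_into().ok()?;
    if buf.len() != value_end {
        return None;
    }
    Some((key.to_owned(), u64::from_le_bytes(value_bytes)))
}
```

The trade-off is an extra copy and an encode/decode step on every exchange, in return for a boundary that no longer depends on how each side's compiler laid out its types.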

https://bytecodealliance.org/articles/1-year-update
https://github.com/WebAssembly/WASI/blob/main/docs/witx.md
https://www.youtube.com/watch?v=LCA9NnH7DxE

heyong4725 commented 3 years ago

When zenoh loads shareable plugins (i.e. backend libraries, future zenoh flow operators), I think there might be a need for signatures/authentication, etc. I wonder if https://crates.io/crates/minisign could be used for this purpose. It is still related to this issue: making sure all these components are aligned and avoiding runtime core dumps.

Mallets commented 3 months ago

The plugin API has been updated in https://github.com/eclipse-zenoh/zenoh/releases/tag/0.11.0, which takes care of checking the ABI compatibility of plugins. Closing this issue since it should now be solved.
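For readers wondering what a compatibility check of this kind can look like in general, here is a minimal sketch. It is not the zenoh 0.11.0 implementation: BuildInfo, check_compatibility and partition_plugins are hypothetical names, and the sketch assumes each plugin can report the compiler and zenoh versions it was built with before any Rust types are exchanged. A mismatch produces a warning and the plugin is skipped, instead of the process aborting.

```rust
/// Hypothetical build description that the host and each plugin would embed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct BuildInfo {
    rustc_version: String,
    zenoh_version: String,
}

/// Compares the host's build information with a plugin's.
/// Returns a human-readable reason on mismatch.
fn check_compatibility(host: &BuildInfo, plugin: &BuildInfo) -> Result<(), String> {
    if host.rustc_version != plugin.rustc_version {
        return Err(format!(
            "rustc mismatch: host built with '{}', plugin built with '{}'",
            host.rustc_version, plugin.rustc_version
        ));
    }
    if host.zenoh_version != plugin.zenoh_version {
        return Err(format!(
            "zenoh mismatch: host is '{}', plugin targets '{}'",
            host.zenoh_version, plugin.zenoh_version
        ));
    }
    Ok(())
}

/// Splits candidate plugins into names that are safe to start and
/// warnings for the ones that must be skipped.
fn partition_plugins<'a>(
    host: &BuildInfo,
    plugins: &'a [(&'a str, BuildInfo)],
) -> (Vec<&'a str>, Vec<String>) {
    let mut loadable = Vec::new();
    let mut warnings = Vec::new();
    for (name, info) in plugins {
        match check_compatibility(host, info) {
            Ok(()) => loadable.push(*name),
            Err(reason) => warnings.push(format!("skipping plugin {}: {}", name, reason)),
        }
    }
    (loadable, warnings)
}
```

As noted earlier in the thread, identical version strings narrow the risk but do not strictly guarantee identical type layouts, so a check like this is a safeguard rather than a proof of compatibility.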