eclipse-ecal / ecal

📦 eCAL - enhanced Communication Abstraction Layer. A high-performance publish-subscribe, client-server cross-platform middleware.
https://ecal.io
Apache License 2.0

Buffer memory not aligned in TCP mode (-> Capnproto issue) #1017

Closed chengguizi closed 1 year ago

chengguizi commented 1 year ago

Problem Description

I run into this issue on the receiver side when I am in TCP mode:

./ecal_sample_addressbook_rec_cb
terminate called after throwing an instance of 'kj::ExceptionImpl'
  what():  capnp/arena.c++:76: failed: expected reinterpret_cast<uintptr_t>(segment.begin()) % sizeof(void*) == 0 [7 == 0]; Detected unaligned data in Cap'n Proto message. Messages must be aligned to the architecture's word size. Yes, even on x86: Unaligned access is undefined behavior under the C/C++ language standard, and compilers can and do assume alignment for the purpose of optimizations. Unaligned access may lead to crashes or subtle corruption. For example, GCC will use SIMD instructions in optimizations, and those instrsuctions require alignment. If you really insist on taking your changes with unaligned data, compile the Cap'n Proto library with -DCAPNP_ALLOW_UNALIGNED to remove this check.
stack: 7fa7f7174ce5 7fa7f7174d36 7fa7f719e8b7 7fa7f71a106a 5628f81dbd53 5628f81dec50 5628f81e2432 5628f81e1c21 5628f81e1384 5628f81e096c 5628f81dfd01 7fa7f6ffc9cb 7fa7f6ff3f7e 7fa7f7003f70 7fa7f709d605 7fa7f6d3ede3 7fa7f6e52608 7fa7f6b78132
Aborted (core dumped)

After some debugging, I singled out the call that raises this exception: initMessageBuilderFromFlatArrayCopy, which I suppose checks for memory alignment.

      bool Deserialize(capnp::MallocMessageBuilder& msg_, const void* buffer_, size_t size_) const
      {
        kj::ArrayPtr<const capnp::word> words = kj::arrayPtr(reinterpret_cast<const capnp::word*>(buffer_), size_ / sizeof(capnp::word));
        kj::ArrayPtr<const capnp::word> rest = initMessageBuilderFromFlatArrayCopy(words, msg_);
        return(rest.size() == 0);
      }

This leads me to believe that eCAL does not properly align the receive buffer. It does not happen every time, but it happens often.
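For illustration, the failing check and one possible receiver-side workaround can be sketched without Cap'n Proto at all. This is only a sketch under stated assumptions: `is_aligned` and `copy_to_aligned_words` are hypothetical helper names, not part of eCAL or Cap'n Proto, and the 8-byte word size matches `capnp::word` on common 64-bit platforms. The idea is to copy the received bytes into word-aligned storage before handing them to the deserializer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// True if `ptr` lies on a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper: copies `size` bytes into a vector of 8-byte words.
// The vector's storage is suitably aligned for std::uint64_t, so the copy
// passes a word-alignment check even if `buffer` does not.
// Trailing bytes that do not fill a whole word are zero-padded.
inline std::vector<std::uint64_t> copy_to_aligned_words(const void* buffer, std::size_t size)
{
  std::vector<std::uint64_t> words((size + 7) / 8, 0);
  if (size > 0)
  {
    std::memcpy(words.data(), buffer, size);
  }
  return words;
}
```

The extra copy costs time and memory on every message, so it is a workaround on the subscriber side rather than a fix for the transport layer.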

How to reproduce

First, compile and run https://github.com/chengguizi/ecal/tree/memory-unaligned-tcp/samples/cpp/capnp/addressbook_snd

./ecal_sample_addressbook_snd
CTCPReaderLayer - TCPPubSub (Error) -Publisher ?: Error while waiting for subsriber: Operation aborted.
our string: (people = [(id = 123, name = "Alice", email = "alice@example.com", phones = [(number = "555-1212", type = mobile)], employment = (school = "MIT"), weight = 60.4, data = "\037\000\241\264\024T"), (id = 456, name = "Bob", email = "bob@example.com", phones = [(number = "555-4567", type = home), (number = "555-7654", type = work)], employment = (unemployed = void), weight = 80.8)])

our string: (people = [(id = 123, name = "Alice", email = "alice@example.com", phones = [(number = "555-1212", type = mobile)], employment = (school = "MIT"), weight = 60.4, data = "\037\000\241\264\024T"), (id = 456, name = "Bob", email = "bob@example.com", phones = [(number = "555-4567", type = home), (number = "555-7654", type = work)], employment = (unemployed = void), weight = 80.8)])

Second, compile and run https://github.com/chengguizi/ecal/tree/memory-unaligned-tcp/samples/cpp/capnp/addressbook_rec_cb

./ecal_sample_addressbook_rec_cb
terminate called after throwing an instance of 'kj::ExceptionImpl'
  what():  capnp/arena.c++:76: failed: expected reinterpret_cast<uintptr_t>(segment.begin()) % sizeof(void*) == 0 [4 == 0]; Detected unaligned data in Cap'n Proto message. Messages must be aligned to the architecture's word size. Yes, even on x86: Unaligned access is undefined behavior under the C/C++ language standard, and compilers can and do assume alignment for the purpose of optimizations. Unaligned access may lead to crashes or subtle corruption. For example, GCC will use SIMD instructions in optimizations, and those instrsuctions require alignment. If you really insist on taking your changes with unaligned data, compile the Cap'n Proto library with -DCAPNP_ALLOW_UNALIGNED to remove this check.
stack: 7f9b76de7ce5 7f9b76de7d36 7f9b76e118b7 7f9b76e1406a 5575f1812d53 5575f1815c50 5575f1819432 5575f1818c21 5575f1818384 5575f181796c 5575f1816d01 7f9b76c6f9cb 7f9b76c66f7e 7f9b76c76f70 7f9b76d10605 7f9b769b1de3 7f9b76ac5608 7f9b767eb132
Aborted (core dumped)

You may need the following lines in CMakeLists.txt:

# set( CMAKE_BUILD_TYPE Release)
# set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG")

How did you get eCAL?

Ubuntu PPA (apt-get)

Environment

Ubuntu 20.04 ARM64

eCAL System Information

------------------------- SYSTEM ---------------------------------
Version                  : v5.11.3 (2023-02-17 09:13:01 +0100)
Platform                 : linux

------------------------- CONFIGURATION --------------------------
Default INI              : /etc/ecal/ecal.ini

------------------------- NETWORK --------------------------------
Host name                : huimin-Vostro-5320
Network mode             : local
Network ttl              : 2
Network sndbuf           : 5 MByte
Network rcvbuf           : 5 MByte
Multicast group          : 239.0.0.1
Multicast mask           : 0.0.0.15
Multicast ports          : 14000 - 14010
Multicast join all IFs   : off
Bandwidth limit (udp)    : not limited

------------------------- TIME -----------------------------------
Synchronization realtime : "ecaltime-localtime"
Synchronization replay   :
State                    :  synchronized
Master / Slave           :  Master
Status (Code)            : "everything is fine." (0)

------------------------- PUBLISHER LAYER DEFAULTS ---------------
Layer Mode INPROC        : auto
Layer Mode SHM           : off
Layer Mode TCP           : auto
Layer Mode UDP MC        : off

------------------------- SUBSCRIPTION LAYER DEFAULTS ------------
Layer Mode INPROC        : on
Layer Mode SHM           : on
Layer Mode TCP           : on
Layer Mode UDP MC        : on
Npcap UDP Reciever       : off

FlorianReimold commented 1 year ago

Hi @chengguizi,

Thanks for reporting this, and especially for providing the source code. That should make it easy for us to reproduce and debug the issue. But are you sure you are running on ARM64, as you mentioned? One of the error messages clearly states x86, and a quick Google search told me that the "Vostro-5320" has a regular Intel processor as well.

Kind regards Florian

chengguizi commented 1 year ago

Ah, sorry. I run on both ARM64 and AMD64, and I tested on both platforms!

chengguizi commented 1 year ago

Hi, just checking in: were you able to reproduce this on your end?

FlorianReimold commented 1 year ago

Hi @chengguizi,

I haven't tried to reproduce it yet, but I checked the source code to see what could cause the alignment issue. I think it is the following line:

https://github.com/eclipse-ecal/ecal/blob/4ead87d7f7c0fa56b4f87ffeeb41b9ef64688c92/ecal/core/src/readwrite/ecal_writer_tcp.cpp#L135

with:

After the header consisting of those 3 parts (ecal_magic, header_length, proto_header), the user payload is appended directly. There is no padding, so the Cap'n Proto data is basically never aligned. It should be easy to add some padding, but we need to make sure that the proto header can still be parsed.
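As a rough sketch of the padding idea (not the code from the actual fix), the number of padding bytes needed after a header can be computed as below. `padding_for` and `payload_offset` are hypothetical names, and the sketch assumes the receive buffer itself starts at an aligned address, so that aligning the payload offset also aligns the payload pointer.

```cpp
#include <cassert>
#include <cstddef>

// Number of padding bytes to append after a header of `header_size` bytes
// so that the payload starts at a multiple of `alignment` bytes, counted
// from the start of the buffer. `alignment` must be non-zero.
constexpr std::size_t padding_for(std::size_t header_size, std::size_t alignment)
{
  return (alignment - header_size % alignment) % alignment;
}

// Offset of the payload from the start of the buffer once padding is added.
inline std::size_t payload_offset(std::size_t header_size, std::size_t alignment)
{
  assert(alignment != 0);
  return header_size + padding_for(header_size, alignment);
}
```

For example, a 13-byte header with 8-byte alignment needs 3 padding bytes, so the payload would start at offset 16. The receiver has to know about the padding to skip it, which is why the header must remain parseable.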

chengguizi commented 1 year ago

Ok! That makes sense :)

FlorianReimold commented 1 year ago

@chengguizi: I created a fix that should solve this issue. It is 100% compatible, but unfortunately that also makes it ugly. Would you be able to test it and report back whether it fixes the issue for you?

Here is the branch: https://github.com/eclipse-ecal/ecal/tree/feature/tcp_payload_alignment

FlorianReimold commented 1 year ago

@chengguizi: The branch has already been merged to master by accident. Could you still test it?

chengguizi commented 1 year ago

Hi @FlorianReimold, thanks for the effort in fixing it! What is the best way to test this? I have been using the binaries all along.

Do I have to uninstall the binary, compile the branch from source, and then install to the system in order to test?

FlorianReimold commented 1 year ago

Can you grab the binary from our CI? You need to be logged in with your GitHub account, but then you should be able to just download a zip file named ubuntu-debian, which contains the eCAL .deb installer. Here is a link to the Ubuntu 20.04 build: https://github.com/eclipse-ecal/ecal/actions/runs/4627122057

It's only for amd64, as the arm binaries only come from the launchpad PPA. I hope that's ok.

After unzipping, you can install it with: sudo dpkg -i eCAL-5.12.0-Linux.deb.

When you later want to go back to the version from our PPA, you can just remove the installed eCAL and install it with apt again:

sudo apt remove ecal
sudo apt install ecal

KerstinKeller commented 1 year ago

@chengguizi @FlorianReimold can this issue be closed?

FlorianReimold commented 1 year ago

Yes, let's close it. I hoped that chengguizi would try out the fix, but he didn't.