direct-code-execution / ns-3-dce

Run real programs in the discrete time simulator ns3
http://www.nsnam.org/projects/direct-code-execution/

dce-mptcp-lte-wifi(-v6).cc example fails with segmentation fault #122

Closed tomhenderson closed 2 years ago

tomhenderson commented 2 years ago

Description of the problem

As reported in the Ubuntu 20.04 issue, both dce-mptcp-lte-wifi-v6.cc and dce-mptcp-lte-wifi.cc exit with an error. For the IPv4 variant:

assert failed. cond="address.CheckCompatible (GetType (), 6)", +0.000000000s 0 file=../src/network/utils/mac48-address.cc, line=129
terminate called without an active exception

On the IPv4 variant, if the --disLte=1 argument is passed, then the program runs successfully. There isn't such an option in the IPv6 program.

Running the program through gdb does not yield any immediate insight; it does not break at the point of the assertion.

Running the IPv4 program through valgrind yields:

assert failed. cond="address.CheckCompatible (GetType (), 6)", +0.000000000s 0 file=../src/network/utils/mac48-address.cc, line=129
==31966== Thread 23:
==31966== Syscall param rt_sigaction(act->sa_mask) points to uninitialised byte(s)
==31966==    at 0xCA2A48E: __libc_sigaction (sigaction.c:62)
==31966==    by 0x5821EA4: ns3::FatalImpl::FlushStreams() (fatal-impl.cc:165)
==31966==    by 0x654AFA3: ns3::Mac48Address::ConvertFrom(ns3::Address const&) (mac48-address.cc:129)
==31966==    by 0x50D5CEC: ns3::KernelSocketFdFactory::NotifyDeviceStateChangeTask(ns3::Ptr<ns3::NetDevice>) (kernel-socket-fd-factory.cc:472)
==31966==    by 0x50D668C: ns3::KernelSocketFdFactory::NotifyAddDeviceTask(ns3::Ptr<ns3::NetDevice>) (kernel-socket-fd-factory.cc:553)
==31966==    by 0x50DCA77: ns3::EventImpl* ns3::MakeEvent<void (ns3::KernelSocketFdFactory::*)(ns3::Ptr<ns3::NetDevice>), ns3::KernelSocketFdFactory*, ns3::Ptr<ns3::NetDevice> >(void (ns3::KernelSocketFdFactory::*)(ns3::Ptr<ns3::NetDevice>), ns3::KernelSocketFdFactory*, ns3::Ptr<ns3::NetDevice>)::EventMemberImpl1::Notify() (make-event.h:405)
==31966==    by 0x576D004: ns3::EventImpl::Invoke() (event-impl.cc:51)
==31966==    by 0x50D5E9A: ns3::KernelSocketFdFactory::ScheduleTaskTrampoline(void*) (kernel-socket-fd-factory.cc:488)
==31966==    by 0x504A143: ns3::TaskManager::Trampoline(void*) (task-manager.cc:275)
==31966==    by 0x5046393: ns3::PthreadFiberManager::Run(void*) (pthread-fiber-manager.cc:402)
==31966==    by 0xCA206B9: start_thread (pthread_create.c:333)
==31966==    by 0xCD3D51C: clone (clone.S:109)
==31966==  Address 0x13ebf948 is on thread 23's stack
==31966== 
==31966== Syscall param rt_sigaction(act->sa_mask) points to uninitialised byte(s)
==31966==    at 0xCA2A48E: __libc_sigaction (sigaction.c:62)
==31966==    by 0x5821F28: ns3::FatalImpl::FlushStreams() (fatal-impl.cc:179)
==31966==    by 0x654AFA3: ns3::Mac48Address::ConvertFrom(ns3::Address const&) (mac48-address.cc:129)
==31966==    by 0x50D5CEC: ns3::KernelSocketFdFactory::NotifyDeviceStateChangeTask(ns3::Ptr<ns3::NetDevice>) (kernel-socket-fd-factory.cc:472)
==31966==    by 0x50D668C: ns3::KernelSocketFdFactory::NotifyAddDeviceTask(ns3::Ptr<ns3::NetDevice>) (kernel-socket-fd-factory.cc:553)
==31966==    by 0x50DCA77: ns3::EventImpl* ns3::MakeEvent<void (ns3::KernelSocketFdFactory::*)(ns3::Ptr<ns3::NetDevice>), ns3::KernelSocketFdFactory*, ns3::Ptr<ns3::NetDevice> >(void (ns3::KernelSocketFdFactory::*)(ns3::Ptr<ns3::NetDevice>), ns3::KernelSocketFdFactory*, ns3::Ptr<ns3::NetDevice>)::EventMemberImpl1::Notify() (make-event.h:405)
==31966==    by 0x576D004: ns3::EventImpl::Invoke() (event-impl.cc:51)
==31966==    by 0x50D5E9A: ns3::KernelSocketFdFactory::ScheduleTaskTrampoline(void*) (kernel-socket-fd-factory.cc:488)
==31966==    by 0x504A143: ns3::TaskManager::Trampoline(void*) (task-manager.cc:275)
==31966==    by 0x5046393: ns3::PthreadFiberManager::Run(void*) (pthread-fiber-manager.cc:402)
==31966==    by 0xCA206B9: start_thread (pthread_create.c:333)
==31966==    by 0xCD3D51C: clone (clone.S:109)
==31966==  Address 0x13ebf948 is on thread 23's stack
==31966== 
terminate called without an active exception
==31966== 
==31966== Process terminating with default action of signal 6 (SIGABRT)
==31966==    at 0xCC6B438: raise (raise.c:54)
==31966==    by 0xCC6D039: abort (abort.c:89)
==31966==    by 0xC51084C: __gnu_cxx::__verbose_terminate_handler() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
==31966==    by 0xC50E6B5: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
==31966==    by 0xC50E700: std::terminate() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
==31966==    by 0x654AFA8: ns3::Mac48Address::ConvertFrom(ns3::Address const&) (mac48-address.cc:129)
==31966==    by 0x50D5CEC: ns3::KernelSocketFdFactory::NotifyDeviceStateChangeTask(ns3::Ptr<ns3::NetDevice>) (kernel-socket-fd-factory.cc:472)
==31966==    by 0x50D668C: ns3::KernelSocketFdFactory::NotifyAddDeviceTask(ns3::Ptr<ns3::NetDevice>) (kernel-socket-fd-factory.cc:553)
==31966==    by 0x50DCA77: ns3::EventImpl* ns3::MakeEvent<void (ns3::KernelSocketFdFactory::*)(ns3::Ptr<ns3::NetDevice>), ns3::KernelSocketFdFactory*, ns3::Ptr<ns3::NetDevice> >(void (ns3::KernelSocketFdFactory::*)(ns3::Ptr<ns3::NetDevice>), ns3::KernelSocketFdFactory*, ns3::Ptr<ns3::NetDevice>)::EventMemberImpl1::Notify() (make-event.h:405)
==31966==    by 0x576D004: ns3::EventImpl::Invoke() (event-impl.cc:51)
==31966==    by 0x50D5E9A: ns3::KernelSocketFdFactory::ScheduleTaskTrampoline(void*) (kernel-socket-fd-factory.cc:488)
==31966==    by 0x504A143: ns3::TaskManager::Trampoline(void*) (task-manager.cc:275)

Running the IPv6 program through valgrind shows:

==32012== Memcheck, a memory error detector
==32012== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==32012== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==32012== Command: dce-mptcp-lte-wifi-v6
==32012== 
==32012== Jump to the invalid address stated on the next line
==32012==    at 0xFECABFE: ???
==32012==    by 0x52A3115: ns3::DlmLoaderFactory::Create(int, char**, char**) (dlm-loader-factory.cc:143)
==32012==    by 0x5323106: ns3::LinuxSocketFdFactory::NotifyNewAggregate() (linux-socket-fd-factory.cc:49)
==32012==    by 0x59DE3EF: ns3::Object::AggregateObject(ns3::Ptr<ns3::Object>) (object.cc:313)
==32012==    by 0x5304FE0: ns3::DceManagerHelper::Install(ns3::Ptr<ns3::Node>) (dce-manager-helper.cc:125)
==32012==    by 0x5304C33: ns3::DceManagerHelper::Install(ns3::NodeContainer) (dce-manager-helper.cc:104)
==32012==    by 0x41A06C: main (dce-mptcp-lte-wifi-v6.cc:168)
==32012==  Address 0xfecabfe is not stack'd, malloc'd or (recently) free'd
==32012== 
==32012== 
==32012== Process terminating with default action of signal 11 (SIGSEGV)
==32012==  Access not within mapped region at address 0xFECABFE
==32012==    at 0xFECABFE: ???
==32012==    by 0x52A3115: ns3::DlmLoaderFactory::Create(int, char**, char**) (dlm-loader-factory.cc:143)
==32012==    by 0x5323106: ns3::LinuxSocketFdFactory::NotifyNewAggregate() (linux-socket-fd-factory.cc:49)
==32012==    by 0x59DE3EF: ns3::Object::AggregateObject(ns3::Ptr<ns3::Object>) (object.cc:313)
==32012==    by 0x5304FE0: ns3::DceManagerHelper::Install(ns3::Ptr<ns3::Node>) (dce-manager-helper.cc:125)
==32012==    by 0x5304C33: ns3::DceManagerHelper::Install(ns3::NodeContainer) (dce-manager-helper.cc:104)
==32012==    by 0x41A06C: main (dce-mptcp-lte-wifi-v6.cc:168)
==32012==  If you believe this happened as a result of a stack
==32012==  overflow in your program's main thread (unlikely but
==32012==  possible), you can try to increase the size of the
==32012==  main thread stack using the --main-stacksize= flag.
==32012==  The main thread stack size used in this run was 16003072.
==32012== 
==32012== HEAP SUMMARY:
==32012==     in use at exit: 1,333,023 bytes in 16,131 blocks
==32012==   total heap usage: 42,601 allocs, 26,470 frees, 3,336,502 bytes allocated
==32012== 
==32012== LEAK SUMMARY:
==32012==    definitely lost: 0 bytes in 0 blocks
==32012==    indirectly lost: 0 bytes in 0 blocks
==32012==      possibly lost: 0 bytes in 0 blocks
==32012==    still reachable: 1,333,023 bytes in 16,131 blocks
==32012==         suppressed: 0 bytes in 0 blocks
==32012== Rerun with --leak-check=full to see details of leaked memory
==32012== 
==32012== For counts of detected and suppressed errors, rerun with: -v
==32012== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

ParthPratim commented 2 years ago

From what I could debug, the IPv6 example hits a SIGSEGV rather than the assertion error because of a problem in elf-loader. I could reproduce the same crash with the dce-iperf example by setting DlmLoader as the default loader. DlmLoader uses the libvdl.so file from elf-loader to assign Lmid namespaces to the other shared objects that are loaded using dlmopen. There seems to be some problem with elf-loader, but gdb cannot read the symbols from it.

Also, commenting out the line in the example that sets DlmLoader as the default loader makes it fail with an assertion error similar to the one in the v4 example.

I'll try to dig into it further.

tomhenderson commented 2 years ago

I think I figured out the problem on the Address assert.

LTE uses Mac64Address (EUI-64 address type). However, the DCE code gets the address of this device and tries to convert it to a Mac48Address.

In KernelSocketFdFactory::NotifyDeviceStateChangeTask there is this statement:

Mac48Address ad = Mac48Address::ConvertFrom (device->GetAddress ());

The address is type 3 (Mac64Address) length 8. If we enable NS_LOG=Mac48Address on this example, we see:

+0.000000000s 3 Mac48Address:ConvertFrom(03-08-00:00:00:00:00:00:00:01)

which is an LTE device MAC address.
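The failure mode can be illustrated with a small stand-alone model. This is only a sketch: `ToyAddress`, `ToyCheckCompatible`, and the two constants are hypothetical names, not the ns-3 classes, and the rule shown (type tag and length must both match) is a simplification of the real check. The type tag 3 and length 8 are the values in the log line above; the tag 2 for Mac48Address is the value reported later in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for an ns-3 Address: a type tag plus a byte length.
// This is NOT the ns-3 class, only a model of the two fields involved.
struct ToyAddress {
    std::uint8_t type;    // per-class type tag
    std::uint8_t length;  // number of address bytes
};

// Hypothetical constants mirroring the tag values reported in this thread.
constexpr std::uint8_t kToyMac48Type = 2;
constexpr std::uint8_t kToyMac64Type = 3;

// Simplified compatibility rule (assumption): both the type tag and the
// length must match what the target class expects.
bool ToyCheckCompatible(const ToyAddress& a, std::uint8_t type, std::uint8_t length) {
    return a.type == type && a.length == length;
}
```

Under this model, an address tagged `03-08` fails a check against type 2 and length 6, which corresponds to the `CheckCompatible (GetType (), 6)` condition in the assert message.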

Tommaso changed this address to 64 bit from 48 bit in fixing https://www.nsnam.org/bugzilla/show_bug.cgi?id=2768.

commit af5691366c66bea554e442fa59afbfdada9e834c
Author: Tommaso Pecorella <tommaso.pecorella@unifi.it>
Date:   Mon Jan 29 21:29:02 2018 -0600

    lte: (fixes #2768) LteUeNetDevice has a null MAC address

So I suspect that this has been broken since that time.

As for how to fix it, one possibility (kind of a hack) is to truncate the top two bytes of the LTE Mac64 address and fit it back into a 48-bit field like it was before, within the DCE code. This would mean that ns-3's view of this address is 64-bit, but DCE sees it as a 48-bit address; this difference may not matter in practice.

ParthPratim commented 2 years ago

Thank you Sir for the writeup on the issue.

I was wondering: if LTE uses 64-bit addresses, why don't we just move from 48 to 64 bits in DCE's implementation, since we just CopyTo or CopyFrom different src and/or dst buffers? Would that be feasible? I'm sorry, I don't have much idea about this, so I'm just guessing that it's a possibility based on the source code.

tomhenderson commented 2 years ago

I'm guessing that there are assumptions in DCE that the MAC addresses it has to handle are 48 bits, so I think we would end up changing code in net-next-nuse. That is why I think that we ought to try mapping the MAC addresses from LTE devices to 48 bits like they used to be prior to 2018 (since the upper two bytes are just zero padding).

In the LTE transmit direction, the Address parameter is ignored in the ::Send() method call, so I don't think any changes are needed in that direction. In the receive direction, the handling of Address parameters should first check the Address length (GetLength()) and then if length is 8 bytes, there should be a mapping to a 6 byte value.
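The receive-side handling described above (check the length first, then map 8 bytes to 6) could look roughly like the following. It is a sketch on raw byte buffers only; `ToyMapTo48` and `Mac48Bytes` are hypothetical names and nothing here uses the ns-3 or DCE APIs.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Mac48Bytes = std::array<std::uint8_t, 6>;

// Hypothetical helper: fills `out` with 6 address bytes and returns true
// for a 6- or 8-byte input; returns false for any other length.
bool ToyMapTo48(const std::vector<std::uint8_t>& raw, Mac48Bytes& out) {
    if (raw.size() == 6) {
        // Already 48 bits: copy through unchanged.
        std::copy(raw.begin(), raw.end(), out.begin());
        return true;
    }
    if (raw.size() == 8) {
        // 64-bit address: skip the two most significant bytes.
        std::copy(raw.begin() + 2, raw.end(), out.begin());
        return true;
    }
    return false;  // unexpected length, leave `out` untouched
}
```

For example, the 8-byte value 00:00:00:00:00:00:00:01 maps to 00:00:00:00:00:01, while a 6-byte value is returned as is.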

One final consideration concerns this mapping. A simple truncation from 8 bytes to 6 bytes may work, but it may also allow a collision between Mac48Address values on different interfaces, because Mac64Address and Mac48Address are unique only within the scope of their class. For instance, a node might have an LTE device with a Mac64Address of 00:00:00:00:00:00:00:01 and a CSMA device with a Mac48Address of 00:00:00:00:00:01; truncating the Mac64Address would then yield two identical Mac48Address values. So perhaps truncation followed by a bitwise OR with a value like ff:00:00:00:00:00 would be safer.
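The collision scenario can be reproduced on plain byte arrays. This is a sketch under assumptions: `ToyTruncate` and `ToyTruncateAndMark` are hypothetical helpers, and the marker ff:00:00:00:00:00 is just the example value suggested above, not something the DCE code is known to use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Mac48Bytes = std::array<std::uint8_t, 6>;
using Mac64Bytes = std::array<std::uint8_t, 8>;

// Plain truncation: drop the two most significant bytes.
Mac48Bytes ToyTruncate(const Mac64Bytes& in) {
    Mac48Bytes out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = in[i + 2];
    }
    return out;
}

// Truncation followed by a bitwise OR with ff:00:00:00:00:00,
// which only touches the first (most significant) octet.
Mac48Bytes ToyTruncateAndMark(const Mac64Bytes& in) {
    Mac48Bytes out = ToyTruncate(in);
    out[0] |= 0xff;
    return out;
}
```

With the LTE address 00:00:00:00:00:00:00:01 and the CSMA address 00:00:00:00:00:01 from the example, `ToyTruncate` produces a value equal to the CSMA address, whereas `ToyTruncateAndMark` produces ff:00:00:00:00:01, which differs from it.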

ParthPratim commented 2 years ago

Okay Sir, I understand the idea. I had a few questions regarding a possible way to implement this.

Could we write these operations as a function defined under KernelSocketFdFactory, with scope restricted to that class? Or should we write them as a DCE global utility?

Also Sir, by bitwise OR, do you mean ORing (f, 0) in the 1st and 2nd position of the 6th bit of the Mac48Address separately and then placing them in their respective positions?

tomhenderson commented 2 years ago

> Okay Sir, I understand the idea. I had a few questions regarding a possible way to implement this.
>
> Could we write these operations as a function defined under KernelSocketFdFactory, with scope restricted to that class? Or should we write them as a DCE global utility?

I would localize it in KernelSocketFdFactory with a comment that refers back to this issue.

> Also Sir, by bitwise OR, do you mean ORing (f, 0) in the 1st and 2nd position of the 6th bit of the Mac48Address separately and then placing them in their respective positions?

I mean address_bytes |= 0xff0000000000
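Read as an operation on the six address bytes packed into one integer, that statement sets every bit of the most significant octet and leaves the other five octets alone. A minimal sketch, assuming the bytes are packed most significant first; `ToyPack`, `ToyUnpack`, and `ToyMark` are hypothetical helpers introduced only for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Mac48Bytes = std::array<std::uint8_t, 6>;

// Pack six bytes, most significant first, into the low 48 bits of an integer.
std::uint64_t ToyPack(const Mac48Bytes& b) {
    std::uint64_t v = 0;
    for (std::uint8_t byte : b) {
        v = (v << 8) | byte;
    }
    return v;
}

// Inverse of ToyPack: split the low 48 bits back into six bytes.
Mac48Bytes ToyUnpack(std::uint64_t v) {
    Mac48Bytes b{};
    for (std::size_t i = 0; i < b.size(); ++i) {
        b[b.size() - 1 - i] = static_cast<std::uint8_t>(v & 0xff);
        v >>= 8;
    }
    return b;
}

// Apply the OR from the comment above to a 6-byte address.
Mac48Bytes ToyMark(const Mac48Bytes& b) {
    std::uint64_t address_bytes = ToyPack(b);
    address_bytes |= 0xff0000000000ULL;  // bits 40..47, i.e. the first octet
    return ToyUnpack(address_bytes);
}
```

Applied to 00:00:00:00:00:01, `ToyMark` returns ff:00:00:00:00:01.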

ParthPratim commented 2 years ago

Okay Sir.

Sir, I actually tried to truncate the extra two bytes, but even after I did that, it still hit the CheckCompatible(...) assertion error, because the type of the address was still that of the 64-bit class even though m_len had become 6 after I manipulated the bytes.

I noticed that all other Mac48Address values have the type field set to 02, but in our case with Mac64Address it was 03. I manually changed that to 02, and I could get the example to run. But I was wondering whether there is a better way to get the type number, and whether some ns-3 API call can help us with that. I'll try to dig into it.

Post this, I'll also try to handle the bitwise OR operation.

tomhenderson commented 2 years ago

Patch to fix: https://github.com/direct-code-execution/ns-3-dce/pull/127

This patch simply converts Mac64Address to Mac48Address as necessary by discarding the two most significant bytes from the 64-bit address. Doing further modifications to the converted address was not successful, and it is not clear whether any further modifications are needed.