NetSys / bess

BESS: Berkeley Extensible Software Switch

Segfaults causing BESS to die #722

Open rware opened 6 years ago

rware commented 6 years ago

I've got the following BESS config file (where $NIC_PCI is a NIC port connected to DPDK):

import os

pmdport = PMDPort(pci=$NIC_PCI)
vport = VPort(ifname='bess0', ip_addrs=[$IPADDR_MASK])

# check if environment variable says to use Queue
btl_queue = Queue(size=2**5)
btl_queue.set_burst(burst=1)
bess.add_tc('bit_limit',
       policy='rate_limit',
       resource='bit',
       limit={'bit': 100000})
btl_queue.attach_task(parent='bit_limit')
PortInc(port=pmdport.name) -> PortOut(port=vport)
PortInc(port=vport) -> btl_queue -> PortOut(port=pmdport.name)

os.system('sudo ifconfig bess0 broadcast {}'.format($BCAST))

I'm running the config file as follows: /opt/bess/bessctl/bessctl daemon start -- run [BESS_CONF_FILE_NAME] [ENV_VARS]

After running this BESS config file ~5-10 times, every subsequent run makes the bess daemon die with the following stack trace in /tmp/bessd_crash.log:

F1111 05:05:50.691910 16137 debug.cc:389] A critical error has occured. Aborting...
Signal: 11 (Segmentation fault), si_code: 1 (SEGV_MAPERR: address not mapped to object)
pid: 16098, tid: 16137, address: 0x140, IP: 0x73f946
Backtrace (recent calls first) ---
(0): /opt/bess/core/bessd(_ZN5VPort11RecvPacketsEhPPN4bess6PacketEi+0x176) [0x73f946]
    VPort::RecvPackets(unsigned char, bess::Packet**, int) at /opt/bess/core/drivers/vport.cc:589
         586:     pkt = pkts[i] = bess::Packet::from_paddr(paddr[i]);
         587: 
         588:     tx_desc = pkt->scratchpad<struct sn_tx_desc *>();
      -> 589:     len = tx_desc->total_len;
         590: 
         591:     pkt->set_data_off(SNBUF_HEADROOM);
         592:     pkt->set_total_len(len);
(1): /opt/bess/core/modules/port_inc.so(+0x75fec) [0x7fa904d63fec]
    PortInc::RunTask(void*) at /opt/bess/core/modules/port_inc.cc:121
      -> 121:   batch.set_cnt(p->RecvPackets(qid, batch.pkts(), burst));
(2): /opt/bess/core/modules/port_inc.so(_ZN7PortInc7RunTaskEPv+0x20) [0x7fa904d64260]
    PortInc::RunTask(void*) at /opt/bess/core/modules/port_inc.cc:102
      -> 102: struct task_result PortInc::RunTask(void *arg) {
(3): /opt/bess/core/bessd(_ZNK4TaskclEv+0x13) [0x68c0e3]
    Task::operator()() const at /opt/bess/core/task.cc:48
      -> 48:   return module_->RunTask(arg_);
(4): /opt/bess/core/bessd(_ZN4bess16DefaultScheduler12ScheduleLoopEv+0x2a9) [0x67e0e9]
    bess::DefaultScheduler::ScheduleOnce() at /opt/bess/core/scheduler.h:263
      -> 263:       auto ret = (*leaf->task())();
     (inlined by) bess::DefaultScheduler::ScheduleLoop() at /opt/bess/core/scheduler.h:246
      -> 246:       ScheduleOnce();
(5): /opt/bess/core/bessd(_ZN6Worker3RunEPv+0x1cb) [0x67c5eb]
    Worker::Run(void*) at /opt/bess/core/worker.cc:317
      -> 317:   scheduler_->ScheduleLoop();
(6): /opt/bess/core/bessd(_Z10run_workerPv+0x72) [0x67cb02]
    run_worker(void*) at /opt/bess/core/worker.cc:331
      -> 331:   return ctx.Run(_arg);
(7): /opt/bess/core/bessd() [0xd7d34f]
    execute_native_thread_routine at thread.o:?
(8): /lib/x86_64-linux-gnu/libpthread.so.0(+0x76b9) [0x7fa906f0f6b9]
    start_thread at ??:?
(9): /lib/x86_64-linux-gnu/libc.so.6(clone+0x6c) [0x7fa9065213dc]
    clone at /build/glibc-bfm8X4/glibc-2.23/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:109
        (file/line not available)

At first the run command worked fine but ~30s later the daemon would die with the segfault error above.
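Side note on reading the first trace: a faulting address of 0x140 with SEGV_MAPERR is the classic signature of reading a struct field through a (near-)null pointer — the dereferenced address is just the field's offset from address zero, which fits the crashing line `len = tx_desc->total_len;`. A tiny sketch of that arithmetic (the 0x140 offset of total_len is an assumption inferred from the fault address, not taken from the BESS headers):

```python
# If tx_desc is NULL and total_len lives at byte offset 0x140 in the
# descriptor layout (an assumption), then reading tx_desc->total_len
# touches address 0 + 0x140 -- exactly the reported fault address.
ASSUMED_TOTAL_LEN_OFFSET = 0x140  # hypothetical field offset

def dereferenced_address(base_ptr: int, field_offset: int) -> int:
    """Address actually touched when reading base_ptr->field."""
    return base_ptr + field_offset

# A NULL descriptor pointer reproduces the crash-log address.
print(hex(dereferenced_address(0x0, ASSUMED_TOTAL_LEN_OFFSET)))  # 0x140
```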

Recently it started dying as soon as I try to run this config file with this error:

A critical error has occured. Aborting...
Signal: 11 (Segmentation fault), si_code: 2 (SEGV_ACCERR: invalid permissions for mapped object)
pid: 6529, tid: 6536, address: 0x7fbebc005640, IP: 0x7fbebc005640
Backtrace (recent calls first) ---
(0): [0x7fbebc005640]
sangjinhan commented 6 years ago

Could you confirm whether this is still a lingering issue? I tried to reproduce it locally (a couple of times last year, and again today), but I have not been able to.

I vaguely remember seeing a similar issue, and the cause was that the kernel module and the BESS daemon were out of sync; the problem went away once I rebuilt the module from scratch.

rware commented 6 years ago

Yes and no. I'm not sure what was going on in my original issue, but I did notice that I get a segfault and BESS dies whenever I set the queue size to something other than a power of 2. Instead of an error telling me the queue size must be a power of 2, I get a segfault and the BESS daemon dies. I've reinstalled BESS on a new machine with a fresh install of Ubuntu, so I don't think things being out of sync is the issue.

Here's an example config I used to reproduce this error:

import os

bess.add_worker(wid=1, core=1)
bess.add_tc('rl', policy='rate_limit', wid=1, resource='bit', limit={'bit': int(1e6)})

os.system('docker pull ubuntu > /dev/null')
os.system('docker run -i -d --net=none --name=vport_test ubuntu /bin/bash > /dev/null')

# Alice lives outside the container, wanting to talk to Bob in the container
v_bob = VPort(ifname='eth_bob', docker='vport_test', ip_addrs=['10.255.99.2/24'])
v_alice = VPort(ifname='eth_alice', ip_addrs=['10.255.99.1/24'])

PortInc(port=v_alice) -> q::Queue(size=10) -> PortOut(port=v_bob)
PortInc(port=v_bob) -> PortOut(port=v_alice)
q.attach_task(parent='rl')

And here's the output when I run it:

*** Error: Unhandled exception in the configuration script (most recent call last)
  File "/opt/bess/bessctl/conf/test.bess", line 13, in <module>
    PortInc(port=v_alice) -> q::Queue(size=10) -> PortOut(port=v_bob)
  File "/opt/bess/bessctl/commands.py", line 132, in __bess_module__
    return make_modules([module_names])[0]
  File "/opt/bess/bessctl/commands.py", line 112, in make_modules
    obj = mclass_obj(*args, name=module, **kwargs)
  File "/opt/bess/bessctl/../pybess/module.py", line 58, in __init__
    self.choose_arg(None, kwargs))
  File "/opt/bess/bessctl/../pybess/bess.py", line 425, in create_module
    return self._request('CreateModule', request)
  File "/opt/bess/bessctl/../pybess/bess.py", line 258, in _request
    raise self.Error(code, errmsg, query=name, query_arg=req_dict)
*** Error: RPC failed to localhost:10514 - <_Rendezvous of RPC that terminated with (StatusCode.UNKNOWN, Stream removed)>
From /tmp/bessd_crash.log (Wed Jan 10 15:31:44 2018):
A critical error has occured. Aborting...
Signal: 11 (Segmentation fault), si_code: 2 (SEGV_ACCERR: invalid permissions for mapped object)
pid: 23368, tid: 23387, address: 0x7f9ff00003f8, IP: 0x7f9ff00003f8
Backtrace (recent calls first) ---
(0): [0x7f9ff00003f8]
  Command failed: run test

And a "pgrep bessd" shows the BESS daemon isn't running anymore.
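For context on the size constraint above: BESS's Queue is backed by a ring buffer, and ring implementations in this family (e.g. DPDK's rte_ring) typically require a power-of-2 capacity. A minimal sketch of the up-front validation one would expect before the size ever reaches the ring (hypothetical helper names, not BESS's actual API):

```python
def is_power_of_two(n: int) -> bool:
    """True iff n is a positive power of two.

    n & (n - 1) clears the lowest set bit, so it is 0 exactly when
    n has a single set bit, i.e. is a power of two."""
    return n > 0 and (n & (n - 1)) == 0

def validate_queue_size(size: int) -> int:
    """Reject sizes the backing ring cannot accept, so a bad config
    fails with a clear error instead of crashing the datapath later."""
    if not is_power_of_two(size):
        raise ValueError('queue size must be a power of 2, got %d' % size)
    return size

# size=2**5 from the first config passes; size=10 from the repro fails.
validate_queue_size(32)
# validate_queue_size(10)  # raises ValueError
```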

loganwhite commented 5 years ago

Hi, I'm facing the same problem when I try the examples posted on https://github.com/NetSys/bess/wiki/Hooking-up-BESS-Ports#connecting-vms-and-containers-with-bess-vports. The logs are shown below. Could anyone help me solve this problem?

Log file created at: 2018/12/19 14:48:04
Running on machine: 
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F1219 14:48:04.838188  9890 debug.cc:405] A critical error has occured. Aborting...
Signal: 11 (Segmentation fault), si_code: 1 (SEGV_MAPERR: address not mapped to object)
pid: 9869, tid: 9890, address: 0x140, IP: 0x563a9e5eea4e
Backtrace (recent calls first) ---
(0): /home/???/bess/core/bessd(_ZN5VPort11RecvPacketsEhPPN4bess6PacketEi+0x19e) [0x563a9e5eea4e]
    VPort::RecvPackets(unsigned char, bess::Packet**, int) at /home/???/bess/core/drivers/vport.cc:620
         617:     pkt = pkts[i] = bess::Packet::from_paddr(paddr[i]);
         618:
         619:     tx_desc = pkt->scratchpad<struct sn_tx_desc *>();
      -> 620:     len = tx_desc->total_len;
         621:
         622:     pkt->set_data_off(SNBUF_HEADROOM);
         623:     pkt->set_total_len(len);
(1): /home/???/bess/core/bessd(_ZN7PortInc7RunTaskEP7ContextPN4bess11PacketBatchEPv+0xeb) [0x563a9e67fe1b]
    PortInc::RunTask(Context*, bess::PacketBatch*, void*) at /home/???/bess/core/modules/port_inc.cc:121
      -> 121:   batch->set_cnt(p->RecvPackets(qid, batch->pkts(), burst));
(2): /home/???/bess/core/bessd(_ZNK4TaskclEP7Context+0x53) [0x563a9e495f23]
    Task::operator()(Context*) const at /home/???/bess/core/task.cc:53
      -> 53:   struct task_result result = module_->RunTask(ctx, &init_batch, arg_);
(3): /home/???/bess/core/bessd(_ZN4bess16DefaultScheduler12ScheduleLoopEv+0x1cc) [0x563a9e4d178c]
    bess::DefaultScheduler::ScheduleLoop() at /home/???/bess/core/scheduler.h:272
      -> 272:       auto ret = (*ctx->task)(ctx);
     (inlined by) bess::DefaultScheduler::ScheduleLoop() at /home/???/bess/core/scheduler.h:250
      -> 250:       ScheduleOnce(&ctx);
(4): /home/???/bess/core/bessd(_ZN6Worker3RunEPv+0x20d) [0x563a9e4ceb7d]
    Worker::Run(void*) at /home/???/bess/core/worker.cc:316
      -> 316:   scheduler_->ScheduleLoop();
(5): /home/???/bess/core/bessd(_Z10run_workerPv+0x7c) [0x563a9e4cee4c]
    run_worker(void*) at /home/???/bess/core/worker.cc:330
      -> 330:   return current_worker.Run(_arg);
(6): /home/???/bess/core/bessd(+0xc82a6e) [0x563a9eda6a6e]
    execute_native_thread_routine at thread.o:?
(7): /lib/x86_64-linux-gnu/libpthread.so.0(+0x76da) [0x7f8709f686da]
    start_thread at /build/glibc-OTsEL5/glibc-2.27/nptl/pthread_create.c:463
        (file/line not available)
(8): /lib/x86_64-linux-gnu/libc.so.6(clone+0x3e) [0x7f87094d788e]
    clone at /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
        (file/line not available)