Updates for OcPoC DF Stability

alexshirley commented 7 years ago

Our recent testing discovered some issues with DriverFramework on OcPoC hardware:

1. The HMC5883 datasheet (pg. 12) recommends sampling at no more than 75 Hz in single-measurement mode when not monitoring the DRDY interrupt pin, and the external HMC5883 packaged with many popular external GPS modules generally exposes no interrupt pin. On OcPoC hardware, there are also occasional glitches on the I2C line during the _measure() callback, and the subsequent write of the single-measurement setting would lock up the kernel just long enough to cause an Accel #0 TOUT! error. For this reason, we now put the HMC5883 into continuous-measurement mode during hmc5883_init(), so we no longer have to rewrite the measurement mode after every read.
2. As PR #155 disabled the scheduling adjustment for embedded platforms, we have now disabled that adjustment for OcPoC hardware as well.
3. During testing, there seemed to be a memory leak that caused all DriverFramework devices to fail simultaneously, leading to uncontrolled flight. Upon stopping PX4 we would find that the SPI devices were sometimes still accessible from testing scripts, while the I2C devices would segfault or return corrupted data. We suspect that when the bus drivers leave their file descriptors open, they continually allocate memory until the devices are closed. Leaving device files open for most of the run also ties up machine resources. We have therefore introduced opening and closing of the device files around each normal I/O transaction, in keeping with NASA's '94 C Style Guide, Chapter 8.2: "Free allocated memory as soon as possible".

Note that we have kept all changes board-specific to OcPoC to avoid unnecessary changes or unintended consequences on other hardware, but these changes may be worthwhile to test for more generic implementation.
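
To make the trade-off in the continuous-mode change concrete, here is a minimal sketch (not the code from this PR) that counts how many mode-register writes each approach puts on the bus. The mode register address (0x02) and the mode values (0x00 continuous, 0x01 single) are taken from the HMC5883L datasheet; `FakeBus` and `count_mode_writes` are hypothetical names used only for illustration, and no real I2C traffic is generated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Mode register address and mode values per the HMC5883L datasheet.
constexpr uint8_t HMC5883_REG_MODE = 0x02;
constexpr uint8_t HMC5883_MODE_CONTINUOUS = 0x00;
constexpr uint8_t HMC5883_MODE_SINGLE = 0x01;

// Hypothetical stand-in for the I2C bus: it only records register writes.
struct FakeBus {
    std::vector<std::pair<uint8_t, uint8_t>> writes;
    void write_reg(uint8_t reg, uint8_t value) { writes.emplace_back(reg, value); }
};

// Returns how many mode-register writes are issued while taking `samples` readings.
std::size_t count_mode_writes(bool continuous, std::size_t samples) {
    FakeBus bus;
    if (continuous) {
        // Continuous mode: configure once during init.
        bus.write_reg(HMC5883_REG_MODE, HMC5883_MODE_CONTINUOUS);
    }
    for (std::size_t i = 0; i < samples; ++i) {
        if (!continuous) {
            // Single-measurement mode: re-arm the sensor before every reading.
            bus.write_reg(HMC5883_REG_MODE, HMC5883_MODE_SINGLE);
        }
        // The 6-byte data read would happen here in a real driver.
    }
    // Sanity check: continuous mode never issues more than the one init write.
    assert(!continuous || bus.writes.size() == 1);
    return bus.writes.size();
}
```

For 100 readings, `count_mode_writes(false, 100)` reports 100 writes while `count_mode_writes(true, 100)` reports 1, so each reading carries one fewer bus transaction that could hit an I2C glitch.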

bkueng commented 7 years ago

> The HMC5883 datasheet (pg. 12) recommends sampling at no more than 75 Hz in single measurement mode when not monitoring the DRDY interrupt pin, and in general there is no interrupt pin from the external HMC5883 normally packaged with many popular external GPS modules. On OcPoC hardware, there are also occasional glitches in the I2C line during the _measure() callback, and the subsequent write single-measurement setting would lock up the kernel just long enough to cause an Accel #0 TOUT! error. For this reason, we set the HMC5883 into continuous measurement mode during hmc5883_init() so we no longer have to continuously write the measurement mode after every read.

@julianoes you're actually in a better position to review this: should we generally switch to continuous mode?

> Upon stopping PX4 we would find that the SPI devices were sometimes still accessible from testing scripts, and the I2C devices would entirely segfault or return corrupted data. It’s suspected that when the bus drivers leave their file descriptors open they continually allocate memory until the devices are closed

Seems to me there is more going wrong than just memory that is not freed until a close. Can you check with cat /proc/meminfo whether one of the values keeps increasing? I'd like to have your hypothesis confirmed; otherwise it could very well be that the changes here just make the problem appear less often. In general I'd keep the FDs open, since we access them regularly and at high frequency (I would not consider them unused). Reopening causes unnecessary overhead if it is not strictly required. But I'm not sure what the recommendation/best practice for this use case is.
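
One way to run the /proc/meminfo check suggested above is to take two snapshots some time apart and compare individual fields. The following is only a sketch under that assumption: `parse_meminfo` and `meminfo_delta_kb` are hypothetical helper names, and the parser handles only the usual `Key:   value kB` line layout.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

using MemInfo = std::map<std::string, long>;

// Parses "Key:   value kB" lines, as printed by /proc/meminfo, into key -> value.
MemInfo parse_meminfo(const std::string &text) {
    MemInfo out;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) {
            continue;  // not a key/value line
        }
        std::istringstream rest(line.substr(colon + 1));
        long value = 0;
        if (rest >> value) {
            out[line.substr(0, colon)] = value;
        }
    }
    return out;
}

// Returns how much `key` changed between two snapshots (after minus before).
long meminfo_delta_kb(const MemInfo &before, const MemInfo &after, const std::string &key) {
    // Precondition: the field must be present in both snapshots.
    assert(before.count(key) == 1 && after.count(key) == 1);
    return after.at(key) - before.at(key);
}
```

In practice the two input strings would come from reading /proc/meminfo (for example with `std::ifstream`) before and after a test run. A field such as `Slab` or `MemFree` that moves steadily in one direction across several snapshots would support the leak hypothesis; flat values would point elsewhere.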

alexshirley commented 7 years ago

Setting the HMC5883 to continuous mode seemed like an obvious change for us, especially when we were seeing errors like:

59934994 error: read register reports a read of -1 bytes, but attempted to set 6 bytes
ERROR [sensors] Accel #0 fail: TOUT!
(then back to normal)

When we print out the timing of the reads and writes, the I2C line was occasionally getting hung at ~18000 us. With the proposed change we no longer see the 'Accel #0 TOUT!' but we still see occasional hangs on the I2C line.
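
The timing printout described above can be approximated by wrapping each bus transaction in a monotonic-clock measurement. This is a minimal sketch rather than the instrumentation actually used: `time_transfer`, `TimedResult`, and `fake_slow_transfer` are hypothetical names, and the 5 ms sleep merely stands in for a stalled I2C transfer.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

struct TimedResult {
    int64_t elapsed_us;  // wall time spent inside the transfer
    bool slow;           // true if elapsed_us reached the threshold
};

// Runs `transfer` and reports how long it blocked, using a monotonic clock.
TimedResult time_transfer(const std::function<void()> &transfer, int64_t slow_threshold_us) {
    assert(slow_threshold_us > 0);
    const auto start = std::chrono::steady_clock::now();
    transfer();
    const auto end = std::chrono::steady_clock::now();
    const int64_t elapsed_us = static_cast<int64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    return TimedResult{elapsed_us, elapsed_us >= slow_threshold_us};
}

// Hypothetical stand-in for an I2C read that stalls for a few milliseconds.
void fake_slow_transfer() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}
```

In a driver, the callable would be the actual read or write, and any result with `slow` set could be logged with its timestamp so that stalls like the ~18000 us ones can be lined up against the TOUT errors.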

Disabling the scheduling adjustment seems obvious to us.

We checked 'cat /proc/meminfo' and did not see any major changes. You are correct, however: our fix seems only to delay total sensor failure (that is, to lower the failure rate, lambda) rather than prevent it. On Friday we had 5+ hours of stability. On Monday we had a failure within 2 minutes of takeoff.

What we see:

60809136 error: read register reports a read of -1 bytes, but attempted to set 6 bytes
ERROR [sensors] Accel #0 fail: TOUT!
ERROR [sensors] Gyro #0 fail: TOUT!
ERROR [sensors] Mag #0 fail: TOUT!
(now all sensors are TOUT indefinitely)

When we print out the timing for reading/writing on the HMC5883 device, there is no sign that total failure is imminent.

The question now is: "What has control over the sensors to force all of them into permanent timeout?" As previously stated, upon quitting PX4 we can usually access the SPI devices without problem, but the I2C devices typically behave strangely, if they respond at all.

julianoes commented 7 years ago

> @julianoes you're actually in a better position to review this: should we generally switch to continuous mode?

Ok sure. I don't remember what the reasons were: either having the highest rate possible or matching PX4/Firmware (the NuttX driver).

bkueng commented 7 years ago

@alexshirley I agree with your first 2 points. I can pull them in already if you'd like - enabling continuous mode only for OcPoC for now until we have further tests on other boards.

3rd point: sounds to me like a kernel problem or even HW-related. I'd investigate in 2 directions:

try different kernel versions (newer and/or older)
try running px4 using only a single sensor at a time, disabling all others (the rgbled accesses I2C as well, so it should be considered too). If you don't have issues anymore, it points to a kernel problem, since the DF drivers access different device files and they run on the same thread (so they should not interfere with each other, except for different timings).

alexshirley commented 7 years ago

Let's move ahead on pulling the first two.

We've spent the last few days trying to figure out what's going on with the third point. Some additional notes:

We can run everything quite stably except the HMC5883. I've thrown together a user-level HMC5883 driver to constantly pull data and print to console, and that has repeatedly run overnight without issue. Fundamentally, I don't think there's a difference between my driver and the current PX4 HMC5883 driver, which made us suspect that it is a scheduler issue. What's most baffling is that this issue doesn't seem to appear on APM either.
Since all the devices do run on the same thread, but all run on different device-files, if one device hangs indefinitely, won't they all hang?

bkueng commented 7 years ago

I cherry-picked the first 2 commits to master.

> Since all the devices do run on the same thread, but all run on different device-files, if one device hangs indefinitely, won't they all hang?

That's correct, which is why you see the other sensors time out as well. You've probably done this, but does the kernel give you any indication after a failure, using dmesg?
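
The single-thread effect can be illustrated with a toy model, which is a sketch and not DriverFramework's actual scheduler: callbacks run back to back on one simulated worker, so a transfer that blocks pushes every later callback past its deadline. `Callback` and `late_callbacks` are hypothetical names, and all durations other than the ~18000 us stall reported earlier are made-up values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One simulated driver callback: its name and how long its bus transfer blocks.
struct Callback {
    std::string name;
    int64_t blocking_us;
};

// Runs the callbacks back to back on one simulated worker thread and returns
// the names of those that finish later than `deadline_us` after the cycle start.
std::vector<std::string> late_callbacks(const std::vector<Callback> &cycle, int64_t deadline_us) {
    assert(deadline_us > 0);
    std::vector<std::string> late;
    int64_t now_us = 0;
    for (const Callback &cb : cycle) {
        // The worker is stuck here until this transfer returns.
        now_us += cb.blocking_us;
        if (now_us > deadline_us) {
            late.push_back(cb.name);
        }
    }
    return late;
}
```

With a cycle of `{"mag", 18000}, {"accel", 200}, {"gyro", 200}` and a 5000 us deadline, all three names come back as late even though only the mag transfer stalled. If the mag transfer never returns, the accel and gyro callbacks never run at all, which matches the permanent TOUT on every sensor.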

alexshirley commented 7 years ago

We hadn't previously seen anything in dmesg, probably because everything else fell apart. But today I was able to catch the error in dmesg:

Unable to handle kernel NULL pointer dereference at virtual address 00000004

pgd = de184000

[00000004] *pgd=1d7b8831, *pte=00000000, *ppte=00000000

Internal error: Oops - BUG: 17 [#1] PREEMPT SMP ARM

Modules linked in:

CPU: 1 PID: 1629 Comm: DFWorker Not tainted 4.0.0-rt6-xilinx-00087-g634c857-dirty #23

Hardware name: Xilinx Zynq Platform

task: de0f2840 ti: dd748000 task.ti: dd748000

PC is at __xiic_start_xfer+0x6b4/0x7b8

LR is at __xiic_start_xfer+0x680/0x7b8

pc : [<c03d7e2c>]    lr : [<c03d7df8>]    psr: 800f0013
sp : dd749eb0  ip : c0846110  fp : 000f4f08
r10: c07d44c0  r9 : dd749f34  r8 : e09c0000
r7 : c07d6f44  r6 : e09c0000  r5 : 0000001f  r4 : dd8d8c10
r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : 0000000e

Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user

Control: 18c5387d  Table: 1e18404a  DAC: 00000015

Process DFWorker (pid: 1629, stack limit = 0xdd748218)

Stack: (0xdd749eb0 to 0xdd74a000)

9ea0:                                     dd8d8c10 00000000 00000004 00000001
9ec0: dd8d8c38 c03d83f0 000f6e88 00000001 00000000 c0071fa0 00000000 c00381c8
9ee0: ffffffff 00000000 dd8d8c38 c0846110 0003af1f dd749f34 00000001 c03d28e8
9f00: dd8d8c38 dd749f34 00000001 00000001 00000001 b6c6dd10 00000000 c03d3258
9f20: 00000001 00000001 dd601600 c03d32c8 dd601600 0000001e 00000001 dd66d6c0
9f40: dd66d6c0 c03d5008 dd083800 b6c6dd10 dd749f80 c00c2434 dd083800 b6c6dd10
9f60: 00000001 00000000 00000000 dd083800 dd083801 00000001 b6c6dd10 c00c29c0
9f80: 00000000 00000000 00000001 00000005 00000000 00000001 00000004 c000dfa4
9fa0: dd748000 c000de20 00000005 00000000 00000005 b6c6dd10 00000001 00000008
9fc0: 00000005 00000000 00000001 00000004 00100160 000dfea0 b6c6dd10 000f4f08
9fe0: 00000000 b6c6dd00 b6f0e4e9 b6f0e4f0 800f0030 00000005 11f2da41 4887b015

[<c03d7e2c>] (__xiic_start_xfer) from [<c03d83f0>] (xiic_xfer+0x78/0x140)
[<c03d83f0>] (xiic_xfer) from [<c03d28e8>] (__i2c_transfer+0x54/0x84)
[<c03d28e8>] (__i2c_transfer) from [<c03d3258>] (i2c_transfer+0x88/0xc0)
[<c03d3258>] (i2c_transfer) from [<c03d32c8>] (i2c_master_send+0x38/0x48)
[<c03d32c8>] (i2c_master_send) from [<c03d5008>] (i2cdev_write+0x40/0x54)
[<c03d5008>] (i2cdev_write) from [<c00c2434>] (vfs_write+0xb4/0x188)
[<c00c2434>] (vfs_write) from [<c00c29c0>] (SyS_write+0x3c/0x7c)
[<c00c29c0>] (SyS_write) from [<c000de20>] (ret_fast_syscall+0x0/0x34)
Code: e282200c e5842280 eafffe68 e5943280 (e1d320b4)
---[ end trace 0000000000000002 ]---

I'm looking into building a new kernel, since this is very likely caused by the Xilinx kernel I2C driver (the oops is raised in __xiic_start_xfer).

alexshirley commented 7 years ago

We patched our kernel and for the most part things have become very stable. However, we do have a second board (an experimental breakout board) which still gets ACCEL & GYRO TOUT! every few hours. Unlike the previous failure, sometimes they completely recover, and sometimes they TOUT forever. I've looked at dmesg when that occurs and there seems to be nothing wrong.

Since that doesn't seem to be a part of this PR, I'm just going to close it in 24 hours or so.

PX4 / DriverFramework

Updates for OcPoC DF Stability #216