Openvario / sensord

Daemon to poll air data from Openvario sensorboard

Blips in MS5611 sensor data #23

Open hpmax opened 4 years ago

hpmax commented 4 years ago

When I turn on my OpenVario, I will see a lot of +/- blips on the vario, every few seconds, with a magnitude in the neighborhood of .5 m/s.

I decided to record the data out of the MS5611s and found that upon boot up, I'd see a spike high (of about 35-40 Pascals) followed by an immediate low spike of perhaps 10-15 Pascals. The high pulse lasted about 2 samples, and the low pulse lasted perhaps 10 samples.

Interestingly, the effect occurred on BOTH MS5611s simultaneously, and it repeated every 92-94 cycles (it varied).

I killed the daemon and restarted it, and the blips mostly or completely went away. I recorded twice, once for over 20 minutes and never saw a blip, once for over a minute and only saw one blip. This seems almost completely repeatable for me.

Blue trace is the restart, the other 3 were what it does on bootup. blips.zip blips

mihu-ov commented 4 years ago

These blips (and incorrect pressure reading spikes) are an unsolved mystery and further investigation would be more than welcome. One known issue is RF noise from the VHF transmitter but this seems to be something different.

hpmax commented 4 years ago

@mihu-ov My glider has a Dittel FSG-2T but the unit was off. It is also equipped with a Stratux (which has no transmitter), and a SoftRF unit, both of which were on. The SoftRF is capable of transmitting at 900-ish MHz although it's a pretty low power output. I also have multiple switching converters going from 12V to 5V to, among other things, power the SoftRF and Stratux... But their state should have been unaffected by stopping and restarting the daemon, so this seems unlikely.

So you are saying this is a known issue? I'm surprised more hasn't been done about this; it seems like a pretty significant problem. The blips do get through the Kalman filter and are quite noticeable.

mihu-ov commented 4 years ago

In my setup I have no issue with blips and only minimal effects when transmitting on VHF. In one of our club gliders (DG300) there is an occasional blip and the vario goes wild when transmitting. Your traces are probably an extreme case - that's great for investigating the problem (RF issues or anything else).

hpmax commented 4 years ago

Again, I see absolutely no reason to believe this is RF (at least from outside the OV) related. My VHF transceiver was off at the time. Those waveforms are 100% repeatable. When I start the OS, the red, orange, and purple traces are what happen. You can see it is regular; they are occurring every 92 to 94 samples. When I kill the process and restart it, the blips either mostly or completely go away. I do not see how an external device could be impacting that.

blips2

Blue is TE, Orange is static... out of the same data file.

mihu-ov commented 4 years ago

I agree, I don't believe your case is RF induced. https://github.com/PX4/Firmware/issues/7311 describes a similar effect. https://px4.github.io/Firmware-Doxygen/d6/d17/ms5611_8cpp_source.html is the code used in PX4. Maybe a comparison of what they do differently would make sense? Is there any difference in the timing of sensor reads during the blip? What is the sensor temperature readout at each sample?

hpmax commented 4 years ago

There are some signs that point to this, but others that don't:

1) If the problem was an inadequate wait between conversion and readout, it's likely it would affect both sensors. It does.
2) If a temperature conversion was corrupted, that temperature data is used for something like 10 pressure samples before it's re-sampled. That's about the length of the blip.

None of this explains why only temperature conversions are corrupted, why it's periodic, or why restarting makes the problem go away.

But here's the other thing... Even if the data is somehow corrupted, why is it consistently increasing for a couple of samples and then consistently decreasing for more samples? I mean, okay, maybe we have a bad temperature conversion, this results in a bad compensation value, the next pressure read is consistently high, and the temperature compensation corruption consistently decreases it until the next temp conversion... but there are multiple high samples prior to multiple low samples, and the shape isn't very consistent.

Anyway, I'll try a few experiments:

1) Modify the code to dump out the measurement counter and the temperature.
2) Increase the usleep time to, say, 20ms.
3) Record the time between conversion and read.

hpmax commented 4 years ago

Okay, here's what I found:

I tried everything listed above and found nothing of use. Increasing the usleep time may have decreased the magnitude of the problem. The problem occurred at random meas_counter values, and was not tied to a temperature. I tried adding a usleep at the beginning, and tried resetting in the middle of the run. And while the time between conversion and read was not terribly consistent, it was always longer than the requested usleep value. In other words, that whole exploration was a bust.

But, I did find something very interesting out.

If sensord is started with systemctl start sensord.service (or restart), it has the blips.
If sensord is started with /opt/bin/sensord -c /opt/conf/sensord.conf, it does NOT have the blips.

This seems 100% repeatable. Doesn't matter if the service is disabled or enabled, doesn't matter if it's started, stopped, started again, or restarted, or whatever. Similarly, when I run it from the command line, it appears to run clean.

If I edit /lib/systemd/system/sensord.service to change it from Type = forking to Type = simple, and change the command line to /opt/bin/sensord -f -c /opt/conf/sensord.conf, it does NOT have the blips.
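For reference, that unit-file change would look something like the following. This is a minimal sketch of just the lines involved; the original ExecStart line is assumed from the command above, and the real sensord.service likely contains additional entries:

    [Service]
    # before: Type=forking with ExecStart=/opt/bin/sensord -c /opt/conf/sensord.conf
    Type=simple
    ExecStart=/opt/bin/sensord -f -c /opt/conf/sensord.conf

With Type=simple, systemd treats the foreground (-f) process itself as the service, so sensord never goes through its own daemonizing code path.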

I disabled the service and added "/opt/bin/sensord -c /opt/conf/sensord.conf" to the ovmenu-ng.sh script, and it appears that is a workaround.

So something about systemctl type=forking is causing the problem. My solution is clearly a hack work-around, but hopefully someone who knows Linux better than me might have some ideas.

kedder commented 4 years ago

Is sensord printing anything to stdout periodically?

hpmax commented 4 years ago

I'm not sure what you mean. I didn't pipe stdout to anything, and if it's run through systemctl I won't see stdout. I don't recall seeing anything when running it from the command line, and wouldn't think I would unless I put in -f mode, which I didn't try. What were you expecting/looking for?

kedder commented 4 years ago

Just trying to think up anything that might be different between running from systemd and console.

hpmax commented 4 years ago

I don't know. I was talking to a friend who is way more Linux knowledgeable than me, and he was stumped too, but thought maybe it's run with different privileges or environment variables or something. But since it only fails as "type=forking" that doesn't make much sense... And I'm completely stumped why I'm the only person who seems to be having this problem.

hpmax commented 4 years ago

If you can think of something specific to try, I can certainly perform an experiment, but I went through everything obvious. I still have no idea what is causing the erroneous readings let alone why systemctl has anything to do with it.

hpmax commented 4 years ago

I figured something out...

The foreground code writes all the log data to stdout, which is the terminal:

    // open console again, but as file_pointer
    fp_console = stdout;
    stderr = stdout;
    setbuf(fp_console, NULL);
    setbuf(stderr, NULL);

    // close the standard file descriptors
    close(STDIN_FILENO);
    //close(STDOUT_FILENO);
    close(STDERR_FILENO);

The daemonizing code writes all the log data to sensord.log:

    // close the standard file descriptors
    close(STDIN_FILENO);
    close(STDOUT_FILENO);
    close(STDERR_FILENO);

    //open file for log output
    fp_console = fopen("sensord.log","w+");
    stderr = fp_console;
    setbuf(fp_console, NULL);

I tried replacing the daemonizing version's log-output setup with the foreground version's, then ran it as a daemon through systemctl (type=forking), and... it looks like it worked. blipsnew


Unfortunately, I also noticed a different kind of blip that I hadn't seen before. It only affects the static sensor, always seems to occur on both meas_counter==6 and 26 (and only those two), and it's much larger. In fact, the result almost seems random, although more often than not it's over 123000 and gets ignored by the pressure_read routine.

mihu-ov commented 4 years ago

Wow, great investigation, thank you! Ignoring obvious outliers was a rough hack but finding the root cause would be much better. There is some interesting reading on the effects of sample rate jitter on https://www.freesteel.co.uk/wpblog/2014/11/26/ms5611-altitude-sensor-very-sensitive-to-own-working-temperature/ and also http://blueflyvario.blogspot.com/2012/06/latency-and-timing.html

hpmax commented 4 years ago

I didn't get much out of the bluefly blog. A lot of it seems to be him noticing that filters can cause latency. Stuff like that is well understood, and manageable. The freesteel one, on the other hand, was definitely interesting. That doesn't look exactly like what I am seeing, but it's definitely similar. When I was looking at timing data I was looking for a short time, not an irregular time. All the times are irregular (although typically by less than 1ms), and by looking at the data I've already collected I can't see an obvious pattern (but the data I have is time from conversion to read, not from conversion to conversion).

I think others have pointed out that the way we are accessing the sensors isn't great. It's very difficult to precisely control timing on a non-real-time system, and to do this properly the conversions would have to be driven by hardware. Despite that, I think we could improve timing control further if that were necessary.

One obvious thing would be to monitor the time between samples and adjust usleep to better target a 12.5 ms interval. In other words, check the time immediately after waking up from usleep, then check the time right before going to usleep, and set the usleep time to the target time minus the awake time.
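For what it's worth, a minimal sketch of that adjustment (hypothetical names, not the actual sensord loop) could look like this:

    #include <time.h>
    #include <unistd.h>

    #define CYCLE_US 12500L  /* target 12.5 ms per slot */

    /* microseconds elapsed between two timespecs */
    static long delta_us(struct timespec a, struct timespec b)
    {
        return (a.tv_sec - b.tv_sec) * 1000000L + (a.tv_nsec - b.tv_nsec) / 1000L;
    }

    void sample_loop(void)
    {
        struct timespec wake, done;

        for (;;) {
            clock_gettime(CLOCK_MONOTONIC, &wake);   /* time right after waking up */
            /* ... start conversions, read sensors, log ... */
            clock_gettime(CLOCK_MONOTONIC, &done);   /* time right before sleeping */

            long awake_us = delta_us(done, wake);
            long sleep_us = CYCLE_US - awake_us;     /* shrink the sleep by the time spent awake */
            if (sleep_us > 0)
                usleep(sleep_us);
        }
    }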

Another possible thing would be to run this as nice -10. If for some reason timing is getting screwed up by a disk access (as might be implied by the fact that not writing to the log seemed to fix the problem), I'm not sure that changing the nice level will help, since the disk access might have higher priority than even nice -19 (or whatever the maximum un-niceness is).

What really bothers me is that I seem to be the only person who is affected. It's hard to believe... I'm using the Stefly hardware and the stock 20193 release and seeing the problem.

hpmax commented 4 years ago

Alright, I'm out of ideas at this point.

I tried nice -15; it didn't make much if any difference.

The glitches that I described two comments ago were caused by the code change, i.e. the same change that fixed the blips.

I'm out of ideas; it seems these things are finicky. As a workaround, I am successfully running it with systemctl as Type=simple with the -f command line option. No glitch, no blips, and it's only a minor change (edit to /lib/systemd/system/sensord.service) and it's good-to-go. That's how I plan to do it going forward.

I did notice we are recalculating the sensitivity and offset (the MS5611 internals) each time we read the pressure. Since these are uniquely tied to data that is calculated after a temperature read, it makes more sense to calculate them after each temperature read. This shouldn't change the results; we're just repeating calculations that don't need to be repeated.
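To illustrate: per the MS5611 datasheet, OFF and SENS depend only on dT from the most recent temperature conversion, so they can be computed once per D2 read and cached for the following pressure reads. A sketch with illustrative names (not sensord's actual structs):

    #include <stdint.h>

    /* run once after each D2 (temperature) conversion, per the MS5611 datasheet */
    void ms5611_update_temp(uint32_t D2, const uint16_t C[7],
                            int64_t *off, int64_t *sens, int32_t *temp)
    {
        int32_t dT = (int32_t)D2 - ((int32_t)C[5] << 8);
        *temp = 2000 + (int32_t)(((int64_t)dT * C[6]) >> 23);        /* 0.01 degC */
        *off  = ((int64_t)C[2] << 16) + (((int64_t)C[4] * dT) >> 7);
        *sens = ((int64_t)C[1] << 15) + (((int64_t)C[3] * dT) >> 8);
    }

    /* each D1 (pressure) read then only needs the cached off/sens */
    int32_t ms5611_pressure(uint32_t D1, int64_t off, int64_t sens)
    {
        return (int32_t)(((((int64_t)D1 * sens) >> 21) - off) >> 15); /* 0.01 mbar (= Pa) */
    }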

I still would like to experiment with higher sampling rates (and averaging) and with improving the accuracy of the timing. If we sampled on each 12.5ms clock, in theory that would reduce the standard deviation of the error by a factor of 2 with no major cost. Another question is... is there a good reason why we are putting out updates every 50ms? Does a vario actually benefit from being that fast, or would it be worth trading off some response rate for better filtering (albeit I don't have a real good grasp of the response rate of the Kalman filter)?

kedder commented 4 years ago

If removing logging removes the blips, I'd suspect filesystem write cache flushes for this effect. Just for fun, try to redirect log to /var/volatile, e.g. like this:

ln -sf /var/volatile/log/sensord.log /sensord.log

and restart the device. Will it make any difference?

kedder commented 4 years ago

Also, what are the contents of that log when you see the blips? Is sensord writing to that log when the blips are happening?

hpmax commented 4 years ago

It's not clear to me that removing logging fixes it. But I can definitely try the symlink next time.

I don't think it's writing anything to sensord.log... just a bunch of stuff when it's kicked off. There's a -r command line option (or something like that) which records the output. That's what I was using; I modified the string -r puts out. I can't tell that the -r write is affecting things. I don't think much of anything is being written to sensord.log (/home/root/.xcsoar/logs/sensord.log).

In general I have not tried correlating sensor data to vario output. But I have no reason to believe the blips in the raw data weren't causing the vario blips. When the sensor is "quiet" there are no vario blips.

iglesiasmarianod commented 4 years ago

Hi guys, had a similar issue some time ago. Periodically one of my pressure sensors spiked the vario. Not 5 m/s but a solid 2 or 3 m/s every 3 or 4 seconds. In my case it was a bad solder joint. Removed both sensors, resoldered them and my problem went away. Perhaps a resolder of both sensors might help you too. Another thing I would try is removing them one by one to see if one of them is faulty.

hpmax commented 4 years ago

Thanks for chiming in. Where did you get the hardware (and when)?

The blips I was seeing on the vario were about 0.5 m/s tops. I assume you didn't get to the point that you looked at the data as closely as I did (ie could confirm the simultaneous blips on static/TE and that the blips looked like my blips)...

iglesiasmarianod commented 4 years ago

You're welcome! I've built my own boards following the schematics. This was around November/December 2019. I had three assembled boards at that time and only one did that strange vario twitch. I tested the boards with i2cdetect only. I didn't dig deeper because the other boards worked OK with the same mainboard and same software image.

hpmax commented 4 years ago

I was using the stefly hardware, which I presume was assembled in a factory, so it seems unlikely there was a soldering issue. I was thinking if you had ordered from Stefan that perhaps there might have been a bad batch.

hpmax commented 4 years ago

I recorded the NMEA output and it was a bit disturbing when I looked at the graphs:

firste

This is a recording of the POV,E statement (i.e. vario output).

I have no idea what that hash is from 6500 to 10000. There are also 19 spikes to over 4 m/s. I don't actually understand this. I don't recall hearing any of the spikes while recording the data... and for that matter, I wouldn't have expected this much data. I would expect 20 per second, and that implies I recorded over an hour, and I think I recorded for about 15 minutes max. So... wuuuhat?

After 70000 it gets noisier. At some point toward the end (which I assume is that portion) I re-started the daemon and turned on temperature readings (see my next PR). Now, I did something a bit stupid here. I initiated the temperature conversion AFTER initiating the MS5611 conversions. That's probably a no no given the comments made on the Freesteel website. I tried a different variant which initiated the temperature conversion before the pressure conversions and...

std deviation with no temp conversions: ~ .019
std deviation with temp conversion (initiated before pressure): ~ .0315
std deviation with temp conversion (initiated after pressure): ~ .034

As you might expect, standard deviation of temperature increases from .06 degrees C when initiated before pressure to .11 degrees after.

And yes, when looking at the data in the second run which was only 4582 samples (4 minutes?) there was still a single blip -- with a magnitude of about +/- .15 mbar in the static (this of course is impacted by the IIR filter)

I'm thinking all these things are dumping noise into ground and it is sensitive to the quality of the ground.

Right now, we are performing both MS5611 conversions simultaneously, reading them out, and then waiting. There's no reason why we can't interleave the pressure readings.

The more I am looking at this, the more it feels like the hardware should be redesigned.

iglesiasmarianod commented 4 years ago

@hpmax I assume you recorded a sensord log file and are tracing it with excel, right? I've recorded a log file some time ago and uploaded it here:

https://github.com/Openvario/variod/issues/15

It is nearly at the bottom of the thread. Perhaps you can analyze my file and see if my device also has these spikes. Or just tell me what kind of log you need and I'll see how to get it and send it to you.

hpmax commented 4 years ago

All the plots were done in Octave (GNU copy of Matlab). Probably could have used Excel or LibreOffice, but I like Octave.

The first few plots were from data out of sensord, using the equivalent of the -r command line option (ie raw and unfiltered). The last plot was from "Debug" of the NMEA stream from XCSoar with a Perl script to strip out unnecessary values.

I'm about half done a re-write that will poll the sensors more frequently, regularly, and fully interleaved.

I really hope I'm not trying to chase down a hardware problem in software. I don't understand how others don't seem to be running into this problem.

iglesiasmarianod commented 4 years ago

Great! Thanks! My log is raw sensord -r dump. It has around 3000 samples. I just loaded it in Excel and saw two spikes with a shift in the mean value. This is how it looks.

image

hpmax commented 4 years ago

So that doesn't look like the "blip" to me but it does look like it might be a "spike". The blips were in the magnitude of .3-.5. I never saw one that large and they always started with a positive component. But the spikes that were happening more irregularly do look kind of like that.

Also interestingly, yours looks like it jumps downward afterward.

hpmax commented 4 years ago

I made a few variants of the software to test different sampling strategies. Running the daemon from the command line, I used: /opt/bin/sensord -c /opt/conf/sensord.conf -s -r /home/root/testdata

I've never seen the chips below 30C, so the second order compensation doesn't do squat, but I see little reason not to use it just in case. I'm not even sure why this is an option that has to be turned on. It should just be on: there's virtually no penalty when you don't need it, and the penalty is well worth it when you do.
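For completeness, the second order compensation from the MS5611 datasheet is only a handful of extra operations and only kicks in below 20 degC. A sketch, following the naming of the first-order sketch earlier in this thread rather than sensord's own code:

    #include <stdint.h>

    /* MS5611 second-order temperature compensation (datasheet), applied below 20 degC */
    void ms5611_second_order(int32_t *temp, int64_t *off, int64_t *sens, int32_t dT)
    {
        if (*temp < 2000) {
            int32_t t2    = (int32_t)(((int64_t)dT * dT) >> 31);
            int64_t off2  = 5 * (int64_t)(*temp - 2000) * (*temp - 2000) / 2;
            int64_t sens2 = 5 * (int64_t)(*temp - 2000) * (*temp - 2000) / 4;
            if (*temp < -1500) {   /* very low temperature correction */
                off2  += 7 * (int64_t)(*temp + 1500) * (*temp + 1500);
                sens2 += 11 * (int64_t)(*temp + 1500) * (*temp + 1500) / 2;
            }
            *temp -= t2;
            *off  -= off2;
            *sens -= sens2;
        }
    }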

I took a moving average (+/- 32 samples) and subtracted that out from the original data to try to remove the sensor drift.

My three sampling strategies were:

1) What we do now. Convert both static and TE, wait 12.5ms, read out, then wait another 37.5ms, record values, and then wait 12.5ms and repeat the whole cycle.
2) Alternate between reading out static and then starting a conversion of TE, and reading out TE and then starting a conversion on static. Every 40 cycles I do a simultaneous temperature conversion on both sensors, but other than that, one sensor is converting and the other is sitting idle in each time slot.
3) Do a conversion on both sensors simultaneously, each time slot, except the 40th slot in which you do a temperature conversion.

There should be a benefit to running the sampling algorithm faster. Assuming the noise is uncorrelated and IID, the standard deviation will go down with the square root of the number of samples included. In other words, if you take 4000 samples and average them out into 1000 bins of 4, the standard deviation should be half what it was if you only had 1000 samples taken at 1/4th the rate.

When I took the standard deviation of the static and TE sensors, I got:

1) TE = 1.3915 Pascals, Static = 1.4394
2) TE = 1.4267 Pascals, Static = 2.1465
3) TE = 2.0200 Pascals, Static = 2.0301

Side note: These numbers were calculated based on "digoutP", "sens", and "off" in Octave. Sensord's p number has a slightly higher standard deviation (0.02 - 0.03 Pascals). I assume this is because of quantization error from doing all-integer calculations.

So, now I tried averaging (by 2 in the case of #2, and by 4 in the case of #3):

2) TE = 1.0876 Pascals, Static = 1.8383 (should expect 1.0088 and 1.5178 if uncorrelated)
3) TE = 1.5074 Pascals, Static = 1.5023 (should expect 1.01 and 1.0151 if uncorrelated)

Conclusions:

Clearly, the noise is NOT uncorrelated. Looking at the data, the results with a ~1.4 STD seemed to show an interference pattern that looked a little like envelope modulation. The ~2.1 STD results showed what looked like a triangular waveform superimposed on them. It was very distinct, and fairly large. It is bizarre to me that the two MS5611s behaved differently in sampling variant #2.

I did not see any of the previously described behaviors this time. I collected 92800 samples on variant #1, 28300 samples on variant #2, and 103700 samples on variant #3.

Worst cases were (based on subtracting the moving average):

1) TE = 6.230769/-5.923077 Pascals, Static = 5.846154/-6.446154 Pascals
2) TE = 7.769231/-5.538462 Pascals, Static = 8.492308/-6.615385 Pascals
3) TE = 11.40000/-7.400000 Pascals, Static = 10.569231/-7.661538 Pascals

I'll also point out something interesting about the AMS5915. While datalogging, I have sometimes seen absurd dynamic pressures, -8 billion or something like that. I have to assume this is some sort of numeric nonsense, since the sensor is limited to 50 mbar (5000 Pascals). So I was trying to get to the bottom of it and checked over the ams5915.c code. It appears to be implementing things correctly, but it seems like it could have been implemented more intelligently. I wish I could catch it outputting something that absurd. But it did output something funny: my dynamic pressure is -0.3318 mbar with an output word of 1551. The datasheet provides the formula (1551 - 1638) * 50 / 13107 = Pressure, which is exactly what is calculated -- except that 1638 is supposed to be 0 pressure. Kinda weird. I guess it's possible it works slightly below 0 and this is just some sort of sensor error/drift, but it has been pretty consistent down there.
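As a point of reference, the datasheet transfer function quoted above comes out to the following. This is a sketch assuming the 0-50 mbar part, not necessarily how ams5915.c structures it:

    #include <stdint.h>

    /* AMS5915 (0-50 mbar variant): 1638 counts = 0 mbar, 13107 counts span = 50 mbar */
    static float ams5915_pressure_mbar(uint16_t digoutP)
    {
        return ((float)digoutP - 1638.0f) * 50.0f / 13107.0f;
    }

    /* e.g. digoutP = 1551 gives (1551 - 1638) * 50 / 13107 = -0.33 mbar,
       matching the slightly negative reading described above */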

I have a few more experiments on tap. I'm going to make the following variants:

1) A variant on #1 in which static and TE are both sampled every 4 slots, but with a 2 slot offset between them.
2) A variant on the one above, but with the order of sampling reversed.
3) A variant on #2 in which the order of sampling is reversed.
4) A variant on #3 in which the number of slots before temp sampling is increased.

If anyone else has suggestions on experiments, I'm all ears.

mihu-ov commented 4 years ago

Thank you for putting such an effort into getting the best data out of our sensors! Frankly, I don't understand all the details yet, but from a first try at interpreting your data:

So my current summary would be - please correct me if I am wrong:

hpmax commented 4 years ago

Regarding the AMS5915:

I haven't seen ANY blips on the AMS5915. The negative offset doesn't bother me, as long as this is just an expected negative offset and not a linearity issue or some other unknown issue. Unfortunately, I don't have access to calibrated pressure sources to do a multi-point cal on the sensors. The unreasonable numbers seem like they have to be something wonky in the math. The formula itself appears to be copied almost exactly from the datasheet. When you think about it, given the formula, the actual range should be something on the order of -5 to 20,000, if we assume we are converting a 16 bit integer by subtracting 1638 and then multiplying by 50/13107. The -8 billion reading was a static value reading while sitting there. I could always try every possible input to the formula and see what replicates the result (i.e. each read results in a sensor->digoutP++).

Regarding the MS5611:

I agree that a 1.4 pascal standard deviation seems like a reasonable amount of noise, and may well be the baseline for the sensor, i.e. the only way we can do better is oversampling and filtering -- which has to be done, because if we are sampling at 20Hz, it's reasonable to expect that at some point we could see the aircraft "move" by 30 cm in 50 ms, which would be a booming thermal -- and it only gets worse as altitude increases. Buuuut....

There are multiple things bothering me:

Issue 1) The noise is clearly correlated from sample to sample. This is problematic because most filters (including the Kalman filter) are designed to remove uncorrelated Gaussian noise, and that's not what we're dealing with. In the case of the 2.1 pascal regime, noise went down to 1.84 and 1.50 pascals when averaging by 2 and 4 respectively, but it should have gone down to 1.51 and 1.02. In the case of the 1.4 pascal regime it went down to 1.12 and .785 pascals when it should have gone down to 1.02 and .72 pascals. Based on this, I'd say that the true "noise" in the sensor is closer to 1.2-1.3 pascals and that's the best you're going to get out of it.

I tried averaging n samples at a time, then taking the standard deviation and multiplying by n^.5, and got:

1: 1.4394 (un-averaged)
2: 1.5866
3: 1.5054
4: 1.5717
5: 1.3844
6: 1.5097
7: 1.4335
8: 1.4299
9: 1.3742
10: 1.34
11: 1.3664
12: 1.3957
13: 1.3663

If the noise were un-correlated, we'd expect the numbers to stay roughly constant (note that as the averaging increases, the number of samples decreases, so this is likely to cause the standard deviation to increase slightly, but I have a lot of data so this shouldn't be too prominent). Are there multiple periodicities?
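For anyone who wants to reproduce the check: bin the samples, average each bin, and multiply the std dev of the binned data by sqrt(n). A standalone sketch (reads one value per line from stdin; names and buffer size are arbitrary):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        static double x[200000];
        int len = 0;
        while (len < 200000 && scanf("%lf", &x[len]) == 1) len++;

        /* For white (uncorrelated) noise this number stays roughly constant
           across n; growth or oscillation indicates correlated noise. */
        for (int n = 1; n <= 13; n++) {
            double sum = 0, sumsq = 0;
            int bins = len / n;
            for (int b = 0; b < bins; b++) {
                double avg = 0;
                for (int i = 0; i < n; i++) avg += x[b * n + i];
                avg /= n;
                sum += avg;
                sumsq += avg * avg;
            }
            double var = sumsq / bins - (sum / bins) * (sum / bins);
            printf("%d: %.4f\n", n, sqrt(var > 0 ? var : 0) * sqrt((double)n));
        }
        return 0;
    }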

I took an FFT of the data... and all I can say is WOW!

Noise is 100% white with two exceptions...

Exception 1) There are huge noise spikes at certain frequencies. They were present in ALL recorded MS5611 data, no matter how I sampled. I'd estimate it's showing up at 1/10th the sampling rate and every harmonic up to Nyquist. I'm not sure if it's aliasing beyond the Nyquist. I'm a little suspicious that this is related to reading the temperature. I am going to try adjusting the sampling rate and how often the temperature is read.

Exception 2) Near the Nyquist rate, noise noticeably increases. The noise increase is more prominent when sampling faster, meaning that sampling faster in order to reduce noise may be counterproductive. I'm not convinced of this though. When I compare the spectrum of the TE (1.4267) to the static (2.1465) in variant #2, it appears the ONLY difference is the magnitude of the periodic noise spikes. The rest of the spectrum might actually be quieter in the static than the TE.

I tried notching these frequencies out by hand (i.e. zeroing out any frequency component whose magnitude was above a threshold) -- and depending on how aggressive I got, the results on TE #1 were a standard deviation of between 1.23 and 1.25 pascals. Comparing Static #1 to TE #1, both saw reductions of 12.6-12.7% in standard deviation. I tried the same technique on Static #2 and got 1.3041. Averaging by a factor of 2 reduced it to .967, which isn't quite as good as I would have hoped, but that's still a 50% noise reduction over current results with no reduction in bandwidth -- I'll take that! I tried again on Static #3 and noise reduced to 1.48 pascals; averaging by 4 reduced it to .932. Clearly, there's still correlated noise in it. Looking at the spectrum, you can still see noise spikes on variant #3. But oh well, it's still a huge freaking improvement.

But what is this periodic noise to begin with? It really disturbs me that the magnitude of the spikes seems to change almost randomly. I see no obvious reason. Variant #2 was particularly disturbing: both sensors were seeing identical spikes, but one sensor's magnitude was much higher than the other's -- yet after reducing the sampling frequency they were back to nearly the same. FWIW, I went back and checked the FFT of the AMS5915; it was mostly white with a tinge of pink (although it doesn't look like flicker noise), but no spikes. The AMS5915 is showing a standard deviation of 16.7 millipascals. That might be a big deal if you aren't moving, but at a reasonable stall speed this translates to a worst case IAS noise error of .01 knots, and it only gets smaller the faster you go. I'd be far more concerned about overall accuracy.

Issue 2)

In addition to having no explanation for the periodic noise, including why running through systemctl vs not, or -f vs not -f, seemed to make a difference, there is also the question of the larger non-periodic/random spikes. I ran it for hours and saw nothing, but both of us have previously seen them.

If someone wants to volunteer to capture data using variants 1 and 2, I can publish them, and we can see if the same periodic noise is there and if it's at the same frequency.

hpmax commented 4 years ago

New update:

AMS5915: The number I saw was actually something on the order of 8.1e36. I was unable to replicate it using ANY digoutP value, and I also tried printing it uninitialized. All was fine. It's possible it was something stupid in the actual print statement in that particular version of my code.

MS5611: Blips are back. All my sensord variants are showing the blips. This includes running the exact same executables in the same way. Blips are on the order of 10-50 Pascals, well outside the range of 1.5 pascal standard deviation. I have no explanation.

Noise: I have not looked in detail yet, but I have run several variants on what I ran yesterday. The spike magnitudes vary with the usleep intervals. Generally, longer intervals give lower std dev. It's possible there is some sort of relaxation time after a conversion takes place. 12.5 ms may actually be fine. I noticed noise went up when I switched to 10ms, but not much change from 12.5ms to 20. What didn't change was the distance between the spikes in the FFT. Absolutely identical. If this were correlated to some noise event that was on the board that was not directly related to operating the MS5611, I'd think the distance would change. I also tried adjusting the number of pressure reads before a temperature read. That did shift the spikes.

I saw someone else comment (on a different project) that the MS5611 seems sensitive to timing. Any aperiodicity may be bad. I'm going to try alternating pressure and temperature reads. It's also possible that the timing could be tightened, for example by recording the time right before the pressure read, then checking the time before the usleep and sleeping for (pressure_read time + 12.5ms - current_time). Of course, usleep isn't super accurate, so you may have to come back a little early and screw around until you're at the appropriate time.

hpmax commented 4 years ago

New update:

I still don't know what causes blips, but I did find a variant which reduces noise... a LOT:

I sample pressure from one MS5611 and temperature from the other, and I switch every 12.5 ms time slot. So every 4 time slots, you get 2 pressure samples from each MS5611. Then average the two samples.
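Spelled out per 12.5 ms slot (the TE/static assignment here is arbitrary, just to make the pattern visible), the schedule is roughly:

    slot 0: read both sensors; start pressure (D1) on TE,     start temperature (D2) on static
    slot 1: read both sensors; start pressure (D1) on static, start temperature (D2) on TE
    slot 2: read both sensors; start pressure (D1) on TE,     start temperature (D2) on static
    slot 3: read both sensors; start pressure (D1) on static, start temperature (D2) on TE

So over any 4 slots each sensor delivers 2 pressure and 2 temperature conversions, and the 2 pressure samples get averaged.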

Pre-averaging, I have about 1.26 (TE) and 1.326 (static) pascals of noise. After averaging I get 0.88 and 0.93 pascals of noise. As a side bonus, dynamic noise drops from about 1 pascal to .78.

Running an FFT on the data shows that there is one clear noise frequency and two harmonics. The harmonics are louder than the main frequency, which is pretty unusual. The 2nd harmonic is probably the only real contributor to the noise, and even then it's not much. Zeroing out all the noise spikes I could find only reduced the std dev by 0.25% (i.e. not worth caring about).

hpmax commented 4 years ago

I had a thought about the noise and why specifically this sampling strategy may be the right one (this doesn't address the blips). I think whenever you do a temperature conversion it kicks an impulse into the output of the pressure sensor that decays over time. The reason variant #2 was behaving so oddly is that on time slot 40, I do a temperature conversion on both sensors. The next time slot I was doing a conversion on the static sensor, and the one after that the TE, then the static, then the TE, etc. So if you think of the impulse as adding X*e^(-t/tau) pascals, where t = 12.5ms * n, you can pretty clearly see that sampling on slots 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, etc. is going to be a lot different (and worse) than 2, 4, 6, 8, 10, 12. The original algorithm was better, but still not great, since you do the temp conversion on 3 and then the pressure conversions on 5, 9, 13, 17, etc. I bet that's the noise... the decaying exponential. By always sampling and adding X*e^(-12.5ms/tau), you are going to get a larger offset, but it hopefully should be a constant offset. The key, however, would be the reduction in timing jitter.

hpmax commented 4 years ago

Made some more progress... I decided to try to better monitor and control the timing. So each time through the pressure_measurement_handler I do the following, immediately prior to initiating the ms5611_start:

    if (meas_counter&1) {
        ms5611_read_pressure(&sensor2);
        ms5611_read_temp(&sensor1);
        do {
            usleep(1);
            clock_gettime(CLOCK_REALTIME,&cur_time);
        } while (DELTA_TIME_US(cur_time,prev_time)<20000);
        ms5611_start_pressure(&sensor1);
        ms5611_start_temp(&sensor2);
    } else {
        ms5611_read_pressure(&sensor1);
        ms5611_read_temp(&sensor2);
        do {
            usleep(1);
            clock_gettime(CLOCK_REALTIME,&cur_time);
        } while (DELTA_TIME_US(cur_time,prev_time)<20000);
        ms5611_start_pressure(&sensor2);
        ms5611_start_temp(&sensor1);
        /* LOG/PROCESS DATA */
    }
    prev_time=cur_time;
    meas_counter++;


And here's the log:

    99568.00000 99613.00000 -0.33188 20053.38672  20009.51758
    99568.00000 99614.00000 -0.33188 20179.50977  20014.52539
    99568.00000 99615.00000 -0.33188 20030.25977  20000.45703
    99566.00000 99614.00000 -0.34333 20060.53906  20009.04102
    99567.00000 99615.00000 -0.33188 20047.66406  20003.31836
    99620.00000 99614.00000 -0.32044 346722.84375 20020.24609
    99560.00000 99607.00000 -0.32044 20090.10352  20070.55273
    99558.00000 99604.00000 -0.29755 20047.90234  20061.97070
    99559.00000 99604.00000 -0.30900 20030.49805  20061.73047
    99561.00000 99607.00000 -0.30900 20007.13281  20007.84961
    99561.00000 99608.00000 -0.30900 20000.69531  20048.14062
    99563.00000 99606.00000 -0.32044 20077.94336  20036.69727

Column 1 is TE in Pascals
Column 2 is Static in Pascals
Column 3 is Dynamic in mbar
Column 4 is DELTA_TIME_US when meas_counter is odd
Column 5 is DELTA_TIME_US when meas_counter is even

Column 5 varied between 20ms and 30ms, although it was mostly close to 20ms. I saw no pressure errors as a result. Column 4 had 3 values between 300 and 415ms, each associated with a large blip. The remainder were between 20 and 40ms; there MAY have been small blips, but it's hard to tell them apart from regular noise.

I wish I understood this better, but at this point I'm thinking of the following potential solutions:

1) nice -15 the process and hope this reduces the chances of the drastic timing faults.
2) Figure out a way to use interrupts or something to enforce more accurate timing.
3) Recognize that there was a timing fault (i.e. a delay of more than 40ms -- or whatever) and if so try to reset the thing, i.e. do not do a start_pressure for at least 2 measurement counts which aren't exceeding the 40ms (see the sketch below).
4) The proper fix would be to use hardware to ensure the timing is proper.
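A sketch of one way to realize option 3: flag the fault and discard output until a couple of clean cycles have passed. The names and thresholds are illustrative (40 ms and 2 cycles, as discussed above); this is not existing sensord code:

    #define MAX_SLOT_US   40000L  /* treat anything over 40 ms as a timing fault */
    #define SETTLE_CYCLES 2       /* discard this many clean cycles after a fault */

    static int settle = 0;

    /* called once per measurement cycle with the measured slot length in microseconds;
       returns 1 if the sample should be used, 0 if it should be discarded */
    int sample_is_valid(long slot_us)
    {
        if (slot_us > MAX_SLOT_US) {
            settle = SETTLE_CYCLES;   /* fault: discard this sample and the next few */
            return 0;
        }
        if (settle > 0) {
            settle--;                 /* still recovering from the fault */
            return 0;
        }
        return 1;                     /* timing was OK, keep the sample */
    }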

mihu-ov commented 4 years ago

Re proper fix / use hardware to ensure timing: This was considered in the beginning, but at the time we decided to go with the "good enough" solution of reading the sensors with the Cubieboard. The main question is probably what we could gain with a dedicated microcontroller reading the pressure sensors. Is there a clear noise reduction in sensor data if we achieve accurate timing, and is there an appreciable effect on vario data after the Kalman filter? Do you have an Arduino / ESP32 or similar available to do a comparison?

hpmax commented 4 years ago

Without additional hardware I have improved the std deviation of the noise to about .9 pascals. That's pretty good, I doubt I could do too much better even with additional hardware.

The 300-500ms delays seem to cause a high spike followed by a low spike and then an upward ramp back to nominal, which can take as much as a third or half a second. I can recognize the large delay and just blank the output for a while, but it might take as much as half a second to totally clear out, after the output was already out of commission for half a second -- and I don't necessarily know when it has recovered (sort of). The high glitch doesn't surprise me: if you measure the temperature and wait half a second, the part might cool off enough to screw up the reading. You might even be able to reduce the glitch by just using the next temperature reading rather than the previous. But it's easier to just ignore the high glitch. What I don't understand is why it drops down and then ramps up despite the fact that it now has valid temp data. It is conceivable to me that it has low thermal mass, and after the initial pause it takes some period of time for the temperature to re-stabilize now that the part is being exercised. If that's the case, I could monitor the temperature and assume it's back to working once the cycle-to-cycle delta temp gets small.

I find it interesting that the major timing fault ALWAYS happens on val1 -- and I tried nice, and it didn't help. The combination of the two makes me suspect the time delay is occurring in the log/process data step and NOT because of some other process. It's almost a discrete thing. Makes me suspect it's a write/file system issue. It's possible that if no file write occurred AND you niced it, it would work almost always. I'd be happy with that! That's another experiment to try.

By the way, I found out more about the AMS5915 issue. Apparently, the ridiculous number was an offset value read out of the EEPROM -- which seems to only happen when you run in cmd_line -f mode. If you run without it, it doesn't load that calibration value. I'm not sure why it's only reading the EEPROM when it's in -f mode, or why the EEPROM value is that crazy.

kedder commented 4 years ago

I like the idea of discarding the measurement if we can detect that it was not timed right. But that would only work if we can confirm that bad timing is the cause of the issue. We can confirm that with a simple experiment: add a random artificial delay during conversion. If we get a lot of spikes as a result - that might be it.

why it's only reading the EEPROM when it's in -f mode, or why the EEPROM value is that crazy.

I'm pretty sure it doesn't do anything that is not in the source code. Is there any condition like that in sources?

hpmax commented 4 years ago

I can try deliberately adding in delays and experimenting with that. But so far, I think I can already reliably detect and screen it out. Like I said, looking at the data, it stands out like a sore thumb. If my hypothesis is correct though, a long delay like this will result in both sensors reading low for a couple hundred ms afterward, and there's nothing you can do to fix it once the "long" delay occurs. It's also possible I can screen that out -- or perhaps there could be some sort of compensation, but screening it out means you're blind for that period of time, and you were already blinded for a few hundred ms. I think the real key is to figure out the source of the delay and eliminate it -- and like I said, I suspect it's some sort of activity involving IO, the disk, the network or something else that blocks.

I wish we could do some sort of interrupt-driven approach, or failing that, a super-high-priority thread that JUST sends out the start conversions, reads the results, and stores them in shared memory, perhaps a circular buffer. Of course, unless the threads can block each other, this does risk a race condition of some sort.
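A sketch of that thread idea, assuming POSIX threads and a mutex-protected ring buffer; all names are illustrative and none of this exists in sensord today:

    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>

    #define RING_SIZE 64

    struct sample { uint32_t d1_te, d1_static, d2_te, d2_static; };

    static struct sample ring[RING_SIZE];
    static unsigned head;                 /* written only by the sampling thread */
    static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

    /* high-priority thread: only talks to the sensors and fills the ring */
    static void *sampler(void *arg)
    {
        (void)arg;
        struct sched_param sp = { .sched_priority = 50 };
        /* needs root or CAP_SYS_NICE to take effect */
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

        for (;;) {
            struct sample s = {0};
            /* ... start conversions, sleep one 12.5 ms slot, read results into s ... */
            pthread_mutex_lock(&ring_lock);
            ring[head % RING_SIZE] = s;
            head++;
            pthread_mutex_unlock(&ring_lock);
        }
        return NULL;
    }

    /* The main loop (filtering, NMEA output, logging) stays at normal priority
       and only ever reads completed entries out of the ring under the same lock. */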

Regarding the EEPROM issue: I couldn't figure out why it was only reading the EEPROM when using -f, but I don't think that's actually the case. The EEPROM read occurs after it "daemonizes", so it probably dumps the EEPROM data to stdout in the child process and I never see it when running without -f. That said, when running with "-f" I see a crazy offset value for "total" (which I think is actually dynamic/pitot) displayed on stdout, and then all the logged data reflects that as well (8.1e36 or whatever), and when I do not use -f the value is -.32 mbar or thereabouts... So something weird is going on.

hpmax commented 4 years ago

I'm going to try some experiments when I get home from work. However... looking at the schematic, it looks like you might be able to just wire up an Arduino to the back of it and have the Arduino do all the data collection, interrupt driven... and then report it to sensord. That should drastically clean up the timing to the point that this would be a non-issue with no extra hardware hacking. Just an RJ45 into the back. Ugly hack, but cheap and probably effective.

hpmax commented 4 years ago

I added code that did the following:

    static float delay = 25000;

    if (meas_counter&1) {
        /* do pressure/temp readings */
    } else {
        /* do pressure/temp readings */
        if (meas_counter%1000==0) {
            usleep(delay);
            delay+=250000;
        }
        /* do_filtering */
    }

Sure enough, deliberately putting in an extra delay causes the problem. The longer the delay, the bigger the problem. I also started monitoring temperature. The temperature drops along with the pressure. I do NOT think this is a real thermal issue. I think the ADC's reference is probably moving a bit over time, and it's important to exercise it at regular intervals in order to keep it equally charged. My guess is that the sample rate itself will essentially cause an offset, but so long as the sample rate is constant it probably doesn't result in a huge problem.

It occurred to me that the output movement might be caused by the temperature drop. Turns out both pressure and temperature move. I tried holding temperature constant when there was a noticeable drop, and it resulted in a positive offset instead of a negative one.

The good news is that with the exception of this issue, the temperature reading is very stable. I suspect this means you could use the temperature information to A) detect when the problem is gone (i.e. the temperature recovers to the pre-timing glitch level), or B) possibly to actually compensate (i.e. if post-glitch D2 < pre-glitch D2 then multiply D1 by pre-glitch D2/current D2). This isn't a perfect solution because there's a time delay between D1 and D2 measurements, but you might be able to use the other sensor's D1 since that should be nearly contemporaneous. But it's definitely worth checking. Will come back to this.

I am now attempting to determine if the timing glitch is caused by writing data. I'm running for several hours and only writing data if there is a timing glitch. Hopefully when I kill the job it will flush the cache.

mihu-ov commented 4 years ago

Looking at the schematic, it looks like you might be able to just wire up an Arduino to the back of it and have the Arduino do all the data collection, interrupt driven...

Timo built a tiny microprocessor board that plugs in between sensorboard and adapterboard a few years ago. Today, maybe with https://github.com/qubolino/flybaby we could make use of the MPU9150 as well?

hpmax commented 4 years ago

John Wells (jfwells) made a variant of sensord which incorporates the MPU9150. Honestly though, for a quick hack, running an Arduino out the I2C interface seems like the simplest solution. A good solution would be to design a new sensorboard that incorporated something with guaranteed timing (CPLD or FPGA perhaps?) I suppose an Arduino is good enough, but an FPGA might reduce timing noise a little.

That said, feast your eyes on this: compensation

What I did here was to deliberately inject delay (as described in the previous post). What you are looking at in orange is the D1 trace on the TE sensor after the delay (i.e. D1Ti). On the blue trace, I did D1Ti + (D2Si-D2S0)*K, where D2S0 is the last temperature reading prior to the glitch and K was 0.297

K was chosen to minimize the standard deviation. It's possible this is something that would need to be calibrated from one sensor to another, and it's also possible it's temperature dependent... I did check a different time delay (that graph was from 190ms); I also looked at 92ms. I'm happy to report the compensation was pretty similarly effective.
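In code, the compensation amounts to the following (K = 0.297 from the fit above; variable names are illustrative, not from sensord):

    /* d1_raw:         raw D1 (pressure) from the current, post-fault cycle
       d2_now:         raw D2 (temperature) from the current cycle -- taken from the
                       OTHER MS5611, as in the plot above (static D2 compensating TE D1)
       d2_pre_glitch:  last good D2 before the timing fault                      */
    static double compensate_d1(double d1_raw, double d2_now, double d2_pre_glitch)
    {
        const double K = 0.297;   /* minimized std dev on my sensor; likely per-sensor */
        return d1_raw + (d2_now - d2_pre_glitch) * K;
    }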

In other news, after changing the data logging to only output data if the delay was > 25ms (I had adjusted the delay to 20ms, and actually did some other stuff to improve timing accuracy, although it would NOT fix a problem if there was an excessive delay when logging data), I wound up with the following statistics after running for several hours:

Minimum delay: 20.00046 ms
Max delay: 44.273 ms
Mean delay: 24 ms
Std Deviation: 4.3 ms (kind of meaningless since it's clearly not Gaussian)

hist

I'm going to try to incorporate the compensation into sensord and see what happens.

hpmax commented 4 years ago

As a side note, I went back through the datalog that produced the data that I compensated and took a look at D2 as a function of time: bigview

You can clearly see the D1 (pressure) spikes that were created by the injected delay, and their magnitude increases as the delay increases, at least up until a point where it seems like it doesn't anymore. But let's take a look at that area in the center with the smaller spikes, which came from some other delay that I didn't deliberately add (I presume this is "disk" access):

zoomed

The deliberate spikes are the ones at 6000 and 6500. The others were at:

6112: 21.4ms
6123: 107ms
6175: 22.2ms
6217: 21.2ms
6237: 21.7ms
6287: 25ms
6362: 25.4ms
6382: 21.4ms
6391: 139ms

See if you can see each one in the data... The 21ms delays are hard to see, but I can make them out. The 25ms ones are readily visible.

Here's the resulting TE pressure: te6000

Here's the resulting static pressure: static6000

Looking at these, it's hard for me to see the 21ms delays -- but I think they are there. However, I can make out the 25ms delays.

At this point, my conclusion is that these sensors can potentially detect somewhere between .25 and .5 ms of jitter in the timing; and looking at the raw values, you can do a lot better than that.

iglesiasmarianod commented 4 years ago

I've been reading all your work, it is great. Thanks! I wondered about the use of a microcontroller to drive the sensorboard and a few questions hit me.

1) Quick and dirty: we can make an intermediate board connecting to the IDC cable between adapterboard and sensorboard. The needed signals can be processed there; audio and power can pass through untouched.
2) We will have firmware for that micro. How can we update that firmware once the board is installed? Is it possible to update it from the cubieboard itself? This would make it a lot easier for non-PC guys. I think the adapterboard has two unused GPIO ports.
3) Backwards compatibility. If we modify the connection between sensorboard and adapterboard, is there a way to keep it compatible with unmodified OpenVarios? This would ensure that newer versions of the software can be used without hassle on older hardware.

I can help with board design too.

hpmax commented 4 years ago

Improved sampling scheme, but no compensation for timing jitter:

Note: Blue is TE, Orange is static, but the behavior is arbitrary and solely dependent on when the noise jitter occurs. My deliberate delay is designed to have the biggest impact on TE, but it's essentially completely reciprocal.

uncompensated

Different data set, but similar timing jitter, with compensation: compensated

I have added timing noise to the other half, and am torture testing it now with more randomized delays... Will run it overnight, but so far it's holding up great!

As for a hardware solution (if you still think it's necessary), I think you could probably hang an Arduino off the I2C RJ45 port in the back. The only thing I'm unclear on is whether the AVR could feed the Cubieboard asynchronously. Essentially, we'd want the AVR to be the master and the Cubieboard to be the slave.

linuxianer99 commented 4 years ago

I agree with the hardware based solution. I also did some work on a project like this in the past, but skipped the approach because of more important openvario things ... ;-)

But a short summary:

Option 1: Have a uC board internally connected between the IDC from the Mainboard and the sensorboard.

Option 2: Use the sensorboard with an additional uC board as an external device interfacing the RJ45 IGC connector.

I would prefer option 2!! Using an ESP32 (which has plenty of processing power - dual core!!) and a nice ecosystem in regard to software libs, etc ... It is very cheap also ...

Last time I started with option 1, but for the time being ... option 2 is the better one now ;)

Any suggestions, comments ??

linuxianer99 commented 4 years ago

another thing:

Maybe it makes sense to concentrate the effort on the hardware solution instead of improving sensord timing? @hpmax: You did a lot regarding timing, filters, etc ... If you can use an ESP32 prototype board for the software, I can do the final hardware in parallel .... Does that sound like a plan for you???

hpmax commented 4 years ago

Honestly, at this point, I'm not sure a hardware solution is necessary. I'm pretty happy with what I have with just software tweaks. I still need to do more testing, and a little cleanup, but I think I have something mostly working. I should have something by the weekend. And there are probably a few other tweaks I could do to improve it further. I was hammering it hard with a lot of timing jitter. It's not actually going to experience anything like this in real life.

latest

Note: I hope this is accurate. I did find it disturbing that the traces seemed to be getting closer to one another over time. That was about 52 megabytes of data. The SD card filled up and it stopped recording after that.

Admittedly, a hardware solution would unquestionably be better, but I think the best thing to do would be a sensorboard re-design, and that would include ditching the MPU9150 for a modern device, and hopefully actually using it. And while we are at it, is the MS5611 the best part out there? I can't tell how it compares to Bosch's BMP388, for instance.

As for a quick and dirty solution. I would think the easiest thing to do would be to wire an Arduino into the I2C port in the back. The only thing I can think of that would be challenging is that I think you'd need to turn the Cubieboard into a bus slave and have the AVR initiate communications. I assume I2C allows only one master, otherwise you could get contention?

As for an ESP32: I'm not familiar with them, but they seem decent. I am concerned about processing power if the objective is to offload everything onto it from the Cubieboard... It would be appealing to have, effectively, an IMU that everything plugs into, with MEMS sensors as well as the pressure sensors and GPS attached to it... Spit out NMEA to the Cubieboard, and possibly to external devices so it could drive a vario. If it's generating audio it would be nice to not do that in the main processor, but I'm not sure that's avoidable.

Also, is any of this going to require changes to xcsoar?

I'll go along with whatever, but for now, I'm going to finish my software only solution. It's almost there. Plus, I am hoping to integrate an improved Kalman filter within the next week or two.

iglesiasmarianod commented 4 years ago

I believe the software upgrade gives us time to develop a hardware solution and is the way to go.

As for the hardware solution I thought about IDC cable option and came up with this:

1) The Cubieboard is not going to use I2C anymore, so perhaps we can change the DTS file and convert the I2C port into a serial port (ttyS4), making the IDC option easier. Like having an RJ45 IGC port internally.
2) Perhaps we can keep the I2C port for backwards compatibility and use one of the other two ports the adapterboard has as ttyS4, giving users the option to upgrade or not in a simple way. The Gerbers have the holes for them but no connector is soldered there. This will require soldering the IDC connector.

I modified my sensorboard to use the MPU9250 as the IMU sensor. Don't know if you think it's a good choice.