Safecast / bGeigieNanoKit

bGeigieNano is a kit version of the bGeigie mobile survey geiger counter designed to fit into a Pelican Micro Case 1010.
https://safecast.org/devices/bgeigie-nano/

corruption in the date portion of log files #40

Open thinrope opened 8 years ago

thinrope commented 8 years ago

Looking at the raw log files from Nanos, I see the invalid date string "2000-00-00T" in many places. About 244K of 40.5M lines, or 0.6%, are corrupt. The problem is present in a variety of firmware versions; here are the top 3 of 17 affected versions (the numbers are counts of on-off log segments):

format=1.3.5nano,207
format=1.2.8nano,127
format=1.3.4nano,118

It affects 99 devices, some more than others; these are the top 10:

2140,25
2327,20
2004,11
2022,11
2303,11
2431,11
2001,8
1009,7
2320,7
2326,7

I looked at the code but couldn't spot anything obvious, and there is no literal string 2000. However, there is quite a lot of "magic hackery" with the years, e.g.:
https://github.com/Safecast/bGeigieNanoKit/blob/8559e91d1eb2793db86053bf9c1ec2a5935b57e2/bGeigieNano.ino#L78
https://github.com/Safecast/bGeigieNanoKit/blob/8559e91d1eb2793db86053bf9c1ec2a5935b57e2/bGeigieNano.ino#L852
And finally:
https://github.com/Safecast/bGeigieNanoKit/blob/8559e91d1eb2793db86053bf9c1ec2a5935b57e2/TinyGPS.cpp#L422

Recent drives with this problem (from May 2016) are 22969, 22975 and 22980 (devices 1207 and 2001). Both devices run "format=1.3.4nano" firmware. For those 3 drives, this is the number of points on each date:

1987-02-00  7
1987-05-00  43
1987-28-00  14
2000-00-00  462
2003-00-00  42
2004-00-00  13
2006-00-00  38
2010-00-00  2
2016-03-27  4771
2016-05-01  1001
2016-05-03  415
2016-05-04  3364
2016-05-05  4213
2030-16-00  42
2035-04-00  45
2051-05-00  10
2052-00-00  21
2061-00-00  55
2080-01-05  5
2080-01-06  11
2080-01-10  3

While the 2000-00-00 case is the most common one, it really smells like memory corruption to me.
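A per-date tally like the one above can be produced with a few lines of Python. This is only a sketch: it assumes each log line contains an ISO-style timestamp with a "YYYY-MM-DDT" date part, and the sample lines are made up for illustration, not real log data.

```python
import re
from collections import Counter

# Matches the date part of an ISO-style timestamp such as "2016-05-01T".
# Deliberately loose: it also matches impossible dates like "2000-00-00T".
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})T")


def count_points_per_date(lines):
    """Return a Counter mapping 'YYYY-MM-DD' to the number of lines carrying it."""
    counts = Counter()
    for line in lines:
        match = DATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines (hypothetical layout), only to show the call.
    sample = [
        "dev,2016-05-01T10:00:00Z,...",
        "dev,2016-05-01T10:00:05Z,...",
        "dev,2000-00-00T00:00:00Z,...",
    ]
    for day, n in sorted(count_points_per_date(sample).items()):
        print(day, n)
```

Lines without a recognizable date are simply skipped, so they would need to be counted separately if they matter.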

thinrope commented 8 years ago

Just looking at the dates: 40325908 lines were extracted, and 475066 of them have invalid dates (including dates before 3/11 and after today), which gives 1.17% corrupt data due to date problems alone. Looking at the most frequent invalid dates, all have "00-00" in the month-day section, and those account for 446908, or 94%, of the bad dates.
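A rough sketch of how such a classification could be done is below. The function name and the category labels are hypothetical, and the lower bound of 2011-03-11 used in the example is my reading of "3/11", so treat it as an assumption.

```python
from datetime import date


def classify_date(text, earliest, latest):
    """Classify a 'YYYY-MM-DD' string.

    Returns one of: 'ok', 'zero_month_day', 'not_a_date', 'out_of_range'.
    """
    try:
        year, month, day = (int(part) for part in text.split("-"))
    except ValueError:
        # Wrong number of fields or non-numeric content.
        return "not_a_date"
    if month == 0 and day == 0:
        # The dominant failure pattern, e.g. "2000-00-00".
        return "zero_month_day"
    try:
        parsed = date(year, month, day)
    except ValueError:
        # Impossible calendar date, e.g. month 16 or day 0.
        return "not_a_date"
    if parsed < earliest or parsed > latest:
        return "out_of_range"
    return "ok"


if __name__ == "__main__":
    earliest = date(2011, 3, 11)  # assumed meaning of "3/11"
    latest = date.today()
    for text in ["2016-05-04", "2000-00-00", "2030-16-00", "1987-05-00"]:
        print(text, classify_date(text, earliest, latest))
```

Keeping "00-00" as its own category makes it easy to see how much of the bad data follows that one pattern.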

So maybe not memory corruption after all, but bad logic somewhere :-|

fakufaku commented 8 years ago

Hi Kalin, part of the explanation (the last magic) is that the GPS year is given as two digits (god knows why) that start at 80 (1980). Some GPS modules default to 80 before they first acquire a date. I am guessing that another part of the explanation is that different modules default to different years (just maybe). For the rest, maybe a memory leak? I wouldn't be surprised given how packed the firmware is.
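To illustrate the two-digit year issue, here is a minimal sketch of a pivot-based expansion. The function name and the exact rule are assumptions for illustration; I have not checked them against what the firmware actually does.

```python
def expand_two_digit_year(yy, pivot=80):
    """Expand a two-digit year using a pivot (assumed rule, not the firmware's).

    Values at or above the pivot map to 19xx, values below it to 20xx.
    """
    if not 0 <= yy <= 99:
        raise ValueError("two-digit year must be in 0..99, got %r" % yy)
    return 1900 + yy if yy >= pivot else 2000 + yy


if __name__ == "__main__":
    for yy in (80, 87, 0, 16):
        print(yy, expand_two_digit_year(yy))
```

Under this assumed rule, a year field that is still 0 expands to 2000 and a module default of 80 expands to 1980, so very different-looking years can come from small raw values.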

thinrope commented 8 years ago

Yes, I know the NMEA mis-design ;-) I also suspected something like memory corruption, but seeing that in 94% of the cases we end up with 00 in month/day, I am 94% sure it is a logic mistake, possibly a missing error check or something, since we initialize those values to 0 before passing them by reference. If I had to debug that, I'd initialize them to 99 or 33 instead and look for such values, as well as 00.
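The idea can be sketched in Python as a simulation (the real code is Arduino C++, and every name here is hypothetical). A dict stands in for the by-reference output variables, and the parser expects the NMEA ddmmyy date field.

```python
SENTINEL = 99  # deliberately impossible month/day value, instead of 0


def crack_date(raw, out):
    """Hypothetical parser: fills `out` only on success, returns True/False."""
    if len(raw) != 6 or not (raw.isascii() and raw.isdigit()):
        return False  # `out` is left untouched on failure
    out["day"] = int(raw[0:2])
    out["month"] = int(raw[2:4])
    out["year"] = int(raw[4:6])
    return True


def read_date(raw, check_result):
    """Initialize outputs to the sentinel, then parse.

    With check_result=False the return value is ignored, which mimics the
    suspected bug: the initial values leak straight into the output.
    """
    out = {"year": SENTINEL, "month": SENTINEL, "day": SENTINEL}
    ok = crack_date(raw, out)
    if check_result and not ok:
        return None
    return out


if __name__ == "__main__":
    print(read_date("050516", check_result=False))  # valid ddmmyy field
    print(read_date("", check_result=False))        # no fix yet: sentinels leak
    print(read_date("", check_result=True))         # failure is caught
```

If 99s then show up in the logs where 00s used to be, the values are leaking from initialization rather than from corrupted memory.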