PX4 / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License
8.39k stars 13.46k forks source link

Longer logging ends in error of FAT32 filesystem, Corrupted SD card #18672

Closed roman-dvorak closed 2 years ago

roman-dvorak commented 2 years ago

Hi, we repeatedly have a problem with logging to the SD card. It happened to us several times that the card was corrupted during the inflight logging. The SD card or filesystem during read by computer seems to be flagged as read-only. And the log file was not complete.

For the first time, it happened to us on 7. 10. 2021 with takeoff mode (#18582), which was based on a few days old PX4 master version. At that time, I thought it was due to a large number of debug text logs (PX4_INFO("...")). In the meantime, we tried to use a new (32GB SD card)[], format the card (completely erased) to FAT32. And the problem was repeated several times. Therefore we didn't have a log that we really needed.. It does not matter that we create a smaller 4GB FAT32 volume on SDcard. The situation was almost the same, corrupted filesystem with missing logs and sometimes other files which were previously on SDcard.

Next time, I limited number of PX4_INFO messages to the minimum number of these logs. And the SD card was corrupted again. With such FW. But it was the Firmware with my modifications.

The first used card was: obrazek

After a debate in (#18419), we decided to test this “airspeed mode” PR with our second autogyro (FMU v5, CUAV NANO) and this was almost unmodified FW (f814a52). But during the flight, the SDcard filesystem corruption happened again. This piece of autogyro (the same autopilot, the same SDcard, ..) was previously used in the same configuration with several months old master FW. Without any signs of this problem.

This always happened after a few seconds of flight. I don’t see any clear cause for this behaviour in the same fragments of the logs, retrieved from SDcard after fsck recovery, containing the takeoff

obrazek obrazek

I tried to repeat this on the ground. But I was unsuccessful. Logging worked for a very long time.

From these experiments, I conclude that this is a problem in Firmware (I don’t know at what level).

During the first attempts (when this issue occurred) I used the internal reader of my laptop. During the next experiments, I used an external uSD card reader from Kingstone. But the result was the same.

dagar commented 2 years ago

Thanks for the detailed report. We're currently investigating this as a blocking issue for the next release (v1.13).

roman-dvorak commented 2 years ago

Hi, I've been trying to make more attempts to get to the state where you can log.

  1. I turned on the complete UAV assembly. Pressed safety switch and I forced arm through QGC command line.
  2. Logging on (confirmed by changing log file size)
  3. I waited a while and the logfile stopped growing. It took only several minutes.

Subsequently, in nuttx dmesg was: obrazek

In this state, it is not possible to restart logger: obrazek It writes that it only records the mission log.

I was able to repeat this several times (4x). Some logs could be uploaded to review.px4.io and some not (without any error message). However, all could be displayed in the plotjuggler. Some I was unable to get due to a corrupted SD card.

log files, that ended "in-flight" (in armd state). 15_22_37.zip 15_41_42.zip

Top (in time of corrupted logger): obrazek obrazek

There is an high CPU usage. By log_writer_file.

Sometimes I was not able to connect the card to the computer (ubuntu 20.04). With these errors: obrazek obrazek

Tested with these cards: obrazek

obrazek obrazek

dagar commented 2 years ago

@roman-dvorak we may have a fix, could you please try https://github.com/PX4/PX4-Autopilot/pull/18687?

To be paranoid and convince myself the card and file system is in good shape I've been exhaustively testing all blocks. https://gist.github.com/dagar/8d002760ee1f8e76d1bd88c16a469d24#file-linux_sdcard_test_and_format-sh-L8

roman-dvorak commented 2 years ago

Booth SD cards look well. It passed the test without any error. Without any bad sector.

SanDisc Ultra 16 GB:

Object /org/freedesktop/UDisks2/block_devices/sdf is not a mountable filesystem.

*****  BADBLOCK *****

Hledají se špatné bloky v režimu čtení i zápis
Od bloku 0 do 15558143
Zkouším s náhodným vzorkem: hotovo                                                 
Čtení a porovnání: hotovo                                                 
Průchod dokončen, nalezeno 0 špatných bloků (0/0/0 chyb).

6.42user 15.65system 25:18.85elapsed 1%CPU (0avgtext+0avgdata 2672maxresident)k
31118848inputs+31116288outputs (12major+147minor)pagefaults 0swaps

*****  F3PROBE *****

F3 probe 7.2
Copyright (C) 2010 Digirati Internet LTDA. This is free software; see the source for copying conditions.

WARNING: Probing normally takes from a few seconds to 15 minutes, but it can take longer. Please be patient.

Good news: The device `/dev/sdf' is the real thing

Device geometry:
             *Usable* size: 14.84 GB (31116288 blocks)
            Announced size: 14.84 GB (31116288 blocks)
                    Module: 16.00 GB (2^34 Bytes)
    Approximate cache size: 0.00 Byte (0 blocks), need-reset=no
       Physical block size: 512.00 Byte (2^9 Bytes)

Probe time: 2'29"
 Operation: total time / count = avg time
      Read: 829.0ms / 4814 = 172us
     Write: 2'27" / 4192321 = 35us
     Reset: 0us / 1 = 0us

0.95user 0.67system 2:29.60elapsed 1%CPU (0avgtext+0avgdata 3404maxresident)k
5382inputs+4192321outputs (3major+362minor)pagefaults 0swaps

and the second SD (SanDisc Ultra 32 GB):

*****  BADBLOCK *****

Hledají se špatné bloky v režimu čtení i zápis
Od bloku 0 do 31166975
Zkouším s náhodným vzorkem: hotovo                                                 
Čtení a porovnání: hotovo                                                 
Průchod dokončen, nalezeno 0 špatných bloků (0/0/0 chyb).
8.24user 19.18system 32:30.84elapsed 1%CPU (0avgtext+0avgdata 3052maxresident)k
62334736inputs+62333952outputs (6major+157minor)pagefaults 0swaps

*****  F3PROBE *****

F3 probe 7.2
Copyright (C) 2010 Digirati Internet LTDA.
This is free software; see the source for copying conditions.

WARNING: Probing normally takes from a few seconds to 15 minutes, but
         it can take longer. Please be patient.

Good news: The device `/dev/sdf' is the real thing

Device geometry:
             *Usable* size: 29.72 GB (62333952 blocks)
            Announced size: 29.72 GB (62333952 blocks)
                    Module: 32.00 GB (2^35 Bytes)
    Approximate cache size: 0.00 Byte (0 blocks), need-reset=no
       Physical block size: 512.00 Byte (2^9 Bytes)

Probe time: 1'18"
 Operation: total time / count = avg time
      Read: 710.5ms / 4815 = 147us
     Write: 1'16" / 4192321 = 18us
     Reset: 0us / 1 = 0us
0.53user 0.34system 1:18.14elapsed 1%CPU (0avgtext+0avgdata 3536maxresident)k
4815inputs+4192321outputs (0major+359minor)pagefaults 0swaps

mkfs.fat 4.1 (2017-01-24)
Auto-selecting FAT32 for large filesystem
/dev/sdg1 has 64 heads and 32 sectors per track,
hidden sectors 0x0800;
logical sector size is 512,
using 0xf8 media descriptor, with 62330880 sectors;
drive number 0x80;
filesystem has 2 32-bit FATs and 32 sectors per cluster.
FAT size is 15232 sectors, and provides 1946887 clusters.
There are 32 reserved sectors.
Volume ID is 1b7d3c73, no volume label.
Searching for bad blocks 203360... 423136... 634720... 841856... 1058784... 1276128... 1482976... 1702624... 1922784... 2138080... 2343904... 2563936... 2783968... 2988000... 3202272... 3421024... 3629792... 3845600... 4065120... 4280800... 4485216... 4703712... 4923104... 5132256... 5344224... 5564000... 5778912... 5981280... 6197856... 6418528... 6628704... 6839168... 7058400... 7274720... 7478496... 7697760... 7916256... 8122848... 8335200... 8553184... 8768224... 8979424... 9198176... 9416544... 9620832... 9839712... 10060640... 10270944... 10479456... 10699232... 10914784... 11116768... 11334112... 11554528... 11763808... 11973088... 12193632... 12411232... 12614112... 12834144... 13054304... 13263328... 13474272... 13694432... 13912288... 14120928... 14341088... 14560480... 14766304... 14984800... 15204576... 15416544... 15625312... 15845600... 16064736... 16267616... 16484064... 16704608... 16916064... 17122656... 17342048... 17562208... 17765984... 17982176... 18201824... 18409952... 18619360... 18837856... 19055328... 19262944... 19481952... 19701600... 19907552... 20124768... 20344544... 20557792... 20766176... 20985696... 21204832... 21408736... 21619936... 21832928... 22046816... 22253664... 22473824... 22695232... 22899936... 23114720... 23334112... 23546656... 23752288... 23969760... 24178016... 24389728... 24603488... 24822752... 25035360... 25246176... 25464672... 25683936... 25888480... 26104160... 26323936... 26534112... 26740832... 26960608... 27179616... 27395296... 27610080... 27829472... 28042848... 28249568... 28468832... 28688480... 28893280... 29109728... 29328096... 29536224... 29741280... 29960800... 30178784... 30386272... 30602720... 30821472... 31029344... 
3.57user 33.08system 12:09.81elapsed 5%CPU (0avgtext+0avgdata 2124maxresident)k
62331040inputs+30528outputs (2major+2009minor)pagefaults 0swaps

Model: Neznámý (unknown)
Disk /dev/sdg1: 31,9GB
Sector size (logical/physical): 512B/512B
Tabulka oddílů: loop
Disk Flags: 

Číslo  Začátek  Konec   Velikost  Systém souborů  Přepínače
 1     0,00B    31,9GB  31,9GB    fat32

fsck.fat 4.1 (2017-01-24)
Checking we can access the last sector of the filesystem
Boot sector contents:
System ID "mkfs.fat"
Media byte 0xf8 (hard disk)
       512 bytes per logical sector
     16384 bytes per cluster
        32 reserved sectors
First FAT starts at byte 16384 (sector 32)
         2 FATs, 32 bit entries
   7798784 bytes per FAT (= 15232 sectors)
Root directory start at cluster 2 (arbitrary size)
Data area starts at byte 15613952 (sector 30496)
   1946887 data clusters (31897796608 bytes)
32 sectors/track, 64 heads
      2048 hidden sectors
  62330880 sectors total
Starting check/repair pass.
Checking for bad clusters.
Reclaiming unconnected clusters.
Checking free cluster summary.
Starting verification pass.
Checking for unused clusters.
/dev/sdg1: 0 files, 1/1946887 clusters
3.32user 25.88system 11:59.52elapsed 4%CPU (0avgtext+0avgdata 32112maxresident)k
62362336inputs+0outputs (4major+13403minor)pagefaults 0swaps

sd_bench and sd_stress pased multiple times:

nsh> 
nsh> 
nsh> sd_stress
iteration 0 took 2369326 us: OK
iteration 1 took 3534981 us: OK
iteration 2 took 2271921 us: OK
iteration 3 took 2345547 us: OK
iteration 4 took 2330682 us: OK
nsh> 
nsh> sd_bench -u -v
INFO  [sd_bench] Using block size = 4096 bytes, sync=0
INFO  [sd_bench] 
INFO  [sd_bench] Testing Sequential Write Speed...
INFO  [sd_bench]   Run  0:  2032.37 KB/s, max write time: 12 ms (= 333.33 KB/s), fsync: 7 ms
INFO  [sd_bench]   Run  1:  1874.72 KB/s, max write time: 67 ms (=  59.70 KB/s), fsync: 7 ms
INFO  [sd_bench]   Run  2:  2014.84 KB/s, max write time: 49 ms (=  81.63 KB/s), fsync: 12 ms
INFO  [sd_bench]   Run  3:  1846.66 KB/s, max write time: 67 ms (=  59.70 KB/s), fsync: 9 ms
INFO  [sd_bench]   Run  4:  1964.14 KB/s, max write time: 47 ms (=  85.11 KB/s), fsync: 54 ms
INFO  [sd_bench]   Avg   :  1946.65 KB/s
INFO  [sd_bench] 
INFO  [sd_bench] Testing Sequential Read Speed of 4913 blocks
INFO  [sd_bench]   Run  0:  2970.85 KB/s, max read/verify time: 7 ms (= 571.43 KB/s)
INFO  [sd_bench]   Run  1:  2891.55 KB/s, max read/verify time: 5 ms (= 800.00 KB/s)
INFO  [sd_bench]   Run  2:  2962.37 KB/s, max read/verify time: 5 ms (= 800.00 KB/s)
INFO  [sd_bench]   Run  3:  3083.00 KB/s, max read/verify time: 2 ms (=2000.00 KB/s)
INFO  [sd_bench]   Avg   :  2955.36 KB/s 4913 blocks read and verified
nsh> 

I merged your PR (#18687) into autogyro takeoff PR (#18582), which resulted into a commit: https://github.com/ThunderFly-aerospace/PX4Firmware/commit/252e4e94f6e07d8d155a628cd67d1b08c8525398. Everything looks good during these bench tests. But I can't confirm it, right now. I tried to do the same operations as in the previous attemt. And with each attempt, a complete log was created. In order to confirm the functionality of these improvements, we must do some flight-test in real enviroment.

roman-dvorak commented 2 years ago

Thanks! I can confirm that this problem did not occur during the flight test.