ArduPilot / ardupilot

ArduPlane, ArduCopter, ArduRover, ArduSub source
http://ardupilot.org/
GNU General Public License v3.0
10.46k stars 17.1k forks source link

ArduCopter: `main_loop` keeps waiting for other thread to release semaphore #24942

Open sahilsihag opened 12 months ago

sahilsihag commented 12 months ago

Bug report

Issue summary
io_thread can be forced to obtain a semaphore and not release it for long time. Meanwhile main_loop thread might also require the same semaphore, in this case ChibiOS puts main_loop into sleeping state (chSchGoSleepS(CH_STATE_WTMTX)) while waiting for io_thread to release semaphore.

Issue details

MAVLink packet: fd0400003effe68700002186678f448a

Information:
Payload: TERRAIN_CHECK (135)
    lat (int32_t): -1889040863
    lon (int32_t): 0

leads to case DiskIoWaitRead (libraries/AP_Terrain/TerrainIO.cpp).

    case DiskIoWaitRead:
        // need to read in the block
        open_file();
        if (fd == -1) {
            return;
        }
        read_block();
                                         // inside `read_block()`
                                         // `AP::FS().lseek(fd, file_offset, SEEK_SET)` <-- tries to lseek to file_offset (calculated from latitude, longitude)
                                             // this calls `AP_Filesystem_FATFS::lseek` and it obtains semaphore sem `WITH_SEMAPHORE(sem);

After obtaining the semaphore sem, it calls f_lseek (ardupilot/modules/ChibiOS/ext/fatfs/source/ff.c) where the offset (ofs) value is 280303616 and cluster size (bcs) is 4096. Because of the high offset value, large amount of time is consumed before exiting out of the while loop in this function.

                while (ofs > bcs) {                     /* Cluster following loop */
                    ofs -= bcs; fp->fptr += bcs;
#if !FF_FS_READONLY
                    if (fp->flag & FA_WRITE) {          /* Check if in write mode or not */
                        if (FF_FS_EXFAT && fp->fptr > fp->obj.objsize) {    /* No FAT chain object needs correct objsize to generate FAT value */
                            fp->obj.objsize = fp->fptr;
                            fp->flag |= FA_MODIFIED;
                        }
                        clst = create_chain(&fp->obj, clst);    /* Follow chain with forceed stretch */
                        if (clst == 0) {                /* Clip file size in case of disk full */
                            ofs = 0; break;
                        }
                    } else
#endif
                    {
                        clst = get_fat(&fp->obj, clst); /* Follow cluster chain if not in write mode */
                    }
                    if (clst == 0xFFFFFFFF) ABORT(fs, FR_DISK_ERR);
                    if (clst <= 1 || clst >= fs->n_fatent) ABORT(fs, FR_INT_ERR);
                    fp->clust = clst;
                }

Therefore, we have an io_thread which obtains semaphore sem and does not leave it for a long time. In the mean time, we can request log list with the help of another packet:

MAVLink packet: fd0500003fffe67500000000ffff01ce67

Information:
Payload: LOG_REQUEST_LIST (117)
    target_system (uint8_t): 1
    target_component (uint8_t): 0
    start (uint16_t): 0
    end (uint16_t): 65535

This forces main_loop to obtain semaphore sem for calculating number of available logs. Since the semaphore is already owned by another thread, we trigger priority inheritance protocol of ChibiOS and put main_loop into sleeping state (chSchGoSleepS(CH_STATE_WTMTX)) for the mean time.

Additional Comments

  1. Card speed: The time spent in while loop is dependent on microSD card speed. Speed class 4 takes a while and newer microSD cards take less time. But offset value can also be increased to a bigger number to incorporate new cards.
  2. Requesting log list is only possible when the drone is not armed. I have not exhaustively searched for operations in main_loop which try to obtain same semaphore while in flight.
    • In general, what is the behavior when a thread does not release a semaphore that is required by main_loop?
  3. Additional threads such as FTP that require same semaphore also stop responding.
  4. OmnibusF4 restarts when such situation occurs. I am not sure if the response is same on all boards.

Version
ArduCopter V4.4.0 (502702df)

Platform
[ ] All [ ] AntennaTracker [x] Copter -> with microSD card (FATFS) [ ] Plane [ ] Rover [ ] Submarine

Airframe type
Not applicable.

Hardware type
omnibusF4 with microSD card (FATFS)

Logs
Mavproxy logs:

Init ArduCopter V4.4.0 (502702df) ...

sahilsihag commented 11 months ago

Correction: Packet 2 for log list was incorrect. Updated with correct MAVLink packet fd0500003fffe67500000000ffff01ce67.

tridge commented 8 months ago

@sahilsihag the TERRAIN_CHECK will only pull from the in-memory cache. The filesystem layer in ArduPilot refuses any file operations from the main thread when armed. When disarmed you can do file ops, but even there the AP_Terrain/TerrainIO.cpp separates out disk ops to not happen from the main thread

rmackay9 commented 8 months ago

We should probably close this issue then I think

sahilsihag commented 8 months ago

@tridge I agree that the current situation is only possible when the drone is not armed (See Additional Comments: point 2). But I will still like to have discussion about the underlying problem: the interplay between main thread and another thread for shared resources.

When terrain check does not find things inside in-memory cache, it is possible to lead it to DiskIoWaitRead. In this case separate IO thread takes care of reading files from disk. In the issue described above:

  1. IO thread can take long time to read from disk
  2. Meanwhile when log list is requested, main thread keeps waiting for IO thread to finish its work (see priority inheritance protocol of ChibiOS). Therefore, we can starve main thread when the drone is not armed and cause firmware restarts repeatedly on ground.

While this one example is dependent on file system operation (and not a problem while device is armed), this situation can also arise in other cases due to priority inheritance protocol. My main question is: What saves the main thread from becoming a victim in other similar situations?

tridge commented 8 months ago

@sahilsihag I don't think it even happens when disarmed. Do you have a gdb backtrace showing it calling an IO function (open/read/write) from the main thread? The functions called by the main thread just schedule IO, but don't actually do it

tridge commented 8 months ago

note that if we did have a bug like this then as soon as you armed we'd get an internal error and very loud shouting in the log, as we tell the pilot about an attempt to do IO from the main thread when armed

rmackay9 commented 8 months ago

I think I might go ahead and close this. We will still see replies if there are any though. Sorry to be annoying but I think this issue is really more of a question and not something that we've actually seen in practice.

sahilsihag commented 8 months ago

@sahilsihag I don't think it even happens when disarmed. Do you have a gdb backtrace showing it calling an IO function (open/read/write) from the main thread? The functions called by the main thread just schedule IO, but don't actually do it

This issue is only possible when disarmed but it does happen.

IO thread backtrace (obtaining semaphore sem) libraries/AP_Filesystem/AP_Filesystem_FATFS.cpp:291:

(gdb) bt
#0  AP_Filesystem_FATFS::open (this=<optimized out>, pathname=0x10008f68 "/APM/TERRAIN/N67E000.DAT", flags=514, allow_absolute_path=<optimized out>) at ../../libraries/AP_Filesystem/AP_Filesystem_FATFS.cpp:291
#1  0x08076728 in AP_Filesystem::open (this=<optimized out>, fname=<optimized out>, flags=flags@entry=514, allow_absolute_paths=allow_absolute_paths@entry=false) at ../../libraries/AP_Filesystem/AP_Filesystem.cpp:122
#2  0x0805e8b6 in AP_Terrain::open_file (this=this@entry=0x200054f4 <copter+12372>) at ../../libraries/AP_Terrain/TerrainIO.cpp:203
#3  0x0805ec88 in AP_Terrain::io_timer (this=0x200054f4 <copter+12372>) at ../../libraries/AP_Terrain/TerrainIO.cpp:372
#4  0x0805eca2 in Functor<void>::method_wrapper<AP_Terrain, &AP_Terrain::io_timer> (obj=<optimized out>) at ../../libraries/AP_HAL/utility/functor.h:88
#5  0x080ab4aa in Functor<void>::operator() (this=0x2000c6b0 <schedulerInstance+144>) at ../../libraries/AP_HAL/utility/functor.h:54
#6  ChibiOS::Scheduler::_run_io (this=this@entry=0x2000c620 <schedulerInstance>) at ../../libraries/AP_HAL_ChibiOS/Scheduler.cpp:508
#7  0x080ab6f2 in ChibiOS::Scheduler::_io_thread (arg=0x2000c620 <schedulerInstance>) at ../../libraries/AP_HAL_ChibiOS/Scheduler.cpp:532
#8  0x08011406 in _port_thread_start ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

IO thread already has semaphore sem and now main tries to obtain it. Backtrace for main (trying to obtain semaphore sem) libraries/AP_Filesystem/AP_Filesystem_FATFS.cpp:695:

(gdb) bt
#0  AP_Filesystem_FATFS::opendir (this=<optimized out>, pathdir=0x80cf6f0 "/APM/LOGS") at ../../libraries/AP_Filesystem/AP_Filesystem_FATFS.cpp:695
#1  0x080768a2 in AP_Filesystem::opendir (this=<optimized out>, pathname=<optimized out>) at ../../libraries/AP_Filesystem/AP_Filesystem.cpp:198
#2  0x080630be in AP_Logger_File::get_num_logs (this=0x20019ae0) at ../../libraries/AP_Logger/AP_Logger_File.cpp:698
#3  0x0806120c in AP_Logger::get_num_logs (this=this@entry=0x20004b04 <copter+9828>) at ../../libraries/AP_Logger/AP_Logger.cpp:818
#4  0x080a33ec in AP_Logger::handle_log_request_list (this=this@entry=0x20004b04 <copter+9828>, link=..., msg=...) at ../../libraries/AP_Logger/AP_Logger_MAVLinkLogTransfer.cpp:78
#5  0x080a375e in AP_Logger::handle_log_message (this=0x20004b04 <copter+9828>, link=..., msg=...) at ../../libraries/AP_Logger/AP_Logger_MAVLinkLogTransfer.cpp:49
#6  0x08061288 in AP_Logger::handle_mavlink_msg (this=0x20004b04 <copter+9828>, link=..., msg=...) at ../../libraries/AP_Logger/AP_Logger.cpp:844
#7  0x0806cb24 in GCS_MAVLINK::handle_common_message (this=this@entry=0x2001afc0, msg=...) at ../../libraries/GCS_MAVLink/GCS_Common.cpp:3806
#8  0x0801511e in GCS_MAVLINK_Copter::handleMessage (this=0x2001afc0, msg=...) at ../../ArduCopter/GCS_Mavlink.cpp:1421
#9  0x0806b56e in GCS_MAVLINK::packetReceived (this=0x2001afc0, status=..., msg=...) at ../../libraries/GCS_MAVLink/GCS_Common.cpp:1662
#10 0x08013e6a in GCS_MAVLINK_Copter::packetReceived (this=<optimized out>, status=..., msg=...) at ../../ArduCopter/GCS_Mavlink.cpp:605
#11 0x0806a786 in GCS_MAVLINK::update_receive (this=0x2001afc0, max_time_us=max_time_us@entry=1000) at ../../libraries/GCS_MAVLink/GCS_Common.cpp:1713
#12 0x0806d30a in GCS::update_receive (this=0x20004d50 <copter+10416>) at ../../libraries/GCS_MAVLink/GCS_Common.cpp:2372
#13 0x08012876 in Functor<void>::method_wrapper<GCS, &GCS::update_receive> (obj=<optimized out>) at ../../libraries/AP_HAL/utility/functor.h:88
#14 0x0805d396 in Functor<void>::operator() (this=0x80c1028 <Copter::scheduler_tasks+740>) at ../../libraries/AP_HAL/utility/functor.h:52
#15 AP_Scheduler::run (this=this@entry=0x200024f8 <copter+88>, time_available=2136) at ../../libraries/AP_Scheduler/AP_Scheduler.cpp:255
#16 0x0805d8a6 in AP_Scheduler::loop (this=this@entry=0x200024f8 <copter+88>) at ../../libraries/AP_Scheduler/AP_Scheduler.cpp:383
#17 0x0805f8b4 in AP_Vehicle::loop (this=0x200024a0 <copter>) at ../../libraries/AP_Vehicle/AP_Vehicle.cpp:308
#18 0x0808f720 in main_loop () at ../../libraries/AP_HAL_ChibiOS/HAL_ChibiOS_Class.cpp:289
#19 0x0808f760 in HAL_ChibiOS::run (this=<optimized out>, argc=<optimized out>, argv=<optimized out>, callbacks=0x2000c7b8 <AP_HAL::get_HAL()::hal_chibios>) at ../../libraries/AP_HAL_ChibiOS/HAL_ChibiOS_Class.cpp:341
#20 0x080136e8 in main (argc=<optimized out>, argv=<optimized out>) at ../../ArduCopter/Copter.cpp:797
#21 0x08011396 in endinitloop ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
sahilsihag commented 8 months ago

note that if we did have a bug like this then as soon as you armed we'd get an internal error and very loud shouting in the log, as we tell the pilot about an attempt to do IO from the main thread when armed

The board will restart and we will see:

AP: WDG: T41 SL695 FL0 FT0 FA0 FTP0 FLR0 FICSR0 MM117 MC0 IE32768 IEC15 TN:
...
AP: Internal Errors 0x800
...
tridge commented 8 months ago

thanks for the backtraces! that makes things much clearer