TwinFan / LiveTraffic

LiveTraffic is an X-Plane multiplayer plugin, which fills your sky with live air traffic based on public flight tracking data.
https://twinfan.gitbook.io/livetraffic/
Other
99 stars 24 forks source link

CTD at KDEN in DataCleansing / GetToPos #207

Closed TwinFan closed 3 years ago

TwinFan commented 3 years ago

Describe the bug One user reports reoccuring crashes when nearing or being at KDEN, standard X-Plane scenery. In the crash dumps he provided two patterns are visible, but they are just one line apart in the source code, so I treat them for the moment as related.

At the same time, other provided crash dumps point into X-Plane. We should not completely forget about them because what we see here is quite weird...not sure what happened to our memory. And maybe sometimes it happens to X-Plane's memory?

To Reproduce Have LiveTraffic run at KDEN. Channels involved are OpenSky and Open Glider Network. No ADSBEx, no RealTraffic. Active channels are important as the issue is about auto-land aircraft.

Expected behavior No crash.

Technical Info:

1st Pattern in LTFlightData:916 calling pAc->GetToPos

Log.txt Two dumps report the same call stack: crash_report_01_24_2021_17_25_50.zip crash_report_01_22_2021_18_42_05.zip

Call Stack

[Inline Frame] LiveTraffic.xpl!positionTy::operator<(const positionTy &) Line 383
    at LiveTraffic\Include\CoordCalc.h(383)
LiveTraffic.xpl!LTAircraft::GetToPos() Line 1352
    at LiveTraffic\src\LTAircraft.cpp(1352)
LiveTraffic.xpl!LTFlightData::DataCleansing(bool & bChanged=false) Line 564
    at LiveTraffic\src\LTFlightData.cpp(564)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611509086.0000000) Line 1016
    at LiveTraffic\src\LTFlightData.cpp(1016)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1353
    at LiveTraffic\src\LTFlightData.cpp(1353)
[Inline Frame] LiveTraffic.xpl!std::invoke(void(*)() &&) Line 1584
    at MSVC\14.28.29333\include\type_traits(1584)
LiveTraffic.xpl!std::thread::_Invoke<std::tuple<void (__cdecl*)(void)>,0>(void * _RawVals=0x00000240eaa342b0) Line 44
    at MSVC\14.28.29333\include\thread(44)
ucrtbase.dll!_crt_at_quick_exit()
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()

In GetToPos it crashes while applying the operator <:

// position heading to (usually posList.back(), but ppos if ppos > posList.back())
const positionTy& LTAircraft::GetToPos() const
{
    if ( posList.size() >= 2 )
===>    return ppos < posList.back() ? posList.back() : ppos;
    else
        return ppos;
}

The operator < is the actual point of bad memory access. It's a one-liner:

    inline bool operator<  (const positionTy& p) const { return ts() < p.ts(); }

The memory location accessed is reported as 0x0000000000000018. Probably, the timestamp ts is just 18 bytes down from the beginning. Interesting is, however, that the crash location is the operator <, not the also inlined function ts(), which returns one of the positionTy::v valarray elements.

2nd Pattern in LTFlightData:917 calling LTAptFindRwy

This is just one line down in CalcNextPos, ie. right after the call to pAc->GetToPos.

Log Text crash_report_01_22_2021_17_19_06.zip

Call Stack

LiveTraffic.xpl!LTAptFindRwy(const LTAircraft::FlightModel & _mdl, const positionTy & _from, double _speed_m_s, std::string & _rwyId, const std::string & _logTxt={...}) Line 2511
    at LiveTraffic\Src\LTApt.cpp(2511)
[Inline Frame] LiveTraffic.xpl!LTAptFindRwy(const LTAircraft &) Line 55
    at LiveTraffic\Include\LTApt.h(55)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611335878.1228712) Line 917
    at LiveTraffic\src\LTFlightData.cpp(917)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1353
    at LiveTraffic\src\LTFlightData.cpp(1353)
... (same as above)

So we now call into LTAptFindRwy and crash while reading address 0x0000000000000020 (so it is 2 bytes later) in line 2511 of LTApt.cpp:

        const Apt& apt = iterApt->second;

        // Find the rwy endpoints matching the current plane's heading
        for (const RwyEndPt& re: apt.GetRwyEndPtVec())
        {
            // skip if rwy heading differs too much from flight heading
===>        if (std::abs(HeadingDiff(re.heading, _from.heading())) > ART_RWY_MAX_HEAD_DIFF)
                continue;

Analysis

The crash is in the LT_CalcPos worker thread, which is why the X-Plane log couldn't identify LiveTraffic as the crash didn't occur in X-Plane's main thread.

Situation is: Plane is approaching with no live data. Auto-land is kicking in. But before (1st pattern) or while (2nd pattern= trying to actually find a runway the crash happens at the above locations.

1st pattern We are in code of an aircraft object. The aircraft object has been used before. One could think into the direction of aircraft removal while this code is going on...but aircraft removal is pretty tightly safeguarded with two mutexes. And if that would be the reason we would hear a lot more people scream and it would not be limited to KDEN.

2nd pattern is totally unrelated to the aircraft object but combs through the apt scenery data. The objects involved are the airport, its rwy endpoints, and a position passed down from CalcNextPos. And guess what...the position is the result of a call to _ac.GetToPos() in LTAptFindRwy (LTApt.h:55).

So the constant in both patterns is the aircraft's position. But why at KDEN?

TwinFan commented 3 years ago

Could be linked to #208: The latter makes planes disappear rather suddenly quite often, way more often than what normally would be expected. So if there is still some kind of race condition that would allow a plane to be removed while in a separate thread calculations on it are still on, then the auto-land issue #208 would increase the likelihood of this race condition being triggered. How is that related to KDEN? Maybe only because it has rather much traffic but maybe OpenSky's coverage doesn't always extend down to ground, causing more auto-lands.

TwinFan commented 3 years ago

Accessing aircraft in LTFlightData::CalcNextPosMain is not guarded by the LTFlightData::dataAccessMutex mutex. Only a little later, in CalcNextPos (called by CalcNextPosMain) the lock is taken. Also, while removing aircraft completely, the global mapFdMutex is in use, but when starting to remove an individual plane the aircraft-specific LTFlightData::dataAccessMutex is not set. It is set a little later in the destructor of LTFlightData.

So there is a theoretical point in time, as short as it may be, for a race condition:

Solution: Consequently guard all LTFlightData member functions with the dataAccessMutex.

TwinFan commented 3 years ago

Code changes didn't solve the problem. New crashes surfaced at slightly different positions, obviously in different situation, but still it is the calculation thread, which crashes, and still it is when accessing the plane's current position Aircraft::ppos, complaining

Access violation reading location 0x0000000000000018

In the 4th pattern, the exception shown in Visual Studio even clearly reads:

Unhandled exception thrown: read access violation. ppos_i.v._Myptr was nullptr

stating that an internal pointer of the std::valarray implementation of the runtime library was nullptr causing the access violation when trying to access the timestamp, stored in the 4th valarray element.

While sorting all crash dumps I noted that one was left from yesterday, which I forgot to assign to any of the patterns...turns out: That crash dump of yesterday shows what is now noted here as pattern 3. => Pattern 3 is not new but happened before the code change already!

What strikes me is that for reaching the point of crash in the 4th pattern, execution came very surely past the spot in the 3rd pattern: pAc && posDeque.size() >= 1 must be fulfilled for the 4th pattern to happen, which is the condition allowing the code in the 3rd pattern to happen. Why didn't it crash there already?

3rd Pattern

Call Stack

[Inline Frame] LiveTraffic.xpl!positionTy::operator<(const positionTy &) Line 383
    at LiveTraffic\Include\CoordCalc.h(383)
LiveTraffic.xpl!LTAircraft::GetToPos() Line 1352
    at LiveTraffic\src\LTAircraft.cpp(1352)
LiveTraffic.xpl!LTFlightData::DataCleansing(bool & bChanged=false) Line 574
    at LiveTraffic\src\LTFlightData.cpp(574)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611630439.0000000) Line 1032
    at LiveTraffic\src\LTFlightData.cpp(1032)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1375
    at LiveTraffic\src\LTFlightData.cpp(1375)
[Inline Frame] LiveTraffic.xpl!std::invoke(void(*)() &&) Line 1584
    at MSVC\14.28.29333\include\type_traits(1584)
LiveTraffic.xpl!std::thread::_Invoke<std::tuple<void (__cdecl*)(void)>,0>(void * _RawVals=0x00000153cec5fbe0) Line 44
    at MSVC\14.28.29333\include\thread(44)
ucrtbase.dll!_crt_at_quick_exit()
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()

So it is again Aircraft::GetToPos(), which crashes, we had that before in the 1st pattern. But it is called from a different place this time:

    // *** Remove weird positions ***
    if (( pAc && posDeque.size() >= 1) ||
        (!pAc && posDeque.size() >= 3))
    {
        positionTy pos1;
        vectorTy v2;
        double h1 = NAN;
        dequePositionTy::iterator iter = posDeque.begin();

        // position _before_ the first position in the deque
        if (pAc) {
--->        pos1 = pAc->GetToPos();
            h1 = pAc->GetTrack();       // and the heading towards pos1

crash_report_01_25_2021_15_28_06 - 3rd pattern.zip crash_report_01_26_2021_03_08_22.zip

4th Pattern

Call Stack

[Inline Frame] LiveTraffic.xpl!positionTy::ts() Line 414
    at LiveTraffic\Include\CoordCalc.h(414)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611675999.0000000) Line 1149
    at LiveTraffic\src\LTFlightData.cpp(1149)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1375
    at LiveTraffic\src\LTFlightData.cpp(1375)
[Inline Frame] LiveTraffic.xpl!std::invoke(void(*)() &&) Line 1584
    at MSVC\14.28.29333\include\type_traits(1584)
LiveTraffic.xpl!std::thread::_Invoke<std::tuple<void (__cdecl*)(void)>,0>(void * _RawVals=0x0000026b4f2b3920) Line 44
    at MSVC\14.28.29333\include\thread(44)
ucrtbase.dll!_crt_at_quick_exit()
kernel32.dll!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()

The place calling positionTy::ts() is

            else for (size_t i = 0;
                      std::isnan(rotateTS) && posDeque.size() >= i+1;
                      i++ )
            {
                // i == 0 is as above with actual a/c present position
                // in later runs we use future data from our queue
                const positionTy& ppos_i  = i == 0 ? pAc->GetPPos() : posDeque[i-1];
                positionTy& to_i          = posDeque[i];
                const double to_i_ts      = to_i.ts();  // the reference might become invalid later once we start erasing, so we copy this timestamp that we need

                // we look up to 60s into the future
--->            if (ppos_i.ts() > simTime + MDL_TO_LOOK_AHEAD)
                    break;

crash_report_01_26_2021_15_47_41 - 4th pattern.zip

TwinFan commented 3 years ago

Will try to catch positionTy::v._Myptr == nullptr if I find a way to get access to that internal member variable, and bail out with a proper exception, which can be handled in the try/catch blocks. Maybe we can then get further insights this way.