TwinFan / XPMP2

Multiplayer library for X-Plane 11/12 with instancing, TCAS override, and sound
https://twinfan.github.io/XPMP2/
Other
24 stars 16 forks source link

CTD in Aircraft::DoMove (fixed by X-Plane 11.55) #23

Closed TwinFan closed 2 years ago

TwinFan commented 4 years ago

Describe the bug CTD in the call to XPLMInstanceSetPosition:

void Aircraft::DoMove ()
{
    // Only for visible planes
    if (bVisible) {
        // Already have instances? Or succeeded in now creating them?
        if (!listInst.empty() || CreateInstances()) {
            // Move the instances (this is probably the single most important line of code ;-) )
            for (XPLMInstanceRef hInst: listInst)
                XPLMInstanceSetPosition(hInst, &drawInfo, v.data());
        }
    }
}

According to log situation seems to be:

Note the fact that art controls are modified...but in this case it looks rather like a lame excuse.

To Reproduce Yet unclear

Expected behavior No crash

Technical Info:

Log.txt

Log_Dump_CTD_DoMove.zip

TwinFan commented 3 years ago

One more user reports this kind of problem, happened so far twice to him (see Support Forum). 1st time.zip (Log was not debug level) 2nd time.zip (Log is debug level) ​ Call stack:

X-Plane.exe!00007ff767e24317()
X-Plane.exe!00007ff7678181cc()
X-Plane.exe!00007ff7678139a5()
XPLM_64.dll!00007ffb66dfc754()
[Inline Frame] LiveTraffic.xpl!XPMP2::Aircraft::DoMove() Line 416
    at C:\Users\Nutzer\Documents\Programming\XPMP2\src\Aircraft.cpp(416)
LiveTraffic.xpl!XPMP2::Aircraft::FlightLoopCB(float _elapsedSinceLastCall=0.0172119141, float __formal=1.121e-44#DEN, int _flCounter=20441, void * __formal=0x0000000000000000) Line 384
    at C:\Users\Nutzer\Documents\Programming\XPMP2\src\Aircraft.cpp(384)
XPLM_64.dll!00007ffb66e05e04()
XPLM_64.dll!00007ffb66dfbc12()
X-Plane.exe!00007ff767894b7c()
...

In the two available debug-level Log.txt files the last entry before crash has been

LT/XPMP2 DEBUG Aircraft.cpp:496/CreateInstances: Aircraft 0xA3F58D: Instance created

So the only clue we currently have is that the problem may happen more likely to just created instances. It is unclear from log or stack dump if the new instance was just created in the same flight loop cycle. But that would at least be something we could start working with: Maybe we can lower the crash likelihood if we leave a newly create instance alone for one cycle and start moving it only in the next. ​

TwinFan commented 3 years ago

xPilot's developer @justinshannon contacted me on Discord saying that he has users with the same issue, analyzed a bit and even contacted Ben Supnik of Laminar. Crash is when X-Plane copies the animation dataRefs (array XPMP2::Aircraft::v) with memcpy. I think this can only happen if either the source (XPMP2::Aircraft::v.data()) or the destination (wherever X-Plane puts it) are invalid, ie. either completely wrong pointers or include unallocated memory parts.

Size certainly also plays a role. Size of v is fixed during runtime, but can - when there are no planes - be changed by use of XPMPAddModelDataRef().

Ben suggested to dump the contents of v before calling XPLMInstanceSetPosition to verify if the inaccessible memory is noticed by the library already or only later in X-Plane. I think the same effect could be achieved by just reading/accessing first and last element of the array. That includes validation of the size as compared to what has been passed on to the original call to XPLMCreateInstance.

So essentially to move forward, an analysis version of XPMP2 would include:

TwinFan commented 3 years ago

Checking source code, I had to remember that no array sizes are actually passed on to X-Plane. The number of dataRefs is controlled by the original call to XPLMCreateInstance, in which the datarefs parameter contains a list of text pointers, and the last one of these pointers must be NULL (which isn't even clearly documented). This is inherently less robust than passing a clear-cut size...but that's how it is at the moment.

So to validate we do everything correct every time again and again:

TwinFan commented 3 years ago

Implemented above checks in the "next" branch of XPMP2. Changes are:

During startup XPMP reports a version number (all future version of XPMP will do), which is 1.11 for this analysis version:

0:00:00.000    LT/XPMP2 INFO  XPMPMultiplayer.cpp:154/XPMPMultiplayerInit: XPMP2 1.11 initializing under X-Plane version 11503/Vulkan and XPLM version 303

The above four validations are implemented. If they trigger (I tested by tweaking values in the debugger) then FATAL log entries like the following happen:

0:04:57.978    LT/XPMP2 FATAL Aircraft.cpp:528/CreateInstances: ASSERT FAILED: numDataRefsDuringCreateInstance == 0 || numDataRefsDuringCreateInstance == DR_NAMES.size()
0:04:57.978    LT/XPMP2 FATAL Aircraft.cpp:410/FlightLoopCB: EXCEPTION CAUGHT: 0:04:57.978    LT/XPMP2 FATAL Aircraft.cpp:528/CreateInstances: ASSERT FAILED: numDataRefsDuringCreateInstance == 0 || numDataRefsDuringCreateInstance == DR_NAMES.size()
...
0:05:46.590    LT/XPMP2 FATAL Aircraft.cpp:444/DoMove: XPMP2 Issue 23: Array sizes mismatch! v: 43, DR_NAMES: 43, numDataRefsDuringCreateInstance: 2
0:05:46.590    LT/XPMP2 FATAL Aircraft.cpp:445/DoMove: while processing A/C 0x06A2B7, object instance 0000029E361CA520
0:05:46.590    LT/XPMP2 FATAL Aircraft.cpp:447/DoMove: ASSERT FAILED: (v.size() == DR_NAMES.size()) && (DR_NAMES.size() == numDataRefsDuringCreateInstance)
0:05:46.590    LT/XPMP2 FATAL Aircraft.cpp:410/FlightLoopCB: EXCEPTION CAUGHT: 0:05:46.590    LT/XPMP2 FATAL Aircraft.cpp:447/DoMove: ASSERT FAILED: (v.size() == DR_NAMES.size()) && (DR_NAMES.size() == numDataRefsDuringCreateInstance)

In such cases processing is terminated as part of the exception handling, the plane destructed and hence no more calls to the XPLM Instancing API done. That also means that a user will not notice it immediately...so any tester must look into the Log.txt after a flight. But that's expected of testers, isn't it?

Just before calling XPLMInstanceSetPosition there is also a loop going over all dataRef values (the actual floats) and validating there aren't NAN. That makes sure each and every float value is actually accessible:

const float* data = v.data();
                for (size_t i = 0; i < numDataRefsDuringCreateInstance; ++i)
                    LOG_ASSERT(!std::isnan(data[i]));

If such a read access fails then this is an access violation, which will terminate X-Plane. But likely we see a Log.txt entry pointing to the plugin. And definitely I would see a call stack in the crash dump (.../Output/crash_reports/.dmp or .rpt) pointing into the above lines of code. So in such a case I need the crash dump for further analysis.

TwinFan commented 3 years ago

Two cases have been reported where xPilot crashed at the very same place but with the above validation included. The validations have not found anything wrong, but X-Plane crashed nonetheless.

We'll need to bring this to Ben's attention once again.

TwinFan commented 3 years ago

Following Ben's suggestion I'll build a "stress test plugin", which randomly and quickly creates and destroys models. So it will build up some planes to keep first, randomly set their position and a few dataRefs, destroy radnomly some every flight loop callback and creating some, too.

TwinFan commented 3 years ago

Ran a test for 13.5 hours straight, which created 3,132,171 planes successfully. X-Plane then showed a "low on memory" message, so maybe there's a small memory leak along loading objects, creating/removing instances. But even after 3,132,171 planes there's no crash. So there must be more to it than "just" instance creation and destruction.

TwinFan commented 3 years ago

Ben/Laminar are offering to build a debug version of X-Plane...if only we could reproduce the issue. Following an analysis of available logs suggesting that the presence of some plugins might be necessary (like SAM, GroundTraffic, or librain, as they show up in most if not all logs) I've installed a scenery with GoundTraffic and SAM and installed Zibo Mod's B738, which uses librain. But even then ran the stress test with no crash for 9 hours producing more than 1.8 million plane instances.

TwinFan commented 3 years ago

Merged above request #33 which X-Pilot developers hope to have a positive effect on the issue. I'm less optimistic, see the request`s comments.

justinshannon commented 3 years ago

Unfortunately, @mikeyusc's PR didn't fix our issue. I received a crash report this morning from a user using the latest version of xPilot that includes this fix. Same call stack as before.

X-Plane.exe!00007ff7e8d158f7()
X-Plane.exe!00007ff7e870932c()
X-Plane.exe!00007ff7e8704b05()
XPLM_64.dll!00007ffeb22ec754()
xPilot.xpl!XPMP2::Aircraft::DoMove(void)
xPilot.xpl!XPMP2::Aircraft::FlightLoopCB(float,float,int,void *)
XPLM_64.dll!00007ffeb22f5e04()
XPLM_64.dll!00007ffeb22ebc12()
X-Plane.exe!00007ff7e8785fac()
X-Plane.exe!00007ff7e7d67515()
X-Plane.exe!00007ff7e8183d2a()
X-Plane.exe!00007ff7e8ad10ba()
X-Plane.exe!00007ff7e8ac3d56()
X-Plane.exe!00007ff7e8cf26d7()

Still trying to elicit a response from Ben on an X-Plane debug build...

justinshannon commented 3 years ago

I combed through my Discord server and gathered the logs I could find. X-Plane Logs.zip

TwinFan commented 3 years ago

I had started to look into the above set of logs. Not all reviewed yet. But the only common thing to all I've seen so far is that all sessions use either BetterPushback or GrondTraffic, often both. Ie. there's always a plugin in use, which also creates instanced objects (which underpins a theory discussed with Laminar).

Today I happened to receive a crash from a LiveTraffic users. He's running a beta version for me with debug-level log output, which is fortunate. The change in the beta deals with modified lock handling to better safeguard LT data structures across worker threads. XPMP2 is unchanged compared to official LiveTraffic (which is not running the validation code, which is active in xPilot builds). Log and crash dump look as usual:

The crash happened rather quickly after just 8 minutes. The user as agreed to just try again. CTD_DoMove_v2.41b.zip

TwinFan commented 3 years ago

One more case confirmed (a second likely) in this post of the LiveTraffic forum, happening with LiveTraffic v2.40. Here with BetterPushback and Art Controls Modified, ie. fits into the other cases. Log wasn't DEBUG level, but last line happened to be a warning about a plane to be created with standard A320 model, so also here it is likely that LiveTraffic was creating a new instance.

TwinFan commented 3 years ago

Analysis together with Laminar unveiled that the likely cause is that X-Plane tries to copy more dataRef values than intended and we think we had passed in during the previous call to XPLMCreateInstance. The validation code inserted a view commits back validates all these figures and sizes and that NULL elements are added at the right places, so the mystery that remains is: Why do X-Plane and XPMP2 in those crash szenarios not agree on the number of dataRef values?

I was able to purposely enforce the crash by providing 100 more dataRef strings than values...and not even on the first aircraft created...but pretty soon.

That means the frequency of the crash to occur could be reduced by providing more memory than expected in the call to XPLMInstanceSetPosition. That wouldn't be a fix, far from it...it would just mean less frequent crashes. Not sure where that would lead us to...

TwinFan commented 3 years ago

I was able to reproduce the real crash (with published software versions) once now! Scenario was LiveTraffic at busy airport KLAX (100 planes in the air), 3 parallel instances of the stress test plugin (producing new planes and removing planes all the time), GroundTraffic-enhanced version of a KLAX scenery, plus BetterPushback. So quite a lot combined...but it did produce the crash some one or two minutes after finishing one pushback., 11 minutes into the session.

I'll try again...second attempt is now runnng for 20 minutes already without a crash...and continued to do so the next 9 hours through the night. So I still cannot yet reproduce the crash at will.

TwinFan commented 3 years ago

Summary of analysis as of today

Issue 23 describes a crash due to memory access violations. Analysis, also together with Laminar, says, that it happens while XPLMInstanceSetPosition copies the provided dataRef values into local storage, apparently reading more memory than provided.

Lot's of validtion code had been introduced to verify that XPMP2 provides the expected amount of dataRef names to XPLMCreateInstance and subsequently the same amount of data to XPLMInstanceSetPosition, still crashes happen very very rarely.

All crashes happen when other plugins are involved, too, that create instances, like GroundTraffic or BetterPushback. So the theory is that somehow there is a cross-plugin interference, most likely in the area of XPLMCreateInstance. This is underpinned by the fact that all crashes happen shortly after a new instance had been created (as shown by DEBUG-level Log.txt entries), though it could not yet be confirmed that the crash happened right with this newly create instance; it's just only very likely at the current state of analysis.

The number of dataRef values is defined in the call to XPLMCreateInstance by way of the number of dataRef strings passed in. In the subsequent call to XPLMInstanceSetPosition the plugin can no longer control the amount of data read by XPLMInstnceSetPosition directly: It will read as many floats as it thinks it got defined during the XPLMCreateInstance call. Somewhere between the plugin's call to XPLMCreateInstance and the call to XPLMInstanceSetPosition the problem occurs and these counts differ, making X-Plane believe it needs to read more dataRef float values than provided.

The just checked-in code aims at reducing the impact of the copy operation in XPLMInstanceSetPosition going too far: Instead of passing dynamically allocated heap memory (as it was before, when passing XPMP2::Aircraft::v.data() to XPLMInstanceSetposition), XPMP2 now copies the content of said v.data() to a statically(!) provided buffer in DATA segment of the plugin. Even if XPLMInstanceSetMemory reads too much, chances are way better that it will not tap into prohibited regions of memory, causing a access violation and crash.

This shall make the crash less likely to happen. THIS IS NOT A FIX!

The downside is that we need to allocate a fixed amount of static memory at compile time. This is no problem if the plugin only uses the standard dataRefs provided by XPMP2 directly. But XPMP2 allows to add more dataRefs to control by calls to XPMP2::XPMPAddModelDataRef(). Unfortunately, we now have to restrict how many such calls can be made.

TwinFan commented 3 years ago

Finally, the underlying cause is found: In a developer channel, Laminar confirmed a bug in the X-Plane Instancing SDK, which could cause X-Plane to copy too much memory, which "completely expains the crashes y'all saw. sim bug."

The fix seems to be in the X-Plane binary, a preview of it was provided on the developer channel. So the solution will likely come with the next X-Plane release.

TwinFan commented 3 years ago

X-Plane 11.55 fixes the underlying bug, see release notes or blog.

This issue stays open for a while to undo some of the workarounds like this one in due time.