UE4SS-RE / RE-UE4SS

Injectable LUA scripting system, SDK generator, live property editor and other dumping utilities for UE4/5 games
http://docs.ue4ss.com/
MIT License
1.01k stars 146 forks source link

[BUG - main] StaticConstructObject hook causes crashes in loading from FAsyncPackage::EventDrivenCreateExport #317

Closed Yangff closed 5 months ago

Yangff commented 5 months ago

Branch or Release main

Game and Engine Version UE5.1.1 Palworld

Describe the bug Crashes occur randomly when running the game, and the following dumps have been collected for two different crashes:

Crashes when executing the original StaticConstructObject_Internal, with the following calling stack: StaticConstructObject_Internal > StaticAllocateObject > UObjectBase::AddObject> FUObjectArray::AllocateUObjectIndex

AllocateUObjectIndex crashes when checking UObjectCreateListeners. https://github.com/EpicGames/UnrealEngine/blob/072300df18a94f18077ca20a14224b5d99fee872/Engine/Source/Runtime/CoreUObject/Private/ UObject/UObjectArray.cpp#L245

Here unreal reads the callback function and makes a callback, however the Data pointer for UObjectCreateListeners is null when Num() is not 0, disassemble as follow:

Snipaste_2024-01-29_22-01-21

Another crash looks like it's calling into address zero. image

The caller function is FAsyncPackage::EventDrivenCreateExport(FAsyncPackage *this, int LocalExportIndex) at: https://github.com/EpicGames/UnrealEngine/blob/072300df18a94f18077ca20a14224b5d99fee872/Engine/Source/Runtime/CoreUObject/Private/Serialization/AsyncLoading.cpp#L3377

Notice that the first dump has a incomplete stack which says the caller is 0x4.. and if we inspect that stack we can see the trampoline did something after the call. image And the real RIP is 0x7ff631b98718 which is again FAsyncPackage::EventDrivenCreateExport(FAsyncPackage *this, int LocalExportIndex)

From this we conclude that these two seemingly different crashes at the game start should be due to similar causes.

Since minidump is missing quite a bit of information I can't make any further analysis, but I suspect that maybe because the hooks and asynchronous loading are happening at the same time, perhaps consider pausing other threads or hooking earlier while hooking.

To Reproduce UEMinidump.dmp UEMinidump.dmp The minidump doesn't contain the memory of the program segment, so no sign of a hook can be seen in it, but the stack is included and trampoline traces can be seen.

trumank commented 5 months ago

There are a few concerns with how the object listeners are handled and I have not taken a deep dive into fixing it in UE4SS, but I have explored it a bit for my own tooling.

The create and delete listeners are called from many threads, not just the main thread, so any listeners must be threads safe. This is especially so when loading assets from disk as the loader seems to be async so as to not block the main thread, I assume. During typical gameplay/game tick the listeners seem to be mostly called only from the main thread.

The lifetime of the object pointer passed to the listener also seems potentially problematic to me because the listener is called outside of the object array lock, after the object bookkeeping has been done, so there is potential for the object to be deleted during or before the listener call. Whether that happens in practice, I do not know. UObjects are guaranteed to exist until the end of the current game tick, but that doesn't help when the listener is called outside of the main thread.

And, of course, registering the listeners (well, the create listener anyway) isn't thread safe either, though I'm a little surprised that's the crash we're seeing here. I was expecting it to come from something inside the listeners themselves. The delete listener does actually have a critical section in FUObjectArray struct that can be used to register safely, though the create listener would need to be made safe another way.

UE4SS commented 5 months ago

There are a few concerns with how the object listeners are handled and I have not taken a deep dive into fixing it in UE4SS, but I have explored it a bit for my own tooling.

The create and delete listeners are called from many threads, not just the main thread, so any listeners must be threads safe. This is especially so when loading assets from disk as the loader seems to be async so as to not block the main thread, I assume. During typical gameplay/game tick the listeners seem to be mostly called only from the main thread.

The lifetime of the object pointer passed to the listener also seems potentially problematic to me because the listener is called outside of the object array lock, after the object bookkeeping has been done, so there is potential for the object to be deleted during or before the listener call. Whether that happens in practice, I do not know. UObjects are guaranteed to exist until the end of the current game tick, but that doesn't help when the listener is called outside of the main thread.

And, of course, registering the listeners (well, the create listener anyway) isn't thread safe either, though I'm a little surprised that's the crash we're seeing here. I was expecting it to come from something inside the listeners themselves. The delete listener does actually have a critical section in FUObjectArray struct that can be used to register safely, though the create listener would need to be made safe another way.

I can confirm that we don't take thread safety into account regarding the add/delete listeners. I've never actually seen a game have any listeners, and the fact that there is no critical section for the add listeners made me not bother implementing it for the delete listeners. EDIT: I suppose it's entirely possible that a game adds a listener, and removes it long before I've looked in memory to see if any listeners exist.

We use these listeners for all our UObject caches, which are supposed to speed things up but might actually be causing lag, I'm a bit confused at this point. We also use them for the Live View tab to let us know when we need to remove an entry because otherwise we'll be operating on bad data when showing property values and retrieving names. It's also used for keeping history (which I plan to remove eventually) and search results in the Live View tab up-to-date. Finally, it's used to verify that objects are still alive when calling IsValid on UObjects from Lua.

Yangff commented 5 months ago

At least one problem I'm seeing here is:

        FORCEINLINE SizeType AddUninitialized(SizeType Count = 1)
        {
            CheckInvariants();
            const SizeType OldNum = ArrayNum;
            if (Count < 0)
            {
                Output::send<LogLevel::Warning>(STR("Number of elements to add must be greater than 0.\n"));
                return OldNum;
            }

            if ((ArrayNum += Count) > ArrayMax)
            {
                ResizeGrow(OldNum);
            }
            return OldNum;
        }

Here the ArrayNum is added before the memory get allocated. If that that time the loader access it it will assume the data pointer is valid and read it. Sure the structure is not thread safe anyway but maybe a workaround would improve if the order get changed? That is assuming no Add from the unreal engine as they will still use the old order.

Hmmm.. no it's still bad because the ArrayNum gets to an uninitilized value which will be used for function call.. should add the num only when the data is available.

UE4SS commented 5 months ago

At least one problem I'm seeing here is:

        FORCEINLINE SizeType AddUninitialized(SizeType Count = 1)
        {
            CheckInvariants();
            const SizeType OldNum = ArrayNum;
            if (Count < 0)
            {
                Output::send<LogLevel::Warning>(STR("Number of elements to add must be greater than 0.\n"));
                return OldNum;
            }

            if ((ArrayNum += Count) > ArrayMax)
            {
                ResizeGrow(OldNum);
            }
            return OldNum;
        }

Here the ArrayNum is added before the memory get allocated. If that that time the loader access it it will assume the data pointer is valid and read it. Sure the structure is not thread safe anyway but maybe a workaround would improve if the order get changed? That is assuming no Add from the unreal engine as they will still use the old order.

Hmmm.. no it's still bad because the ArrayNum gets to an uninitilized value which will be used for function call.. should add the num only when the data is available.

I don't think the solution is to make thread related modifications to TArray because introducing such things could cause performance problems even when thread safety isn't a concern, and as far as I know UE itself doesn't do thread safety stuff in TArray either.

I do think it's a good idea to only modify ArrayNum after the data pointer has been updated though, and I don't know why they don't do that in the actual UE code. I'll also say that there's always been an internal push from various directions to keep the code in the Unreal repo as close to the real UE code as possible so I don't know if this kind of change will happen.

Yangff commented 5 months ago

I do think it's a good idea to only modify ArrayNum after the data pointer has been updated though, and I don't know why they don't do that in the actual UE code.

Well, normally nobody assume a container should be thread safe, unless specified. So they can write the code however they want and that's okay.

At least for this one is because unreal is missing a lock here so not much we can do for a data structure that could be access by multiple threads even within UE. Maybe the only actual safe time to change it at this point is to wait until unreal's GC is running so that it will not create any new uobject. But that's still tricky.

As a workaround, I think one may work is to use TArray::Append(TArray) instead.

        /**
         * Appends the specified array to this array.
         *
         * @param Source The array to append.
         * @see Add, Insert
         */
        template <typename OtherElementType, typename OtherAllocator>
        void Append(TArray<OtherElementType, OtherAllocator>&& Source)
        {
            if (!((void*)this != (void*)&Source)) { Output::send<LogLevel::Error>(STR("Unable to append an array it itself.\n")); return; }

            SizeType SourceCount = Source.Num();

            // Do nothing if the source is empty.
            if (!SourceCount)
            {
                return;
            }

            // Allocate memory for the new elements.
            Reserve(ArrayNum + SourceCount);
            RelocateConstructItems<ElementType>(GetData() + ArrayNum, Source.GetData(), SourceCount);
            Source.ArrayNum = 0;

            ArrayNum += SourceCount;
        }

Here they reserve the space first, place the item and add the count at the end.

Still, even we're not changing any code it will be kind of depend on its internal behaviour. But, UE5 is there so this should reduce the chance for some of the crashing.. I guess.

For this I created the PR here https://github.com/Re-UE4SS/UEPseudo/pull/72

Yangff commented 5 months ago

I did some tests on change it from Add() to Append() in Derived427.cpp which is used by UE5.1.1. Looks that it does reduced the frequency of crash.. normally it crashes 1/15 and I tried to boot it 35 times and no crash observed.

Another crash on the hook point is less frequent. Ultimately caused by not suspend all threads, but also bug inside the polyhook implementation.

Snipaste_2024-01-31_01-38-38 Snipaste_2024-01-31_01-39-13 Snipaste_2024-01-31_01-39-59

That is not going to be a fix, because even with that there are sill chance that you hook during the normal execution, but it should reduce some possibility (I'd say maybe reduced 50% somehow?) but that really depends on how CPU handles I-cache and etc.. but at least that's when it's sequential execution. When it executes a jmp, there is longer dealy for the pipeline to fetch and etc... where you have longer gap for racing and sync.

The chances of an initial crash are about 1/8 to 1/12, in my case. By replacing add with append, that's about 1/25~1/30. and after changing the order I haven't encountered a crash in about 45 tries, my expectation was that this would make the crash less than 1%, but I don't have that much energy left to test it.

Yangff commented 5 months ago

As for the polyhook, idk what your plan for it, but I'd suggest to apply following patch:

diff --git a/polyhook2/Instruction.hpp b/polyhook2/Instruction.hpp
index dde73eb..e95788d 100644
--- a/polyhook2/Instruction.hpp
+++ b/polyhook2/Instruction.hpp
@@ -452,7 +452,7 @@ inline PLH::insts_t makex64MinimumJump(const uint64_t address, const uint64_t de
        std::stringstream ss;
        ss << std::hex << "[" << destHolder << "] ->" << destination;

-       return { Instruction(address, disp, 2, true, true, bytes, "jmp", ss.str(), Mode::x64),  specialDest };
+       return { specialDest, Instruction(address, disp, 2, true, true, bytes, "jmp", ss.str(), Mode::x64) };
 }

 inline PLH::insts_t makex86Jmp(const uint64_t address, const uint64_t destination) {

It's a dirty workaround that may just benefit the palworld as other game use different strategy may have other problem but I'm not going to fix polyhook anyway.. It simply works and you should always avoid to have a chance to refer to uninitialized data as a simple principle.. If that looks good plz integrate it into your current cmake script..

UE4SS commented 5 months ago

As for the polyhook, idk what your plan for it, but I'd suggest to apply following patch:

diff --git a/polyhook2/Instruction.hpp b/polyhook2/Instruction.hpp
index dde73eb..e95788d 100644
--- a/polyhook2/Instruction.hpp
+++ b/polyhook2/Instruction.hpp
@@ -452,7 +452,7 @@ inline PLH::insts_t makex64MinimumJump(const uint64_t address, const uint64_t de
        std::stringstream ss;
        ss << std::hex << "[" << destHolder << "] ->" << destination;

-       return { Instruction(address, disp, 2, true, true, bytes, "jmp", ss.str(), Mode::x64),  specialDest };
+       return { specialDest, Instruction(address, disp, 2, true, true, bytes, "jmp", ss.str(), Mode::x64) };
 }

 inline PLH::insts_t makex86Jmp(const uint64_t address, const uint64_t destination) {

It's a dirty workaround that may just benefit the palworld as other game use different strategy may have other problem but I'm not going to fix polyhook anyway.. It simply works and you should always avoid to have a chance to refer to uninitialized data as a simple principle.. If that looks good plz integrate it into your current cmake script..

We have an open pull request for changing from PolyHook to safetyhook, but it's been sitting there for some time now and I don't know the status. This might not be a problem in safetyhook but either way we should decide if we should switch or not before making changes in the hooking library. @localcc @narknon Any update on the status of the safetyhook PR ?

narknon commented 5 months ago

Last thing I heard on safetyhook is that it was still awaiting a PR from pray that'd fix one of the major problems. I'll check if he ever PR'd

UE4SS commented 5 months ago

This was partially fixed by https://github.com/Re-UE4SS/UEPseudo/pull/72 I'd like to keep this issue open until it's been fully fixed, which will hopefully be if/when we switch to safetyhook, otherwise we'll probably end up forking PolyHook and submit a fix for this, assuming it's not already fixed since we're using an old version of PolyHook from May 2023.

Yangff commented 5 months ago

This was partially fixed by Re-UE4SS/UEPseudo#72 I'd like to keep this issue open until it's been fully fixed, which will hopefully be if/when we switch to safetyhook, otherwise we'll probably end up forking PolyHook and submit a fix for this, assuming it's not already fixed since we're using an old version of PolyHook from May 2023.

PolyHook just merged the PR to fix that issue. https://github.com/stevemk14ebr/PolyHook_2_0/pull/196 but if you want to keep using the old version for now, add a PATCH_COMMAND may be easier since the're also using a higher version of Zydis.

UE4SS commented 5 months ago

This was partially fixed by Re-UE4SS/UEPseudo#72 I'd like to keep this issue open until it's been fully fixed, which will hopefully be if/when we switch to safetyhook, otherwise we'll probably end up forking PolyHook and submit a fix for this, assuming it's not already fixed since we're using an old version of PolyHook from May 2023.

PolyHook just merged the PR to fix that issue. stevemk14ebr/PolyHook_2_0#196 but if you want to keep using the old version for now, add a PATCH_COMMAND may be easier since the're also using a higher version of Zydis.

This is great but this will have to wait a bit probably because of Zydis. We need to investigate the amount of effort (if any) it will take to update Zydis because they've been known to break a lot of stuff in their updates.

Since there hasn't been a response regarding safetyhook, I'm going to assume it's dead for now and I'll hopefully soon be able to look into updating PolyHook.

Regarding PATCH_COMMAND, I don't actually know what that is and it would take time to figure out what it is and how to use it and I'd rather wait and see if I have time to investigate Zydis in the next couple of days instead.

Yangff commented 5 months ago

This was partially fixed by Re-UE4SS/UEPseudo#72 I'd like to keep this issue open until it's been fully fixed, which will hopefully be if/when we switch to safetyhook, otherwise we'll probably end up forking PolyHook and submit a fix for this, assuming it's not already fixed since we're using an old version of PolyHook from May 2023.

PolyHook just merged the PR to fix that issue. stevemk14ebr/PolyHook_2_0#196 but if you want to keep using the old version for now, add a PATCH_COMMAND may be easier since the're also using a higher version of Zydis.

This is great but this will have to wait a bit probably because of Zydis. We need to investigate the amount of effort (if any) it will take to update Zydis because they've been known to break a lot of stuff in their updates.

Since there hasn't been a response regarding safetyhook, I'm going to assume it's dead for now and I'll hopefully soon be able to look into updating PolyHook.

Regarding PATCH_COMMAND, I don't actually know what that is and it would take time to figure out what it is and how to use it and I'd rather wait and see if I have time to investigate Zydis in the next couple of days instead.

The breaking API change from Zydis is that ZydisDecodedInstruction will not contain operands anymore, ZydisDecoderDecodeBuffer has been replaced by ZydisDecoderDecodeFull which takes an extra parameter that returns operands. There are 5 lines in UEPseudo and 1 in ASMHelper being affected by this. But can't say if there are any other changes that's beyond the interface.

PATCH_COMMAND is you can put the patch I sent above in the polyhook2 folder in this project and change:

FetchContent_Declare(PolyHook2
    GIT_REPOSITORY git@github.com:stevemk14ebr/PolyHook_2_0.git
    GIT_TAG 5eb9b64af251d0790af108cdc1274d83a952ded7
    GIT_PROGRESS ON)

to

FetchContent_Declare(PolyHook2
    GIT_REPOSITORY git@github.com:stevemk14ebr/PolyHook_2_0.git
    GIT_TAG 5eb9b64af251d0790af108cdc1274d83a952ded7
    PATCH_COMMAND git apply ${CMAKE_CURRENT_SOURCE_DIR}/PolyHook2/00-fixhookorder.patch
    UPDATE_DISCONNECTED 1
    GIT_PROGRESS ON)

so that the patch gets appied when the repo checkout by cmake.

UE4SS commented 5 months ago

Decide to rip the band-aid off and just do it. See https://github.com/Re-UE4SS/UEPseudo/pull/76 and https://github.com/UE4SS-RE/RE-UE4SS/pull/362 Both need to be merged for this to be fixed.