getnamo / TensorFlow-Unreal

TensorFlow plugin for the Unreal Engine.

Multithreading frequent calls is crashing #46

Open Envenger opened 5 years ago

Envenger commented 5 years ago

If I call the JSON input function about once every second, it crashes the game and produces a large Python log. This happens on the CPU version of the plugin, but I don't call any TensorFlow commands, so I doubt that is the cause.

My JSON input function doesn't do anything for now; it just returns an empty string.

    def onJsonInput(self, jsonInput):
        result = ""
        return result

This crash doesn't occur with multithreading turned off. I have attached the crash log: ShooterAIProject-backup-2019.04.15-14.17.37.log

Envenger commented 5 years ago

Do you want any help replicating the bug? I can share a small project with you that reproduces it.

getnamo commented 5 years ago

That would be helpful! I suspect a race condition is occurring, which would be odd because Python uses a global interpreter lock. Perhaps I'm not doing something quite right.

Note to self: the C++ function used for this is at https://github.com/getnamo/UnrealEnginePython/blob/master/Source/UnrealEnginePython/Private/UEPyEngine.cpp#L814

Note 2: this is new: https://github.com/20tab/UnrealEnginePython/blob/0393f40181988789eeec95d1cd9d6eec811ec2a2/android/python27/include/ceval.h#L79. Do we need to wrap our function with it?

magomedb commented 4 years ago

I have a similar issue: with multithreading on, similarly frequent calls to the JSON input function cause the following error to appear repeatedly in the output log.

    File "C:\Users\User\Documents\Unreal Projects\IAF\IAF\Plugins\tensorflow-ue4\Content\Scripts\TensorFlowComponent.py", line 116, in json_input_blocking
      if(self.uobject.ShouldUseMultithreading):
    Exception: PyUObject is in invalid state

The engine crashes after a while, and the crash log lists python36, kernel32 and ntdll. I can also see memory usage grow gradually while it is running, until the eventual crash. Turning multithreading off removes the errors, prevents the crash, and stops the memory growth.

Assuming this is the same problem as the original report, wouldn't this indicate that the problem is a memory leak?

getnamo commented 4 years ago

With multithreading on, each call to JSON input is handled by a different thread. What is likely happening is that a second thread touches the same data before the first one is done with it, which corrupts memory through out-of-bounds access (either despite the GIL or in the TF layer); the continued calls then keep leaking (it should have crashed earlier). Probably the best way forward is to add a lock so that JSON inputs are queued one at a time, with a new input not starting until the previous one is fully done. Alternatively, a single JSON input thread with an internal event queue would be the more efficient solution, but I'm unsure how to do that properly at the moment.
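A minimal sketch of both ideas in plain Python, for illustration only. None of this is the plugin's actual code: `JsonInputWorker`, `locked_json_input`, `handler` and `on_result` are hypothetical names, and in the engine the result would presumably still need to be handed back to the game thread.

```python
import queue
import threading


class JsonInputWorker:
    """Runs a handler for queued JSON inputs on one background thread, in order."""

    def __init__(self, handler):
        self._handler = handler        # hypothetical: e.g. the user's onJsonInput
        self._inputs = queue.Queue()   # thread-safe FIFO of pending inputs
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, json_input, on_result):
        """Queue one input; on_result(result) is called from the worker thread."""
        self._inputs.put((json_input, on_result))

    def stop(self):
        """Ask the worker to exit after finishing everything already queued."""
        self._inputs.put(None)
        self._thread.join()

    def _run(self):
        while True:
            item = self._inputs.get()
            if item is None:           # sentinel pushed by stop()
                break
            json_input, on_result = item
            # The next input is not started until this call has returned.
            on_result(self._handler(json_input))


_input_lock = threading.Lock()


def locked_json_input(handler, json_input):
    """Simpler variant: any thread may call this, but only one runs handler at a time."""
    with _input_lock:
        return handler(json_input)
```

With the worker, every caller uses `submit` instead of spawning a new thread per input, so at most one input is in flight. The lock variant keeps one thread per call but serializes the handler itself. This sketch does not handle exceptions raised by the handler; an uncaught one would end the worker thread.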

Thanks for providing more examples of the bug, it does narrow down the potential source.

Uperstream commented 4 years ago

I have encountered the same problem. I also send a JSON input event every tick.

    LogPython: Error: Exception in thread Thread-2291:
    Traceback (most recent call last):
      File "threading.py", line 916, in _bootstrap_inner
      File "threading.py", line 864, in run
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Content\Scripts\upythread.py", line 19, in backgroundAction
        result = action(actionArgs)
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\tensorflow-ue4\Content\Scripts\TensorFlowComponent.py", line 116, in json_input_blocking
        if(self.uobject.ShouldUseMultithreading):
    Exception: PyUObject is in invalid state

I got a different error when I turned multithreading off:

    LogPython: Error: Variable pi/dense/kernel already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
        self._traceback = tf_stack.extract_stack()
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
        op_def=op_def)
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
        return func(*args, **kwargs)
    LogPython: Error: Traceback (most recent call last):
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\tensorflow-ue4\Content\Scripts\TensorFlowComponent.py", line 143, in setup_complete
        self.train()
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\tensorflow-ue4\Content\Scripts\TensorFlowComponent.py", line 73, in train
        self.train_blocking()
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\tensorflow-ue4\Content\Scripts\TensorFlowComponent.py", line 152, in train_blocking
        self.trained = self.tfapi.onBeginTraining()
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Content\Scripts\DPPOnothread.py", line 194, in onBeginTraining
        self.model = PPO(epMax, 8, 5, 1e-4, 2e-4, self.que)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Content\Scripts\DPPOnothread.py", line 36, in __init__
        pi, pi_params = self._build_anet('pi', trainable=True)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Content\Scripts\DPPOnothread.py", line 68, in _build_anet
        l1 = tf.layers.dense(self.tfs, 200, tf.nn.relu, trainable=trainable)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\layers\core.py", line 184, in dense
        return layer.apply(inputs)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 817, in apply
        return self.__call__(inputs, *args, **kwargs)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\layers\base.py", line 374, in __call__
        outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 746, in __call__
        self.build(input_shapes)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\keras\layers\core.py", line 944, in build
        trainable=True)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\layers\base.py", line 288, in add_weight
        getter=vs.get_variable)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 609, in add_weight
        aggregation=aggregation)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\training\checkpointable\base.py", line 639, in _add_variable_with_custom_getter
        **kwargs_for_getter)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\ops\variable_scope.py", line 1487, in get_variable
        aggregation=aggregation)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\ops\variable_scope.py", line 1237, in get_variable
        aggregation=aggregation)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\ops\variable_scope.py", line 540, in get_variable
        aggregation=aggregation)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\ops\variable_scope.py", line 492, in _true_getter
        aggregation=aggregation)
    LogPython: Error: File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\ops\variable_scope.py", line 861, in _get_single_variable
        name, "".join(traceback.format_list(tb))))
    LogPython: Error: ValueError: Variable pi/dense/kernel already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
        self._traceback = tf_stack.extract_stack()
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
        op_def=op_def)
      File "F:\ProjectFile\UnrealProject\AItest\Plugins\UnrealEnginePython\Binaries\Win64\Lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
        return func(*args, **kwargs)

I have seen that this "PyUObject is in invalid state" error is mentioned in the original repository's documentation, but I don't really understand it. The link is below.

https://github.com/20tab/UnrealEnginePython/blob/master/docs/MemoryManagement.md
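Both tracebacks above fail while reading `self.uobject.ShouldUseMultithreading` from a background thread. One defensive pattern, sketched here under assumptions rather than taken from the plugin, is to read engine-owned properties once on the calling thread and pass only plain Python values to the worker. `FakeUObject` and `start_json_input` are hypothetical stand-ins, and whether this would avoid the error in the real plugin is untested.

```python
import threading


class FakeUObject:
    """Stand-in for the engine-owned object; hypothetical, for illustration only."""
    ShouldUseMultithreading = True


def start_json_input(uobject, handler, json_input, on_result):
    # Read engine-owned state here, on the calling thread, and keep a plain bool.
    use_threads = bool(uobject.ShouldUseMultithreading)

    def work():
        # From here on only plain Python values are touched, never uobject.
        on_result(handler(json_input))

    if use_threads:
        thread = threading.Thread(target=work)
        thread.start()
        return thread   # caller may join() it
    work()              # multithreading off: run inline on the calling thread
    return None
```

The design choice is that the background thread never dereferences the engine object, so its validity at that later point no longer matters for this check.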

Uperstream commented 4 years ago

I have fixed the error that appeared after I turned multithreading off. Everything works with multithreading turned off.