colyseus / colyseus-unity-sdk

⚔ Colyseus Multiplayer SDK for Unity
https://docs.colyseus.io/getting-started/unity-sdk/
MIT License
371 stars 102 forks source link

ColyseusRoom instance occassionally becoming null #164

Open tonygiang opened 3 years ago

tonygiang commented 3 years ago

I have just witnessed with my eyes a situation where the ColyseusRoom instance returned by ColyseusClient.JoinById task becomes null at some point after the task finishes running. Prior to this null check, the ColyseusRoom instance was able to run a series of ColyseusRoom.OnMessage calls without exception. This doesn't always happen and cannot be consistently reproduced, thus I suspect a race condition or a similar circumstance.

To be more specific: We saved the ColyseusRoom instance in a ScriptableObject asset that is referenced across multiple scenes. It was created when Scene A is running, then it receives 2 messages from the server, then Scene B is loaded, then a script in Scene B runs that null-checks this ColyseusRoom instance and that's when we found that it was null.

Additional information: It happened again, but this time, after the null-check we tried to await UniTask.WaitUntil(() => roomInstance != null) and it never runs past that. UniTask is a class from this package.

endel commented 3 years ago

Hi @tonygiang, thanks for letting us know! - Have you experienced this in Unity editor, native build, or WebGL?

UniTask can be a clue here. I personally never used it along with Colyseus, but they should be compatible.

tonygiang commented 3 years ago

Hi @tonygiang, thanks for letting us know! - Have you experienced this in Unity editor, native build, or WebGL?

UniTask can be a clue here. I personally never used it along with Colyseus, but they should be compatible.

This issue has happened with both Unity Editor and native build. We do not target WebGL. This issue happens regardless of whether there is UniTask or not. We detected a null room instance before we even put in a UniTask to await until it stops being null.

tonygiang commented 3 years ago

Upon diving into the Unity client source code, there's one part that concerns me. This is the full source of ColyseusClient.ConsumeSeatReservation:

        public async Task<ColyseusRoom<T>> ConsumeSeatReservation<T>(ColyseusMatchMakeResponse response,
            Dictionary<string, string> headers = null)
        {
            ColyseusRoom<T> room = new ColyseusRoom<T>(response.room.name)
            {
                Id = response.room.roomId,
                SessionId = response.sessionId
            };

            Dictionary<string, object> queryString = new Dictionary<string, object>();
            queryString.Add("sessionId", room.SessionId);

            room.SetConnection(CreateConnection(response.room.processId + "/" + room.Id, queryString, headers));

            TaskCompletionSource<ColyseusRoom<T>> tcs = new TaskCompletionSource<ColyseusRoom<T>>();

            void OnError(int code, string message)
            {
                room.OnError -= OnError;
                tcs.SetException(new CSAMatchMakeException(code, message));
            }

            void OnJoin()
            {
                room.OnError -= OnError;
                tcs.TrySetResult(room);
            }

            room.OnError += OnError;
            room.OnJoin += OnJoin;

            onAddRoom?.Invoke(room);

#pragma warning disable 4014
            room.Connect();
#pragma warning restore 4014

            return await tcs.Task;
        }

There is something that bothers me (and I'm not very sure about this since this is a bit more low-level than I'm comfortable with, but I still have to throw this theory out here).

So the TaskCompletionSource here supposely will finish its associated Task the first time it has its result set. Makes sense. This happens with the OnJoin callback, which in turn is invoked when the room first receives a message containing ColyseusProtocol.JOIN_ROOM. Now, the part that concerns me is the internal method passed into OnJoin is not self-removing (it only removes OnError) and in the case where a ColyseusProtocol.JOIN_ROOM message is sent twice for the same room (for whatever reason - could be a series of bytes that just happens to match ColyseusProtocol.JOIN_ROOM), what would happen when it runs tcs.TrySetResult(room) again?

Will the internal OnJoin method even exist after ConsumeSeatReservation has ran its course the first time? (can't suppress my C++ instinct on this one) Could the room reference be disposed and become null the second time OnJoin is called? Could the null value propagate to the Result property of the tcs.Task when this happen?

I'm just spitballing every possible scenario here. It's so hard to think of another reason why the result of ColyseusClient.JoinById could turn null on its own.