kesslermaximilian opened 1 year ago
Probably this is a bug introduced in commit e74d6bdf here:

I'm not really sure what it does: we first query the t.Spectators attribute (I suppose this holds a list of all the spectators of a table?), but then we query the tables struct and ask for the tables that a user is spectating, even though we obtained this user as a spectator from our current table.

The tables.GetTablesUserSpectating method assumes that the caller holds the tables lock, and I suppose here the caller fails to do so (I haven't checked the full stack trace yet to verify where that call came from).
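Roughly, the contract I mean is something like this; a minimal sketch, where everything except the identifiers quoted above is an assumption on my part:

```go
package main

import "sync"

// Assumed shape of the tables container; every field name here is a guess.
type Tables struct {
	mutex      sync.Mutex
	spectating map[int][]uint64 // user ID -> IDs of the tables they spectate
}

// GetTablesUserSpectating reads the spectating map directly, so its implied
// contract is that the caller must already hold the tables mutex.
func (t *Tables) GetTablesUserSpectating(userID int) []uint64 {
	return t.spectating[userID]
}

// A well-behaved caller takes the lock around the call; the crash suggests
// that some call site skips this step.
func tablesSpectatedBy(t *Tables, userID int) []uint64 {
	t.mutex.Lock()
	defer t.mutex.Unlock()
	return t.GetTablesUserSpectating(userID)
}

func main() {
	ts := &Tables{spectating: map[int][]uint64{42: {7}}}
	_ = tablesSpectatedBy(ts, 42)
}
```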
I'm just in general confused about why we would need to look at the tables entries for this user at all to get the ActiveSpectators, and what this method is supposed to do:

- Does t.Spectators for some table have 'inactive' spectators in it that we somehow do not want to return here?
- Are the tables.spectating map and the t.Spectators attribute not kept in sync? I'd expect that a spectator has the table assigned in the map if and only if it is also an entry of the table itself.

Depending on what the answers to these questions are, the possible solutions here might be:

- t.Spectators
@Krycke
Yeah, the record of spectators on the table holds a list of all spectators that have been spectating since the table was created. Each spectator may or may not be spectating right now.

The only trustworthy record of whether a spectator is actively spectating right now is the record in tables that we get from the function tables.GetTablesUserSpectating(sp.UserID).

So now, from the point of view of the table, we need to go through the list of all spectators, and see if any of them are spectating a table right now, and if that table appears to be this one.
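In rough code it would look something like this; just a sketch reusing the Tables sketch from above, with the Table and Spectator types assumed rather than taken from the real code:

```go
// Everyone who has ever spectated stays in t.Spectators; whether they are
// active right now is decided only by the record in the tables struct.
type Spectator struct{ UserID int }

type Table struct {
	ID         uint64
	Spectators []*Spectator
}

func (t *Table) activeSpectators(tables *Tables) []*Spectator {
	active := make([]*Spectator, 0, len(t.Spectators))
	for _, sp := range t.Spectators {
		// Check whether this table is among the tables the user is
		// currently spectating according to the tables struct.
		for _, tableID := range tables.GetTablesUserSpectating(sp.UserID) {
			if tableID == t.ID {
				active = append(active, sp)
				break
			}
		}
	}
	return active
}
```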
This function in itself should not be the problem here; i have a hard time reading the stack trace on my phone, but i assume that there is a deadlock.

When i fixed this function, i was also confused about what we are locking and when. This function itself does not need the tables to be locked (if they are changed during checking it wouldn't matter), but this function uses functions that might or might not need a lock.

The problem with finding dead locks in the stack trace is that it's not this thread that might be the problem; it's the thread that has claimed the lock and is stuck somewhere that probably is the problem. But that one is not seen in the trace.
When i was trying to investigate the dead locks before, i remember that the following could occur: thread 1 wants to first acquire lock 1, and while holding that lock, it tries to acquire lock 2. The second lock is already held by a thread 2, so thread 1 gets stuck. At the same time a thread 3 is trying to acquire lock 1 and eventually times out. That's why we find the stack trace from the dead lock on thread 3, but i think the problem is actually that thread 2 won't let go of lock 2, or that thread 1 is trying to acquire a second lock before releasing the first one.
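As a generic illustration of that shape (not our actual code, just the textbook two-lock deadlock in Go):

```go
package main

import (
	"sync"
	"time"
)

func main() {
	var lock1, lock2 sync.Mutex

	// "thread 1": takes lock 1, then wants lock 2 before releasing lock 1.
	go func() {
		lock1.Lock()
		time.Sleep(10 * time.Millisecond) // widen the window
		lock2.Lock()                      // stuck: "thread 2" holds lock 2
		lock2.Unlock()
		lock1.Unlock()
	}()

	// "thread 2": takes lock 2, then wants lock 1 -> opposite order.
	go func() {
		lock2.Lock()
		time.Sleep(10 * time.Millisecond)
		lock1.Lock() // stuck: "thread 1" holds lock 1
		lock1.Unlock()
		lock2.Unlock()
	}()

	time.Sleep(50 * time.Millisecond)

	// "thread 3": an innocent bystander that only wants lock 1; this is
	// the goroutine whose stack trace you end up staring at.
	lock1.Lock()
	lock1.Unlock()
}
```

Running this, the Go runtime aborts with "fatal error: all goroutines are asleep - deadlock!".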
Either way i think all locks need to be checked in all functions, because something is strange with them.
I think the function here should definitely have the lock for the tables struct when being executed. I can see why we wouldn't care if the spectators list changes while being read, but (presumably) these reads are not atomic, so stuff might go wrong if another thread writes these at the same time. The error message from the trace explicitly says 'concurrent read and write', so it could very much be that the stack trace we see is the thread reading from the tables struct, while some other thread (hopefully holding the lock for that) is currently writing it because some spectator joined.
I also don't see how this resembles a deadlock at the moment. I don't know Go much, but for a deadlock, I would expect a different error message, because there is actually no read or write occurring if a deadlock happens.
Also, do you know why tables.GetTablesUserSpectating is the only reliable source for the active spectators? If a user leaves a table, why would we not update the list at the table itself, i.e. table.Spectators? To me it would seem that this should be what we want to have (do we ever need these past spectators?), so probably we should just fix that and not look at the tables struct here at all.
Regarding possible deadlocks in the Server code: Sure, there might be some, I have no idea and it would be quite a bit of work to check them all.
Ok, as i said, i couldn't read the error message on my phone.
Before my changes that was actually how it worked: the spectator was removed when the spectator left. But the problem is that the messages written by the spectator are also stored in this struct, so by removing it we also lose what a spectator has written. That's why i changed it to active and inactive spectators.

This was to solve at least two things: if a spectator leaves and comes back (even just reloads the page), the messages should stay. This is extra important when shadowing a game. Secondly, when the game has finished, the spectators are remade and then the messages are lost.

In the beginning of the redesign i was creating a separate list of active spectators, but then it could diverge from the table list. So it's better to have the information in only one place.
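As a sketch of that design (the Messages field and the helper below are invented for illustration, not the real code):

```go
// The table-side record also carries what the spectator wrote, so it has
// to survive the user leaving; whether they are active right now is
// tracked elsewhere (the tables.spectating map).
type Spectator struct {
	UserID   int
	Messages []string
}

type Table struct {
	ID         uint64
	Spectators []*Spectator
}

// On join or rejoin, reuse the existing record instead of appending a new
// one, so a returning spectator (or a page reload) keeps their messages.
func (t *Table) spectatorRecord(userID int) *Spectator {
	for _, sp := range t.Spectators {
		if sp.UserID == userID {
			return sp
		}
	}
	sp := &Spectator{UserID: userID}
	t.Spectators = append(t.Spectators, sp)
	return sp
}
```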
> but i assume that there is a deadlock.
To be clear, there is not a deadlock. Historically, when the server crashes, it is due to a deadlock, but not this time. This time, the error message is this:
```
fatal error: concurrent map read and map write
```
This means that two goroutines are attempting to read and write to a map at the same time. Maps in Golang are not safe for concurrent use, so you are supposed to lock the corresponding map mutex first before doing any reads/writes.
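The usual pattern looks something like this (a generic illustration, not the server's actual types):

```go
package main

import "sync"

// Pair the map with a mutex and take the lock for every read and write.
// A sync.RWMutex lets concurrent readers through while still excluding
// writers.
type SpectatingMap struct {
	mu sync.RWMutex
	m  map[int][]uint64
}

func (s *SpectatingMap) Get(userID int) []uint64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.m[userID]
}

func (s *SpectatingMap) Set(userID int, tables []uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[userID] = tables
}

func main() {
	s := &SpectatingMap{m: make(map[int][]uint64)}
	s.Set(42, []uint64{7})
	_ = s.Get(42)
}
```

Building with the -race flag (e.g. go run -race .) also makes Go report unsynchronized map accesses as data races at the moment they happen, which is easier to debug than the fatal error above.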
@Krycke do you have time to try and fix this this weekend? i don't want the server to crash again
Sorry, i don't have time for coding at the moment; I will probably not sit in front of a computer for the foreseeable future.
I'll revert the PR then
Unless @kesslermaximilian wants to take a look?
James posted this crash log on Discord: