Hanabi-Live / hanabi-live

A web server that allows people to play Hanab, a cooperative card game of logic and reasoning.
https://hanab.live
GNU General Public License v3.0

Server Crash: Concurrent read/write on map #2855

Open kesslermaximilian opened 1 year ago

kesslermaximilian commented 1 year ago

James posted this crash log on Discord:

{"level":"info","ts":"Jul  26 23:18:28","caller":"src/command_table_spectate.go:74","msg":"Table #323 (sudoku 3p) - Turn 9 - User \"downvoid\" spectated."}
fatal error: concurrent map read and map write

goroutine 2590689 [running]:
runtime.throw(0xf30606, 0x21)
    /usr/lib/go-1.13/src/runtime/panic.go:774 +0x72 fp=0xc0004e31e0 sp=0xc0004e31b0 pc=0x4306b2
runtime.mapaccess2_fast64(0xdc1ec0, 0xc0003a98c0, 0x1504, 0x0, 0x1)
    /usr/lib/go-1.13/src/runtime/map_fast64.go:61 +0x1b5 fp=0xc0004e3208 sp=0xc0004e31e0 pc=0x413785
main.(*Tables).GetTablesUserSpectating(...)
    /root/hanabi-live/server/src/tables.go:209
main.(*Table).ActiveSpectators(0xc001404000, 0xc0039985a0, 0xc0004e32c8, 0x45a9e0)
    /root/hanabi-live/server/src/table.go:436 +0x1aa fp=0xc0004e32b0 sp=0xc0004e3208 pc=0xcd872a
main.replayActionArrow(0xc006086380, 0xc00000c960, 0xc001404000)
    /root/hanabi-live/server/src/command_replay_action.go:138 +0x5f fp=0xc0004e3308 sp=0xc0004e32b0 pc=0xc78f8f
main.replayAction(0x109b580, 0xc002f084e0, 0xc006086380, 0xc00000c960, 0xc001404000)
    /root/hanabi-live/server/src/command_replay_action.go:76 +0xb5 fp=0xc0004e3340 sp=0xc0004e3308 pc=0xc78cd5
main.commandReplayAction(0x109b580, 0xc002f084e0, 0xc006086380, 0xc00000c960)
    /root/hanabi-live/server/src/command_replay_action.go:67 +0x209 fp=0xc0004e33e8 sp=0xc0004e3340 pc=0xc78ad9
main.websocketMessage(0xc006086340, 0xc000bb9800, 0x30, 0x600)
    /root/hanabi-live/server/src/websocket_message.go:110 +0x46f fp=0xc0004e3568 sp=0xc0004e33e8 pc=0xce456f
github.com/gabstv/melody.(*Session).readPump(0xc006086340)
    /root/go/pkg/mod/github.com/gabstv/melody@v1.0.2/session.go:133 +0x19f fp=0xc0004e35d8 sp=0xc0004e3568 pc=0xc5830f
github.com/gabstv/melody.(*Melody).HandleRequestWithKeys(0xc00035e060, 0x7f63c6fd0ed0, 0xc000a84800, 0xc000a85900, 0xc0045ade60, 0xc0004e3738, 0x2)
    /root/go/pkg/mod/github.com/gabstv/melody@v1.0.2/melody.go:192 +0x1d1 fp=0xc0004e3648 sp=0xc0004e35d8 pc=0xc56d51
main.httpWS(0xc000a84800)
    /root/hanabi-live/server/src/http_ws.go:114 +0x4e5 fp=0xc0004e3768 sp=0xc0004e3648 pc=0xcaf7f5
github.com/gin-gonic/gin.(*Context).Next(0xc000a84800)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/context.go:168 +0x3b fp=0xc0004e3788 sp=0xc0004e3768 pc=0x9b60ab
github.com/gin-contrib/sessions.Sessions.func1(0xc000a84800)
    /root/go/pkg/mod/github.com/gin-contrib/sessions@v0.0.4/sessions.go:54 +0x19e fp=0xc0004e3820 sp=0xc0004e3788 pc=0xc310de
github.com/gin-gonic/gin.(*Context).Next(0xc000a84800)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/context.go:168 +0x3b fp=0xc0004e3840 sp=0xc0004e3820 pc=0x9b60ab
github.com/gin-gonic/gin.CustomRecoveryWithWriter.func1(0xc000a84800)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/recovery.go:99 +0x6d fp=0xc0004e38b0 sp=0xc0004e3840 pc=0x9cc16d
github.com/gin-gonic/gin.(*Context).Next(0xc000a84800)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/context.go:168 +0x3b fp=0xc0004e38d0 sp=0xc0004e38b0 pc=0x9b60ab
github.com/gin-gonic/gin.LoggerWithConfig.func1(0xc000a84800)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/logger.go:241 +0xe1 fp=0xc0004e3a70 sp=0xc0004e38d0 pc=0x9cb2e1
github.com/gin-gonic/gin.(*Context).Next(0xc000a84800)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/context.go:168 +0x3b fp=0xc0004e3a90 sp=0xc0004e3a70 pc=0x9b60ab
github.com/gin-gonic/gin.(*Engine).handleHTTPRequest(0xc0003a0000, 0xc000a84800)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/gin.go:555 +0x637 fp=0xc0004e3be0 sp=0xc0004e3a90 pc=0x9c1527
github.com/gin-gonic/gin.(*Engine).ServeHTTP(0xc0003a0000, 0x10921c0, 0xc005f029a0, 0xc000a85900)
    /root/go/pkg/mod/github.com/gin-gonic/gin@v1.7.7/gin.go:511 +0x16c fp=0xc0004e3c18 sp=0xc0004e3be0 pc=0x9c0c5c
net/http.serverHandler.ServeHTTP(0xc000712380, 0x10921c0, 0xc005f029a0, 0xc000a85900)
    /usr/lib/go-1.13/src/net/http/server.go:2802 +0xa4 fp=0xc0004e3c48 sp=0xc0004e3c18 pc=0x6e8774
net/http.(*conn).serve(0xc00437d040, 0x109b4c0, 0xc005215300)
    /usr/lib/go-1.13/src/net/http/server.go:1890 +0x875 fp=0xc0004e3fc8 sp=0xc0004e3c48 pc=0x6e4015
runtime.goexit()
    /usr/lib/go-1.13/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc0004e3fd0 sp=0xc0004e3fc8 pc=0x45d461
created by net/http.(*Server).Serve
    /usr/lib/go-1.13/src/net/http/server.go:2928 +0x384
kesslermaximilian commented 1 year ago

This is probably a bug introduced in commit e74d6bdf here:

https://github.com/Hanabi-Live/hanabi-live/blob/dd932e3a9b44719df2d4f2b7e28e9ab4c67e2388/server/src/table.go#L433-L443

I'm not entirely sure what this code does: we first query the t.Spectators attribute (which I assume holds a list of all the spectators of a table?), but then we query the tables struct and ask which tables a given user is spectating, even though we already obtained that user as a spectator of the current table. The tables.GetTablesUserSpectating method assumes that the caller holds the tables lock, and I suppose the caller here fails to do so (I haven't checked the full stack trace yet to verify where that call came from).
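For reference, here is a minimal sketch of the locking contract that tables.GetTablesUserSpectating seems to expect. All types and field names below are assumptions for illustration only, not the actual server code:

```go
package main

import "sync"

// Hypothetical stand-in, not the real server type: a registry mapping
// user IDs to the IDs of the tables they are currently spectating.
type Tables struct {
	mutex      sync.RWMutex
	spectating map[int][]uint64
}

// GetTablesUserSpectating assumes the caller already holds t.mutex;
// the bare map access here races if a writer runs concurrently.
func (t *Tables) GetTablesUserSpectating(userID int) []uint64 {
	return t.spectating[userID]
}

// Safe call site: take (at least) a read lock around the call.
func tablesSpectatedBy(t *Tables, userID int) []uint64 {
	t.mutex.RLock()
	defer t.mutex.RUnlock()
	return t.GetTablesUserSpectating(userID)
}

func main() {
	t := &Tables{spectating: make(map[int][]uint64)}
	_ = tablesSpectatedBy(t, 42)
}
```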

More generally, I'm confused about why we would need to look at the tables entries for this user at all in order to compute the ActiveSpectators, and about what this method is supposed to do.

Depending on what the answers to these questions are, the possible solutions here might be:

Zamiell commented 1 year ago

@Krycke

Krycke commented 1 year ago

Yeah, the record of spectators on the table holds a list of all spectators that have spectated since the table was created. A given spectator may or may not be spectating right now.

The only trustworthy record of whether a spectator is actively spectating right now is the record in tables that we get from the function tables.GetTablesUserSpectating(sp.UserID).

So, from the point of view of the table, we need to go through the list of all spectators, see if any of them are spectating a table right now, and check whether the table they are spectating is this one.
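Roughly, that filtering could look like the following sketch. Every type here is a stand-in based on this discussion, not a verbatim copy of table.go:

```go
package main

// Illustrative stubs only, not the real server structs.
type Spectator struct{ UserID int }

type Table struct {
	ID         uint64
	Spectators []*Spectator // everyone who has ever spectated this table
}

type Tables struct {
	spectating map[int][]uint64 // userID -> IDs of tables spectated right now
}

func (ts *Tables) GetTablesUserSpectating(userID int) []uint64 {
	return ts.spectating[userID]
}

// ActiveSpectators keeps only the spectators whom the global registry
// says are spectating this particular table right now.
func (t *Table) ActiveSpectators(ts *Tables) []*Spectator {
	active := make([]*Spectator, 0)
	for _, sp := range t.Spectators {
		for _, tableID := range ts.GetTablesUserSpectating(sp.UserID) {
			if tableID == t.ID {
				active = append(active, sp)
				break
			}
		}
	}
	return active
}

func main() {
	ts := &Tables{spectating: map[int][]uint64{7: {1}}}
	table := &Table{ID: 1, Spectators: []*Spectator{{UserID: 7}, {UserID: 8}}}
	_ = table.ActiveSpectators(ts) // only user 7 is still active on table 1
}
```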

This function in itself should not be the problem here. I have a hard time reading the stack trace on my phone, but I assume that there is a deadlock.

When I fixed this function, I was also confused about what we are locking and when. The function itself does not need the tables to be locked (if they change during the check, it wouldn't matter), but it calls other functions that might or might not need a lock.

The problem with finding deadlocks in the stack trace is that the thread in the trace might not be the problem; the thread that has claimed the lock and is stuck somewhere else probably is. But that one is not seen in the trace.

When I was investigating the deadlocks before, I remember that the following could occur: thread 1 first acquires lock 1 and, while holding it, tries to acquire lock 2. The second lock is already held by thread 2, so thread 1 gets stuck. At the same time, thread 3 tries to acquire lock 1 and eventually times out. That's why we see the stack trace from the deadlock on thread 3, but I think the actual problem is that thread 2 won't let go of lock 2, or that thread 1 is trying to acquire a second lock before releasing the first one.
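As a toy illustration of that scenario (not the server code): two goroutines that take the same pair of locks in opposite order wedge each other, and a third goroutine that merely wants one of the locks is the one whose stack ends up looking suspicious.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy example only: lockA and lockB stand in for two of the server's mutexes.
var lockA, lockB sync.Mutex

func worker1() {
	lockA.Lock() // "thread 1": holds lock 1...
	time.Sleep(10 * time.Millisecond)
	lockB.Lock() // ...and blocks here forever waiting for lock 2
	lockB.Unlock()
	lockA.Unlock()
}

func worker2() {
	lockB.Lock() // "thread 2": holds lock 2...
	time.Sleep(10 * time.Millisecond)
	lockA.Lock() // ...and blocks here forever waiting for lock 1
	lockA.Unlock()
	lockB.Unlock()
}

func worker3() {
	lockA.Lock() // "thread 3": an innocent bystander that just wants lock 1,
	lockA.Unlock() // yet its stack is the one that looks guilty in a trace
	fmt.Println("worker3 finished") // never printed
}

func main() {
	go worker1()
	go worker2()
	go worker3()
	time.Sleep(time.Second) // all three workers are wedged at this point
}
```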

Either way, I think all locks need to be checked in all functions, because something is strange with them.

kesslermaximilian commented 1 year ago

I think the function here should definitely hold the lock for the tables struct while executing. I can see why we wouldn't care if the spectators list changes while it is being read, but (presumably) these reads are not atomic, so things might go wrong if another thread writes to them at the same time. The error message from the trace explicitly says 'concurrent read and write', so it could very well be that the stack trace we see is the thread reading from the tables struct, while some other thread (hopefully holding the lock for it) is currently writing to it because some spectator joined.

I also don't see how this resembles a deadlock. I don't know Go very well, but for a deadlock I would expect a different error message, because no read or write actually occurs when a deadlock happens.

Also, do you know why tables.GetTablesUserSpectating is the only reliable source for the active spectators? If a user leaves a table, why would we not update the list at the table itself, i.e. table.Spectators? To me it would seem that this should be what we want to have (do we ever need these past spectators?), so probably we should just fix that and not look at the tables struct here at all.

Regarding possible deadlocks in the server code: sure, there might be some. I have no idea, and it would be quite a bit of work to check them all.

Krycke commented 1 year ago

Ok, as I said, I couldn't really read the error message on my phone.

Before my changes, that was actually how it worked: the spectator was removed when they left. But the problem is that the messages written by the spectator are also stored in this struct, so by removing it we also lose what the spectator has written. That's why I changed it to active and inactive spectators.

This was to solve at least two things. First, if a spectator leaves and comes back (even just reloads the page), the messages should stay; this is extra important when shadowing a game. Secondly, when the game has finished, the spectators are recreated, and the messages would otherwise be lost.

Krycke commented 1 year ago

At the beginning of the redesign, I was creating a separate list of active spectators, but then it could diverge from the table list. So it's better to have the information in only one place.

Zamiell commented 1 year ago

but i assume that there is a deadlock.

To be clear, there is not a deadlock. Historically, when the server has crashed, it has been due to a deadlock, but not this time. This time, the error message is this:

fatal error: concurrent map read and map write

This means that two goroutines are attempting to read and write to a map at the same time. Maps in Golang are not safe for concurrent use, so you are supposed to lock the corresponding map mutex before doing any reads or writes.
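For completeness, a minimal sketch of that pattern (illustrative names, not the actual Tables implementation): every access to the map, reads included, goes through the same mutex.

```go
package main

import "sync"

// Illustrative only: a map of userID -> spectated table IDs guarded by an
// RWMutex. The crash above happens when a read like Get skips the lock
// while a write like Set is running in another goroutine.
type safeMap struct {
	mutex sync.RWMutex
	data  map[int][]uint64
}

func (m *safeMap) Set(userID int, tableIDs []uint64) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.data[userID] = tableIDs
}

func (m *safeMap) Get(userID int) []uint64 {
	m.mutex.RLock() // readers need the lock too; a bare map read races with Set
	defer m.mutex.RUnlock()
	return m.data[userID]
}

func main() {
	m := &safeMap{data: make(map[int][]uint64)}
	done := make(chan struct{})
	go func() {
		for i := 0; i < 1000; i++ {
			m.Set(i, []uint64{uint64(i)})
		}
		close(done)
	}()
	for i := 0; i < 1000; i++ {
		_ = m.Get(i)
	}
	<-done
}
```

Running the unguarded variant of such code with the race detector enabled (go run -race) would also flag the access as a data race before it ever escalates to the fatal error above.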

Zamiell commented 1 year ago

@Krycke do you have time to try and fix this this weekend? I don't want the server to crash again.

Krycke commented 1 year ago

Sorry, I don't have time for coding at the moment; I will probably not sit in front of a computer for the foreseeable future.

Zamiell commented 1 year ago

I'll revert the PR then

Zamiell commented 1 year ago

Unless @kesslermaximilian wants to take a look?