Sometimes when the Czar confirms the winner the next round never starts.

LJNielsenDk commented 10 years ago

Been noticing this sporadically for a few days, but it is happening a lot right now.

This happens to the whole game, not just one player not getting updated. The chat still works.

Reloading does nothing. The Czar leaving does nothing other than appoint a new Czar who then has the same cards to pick from but can't actually pick.

ajanata commented 10 years ago

Sometimes, for a reason I have yet to figure out, the background timer processor thread croaks, which kills basically everything that runs automatically. This includes kicking users for being idle, removing users who are no longer in contact with the server ('ping timeout', likely because they just closed the tab), and advancing game state. Once it happens, the only way to fix it is to restart the affected server. pyx-2 got into this state sometime around 1:14 PM PST today. I'm trying to do a bit of post-mortem on it and then will restart it.

Oddly, the thread is still in a thread dump, and I don't see any exceptions in the log. It looks like it might be stuck inside Hibernate, but I can't figure out why from this:

"timer-task" daemon prio=10 tid=0x00007f7d54117800 nid=0x6ede runnable [0x00007f7d8446c000]
   java.lang.Thread.State: RUNNABLE
        at net.socialgamer.cah.db.CardSet_$$_javassist_0.getHibernateLazyInitializer(CardSet_$$_javassist_0.java)
        at org.hibernate.engine.StatefulPersistenceContext.clear(StatefulPersistenceContext.java:212)
        at org.hibernate.impl.SessionImpl.cleanup(SessionImpl.java:615)
        at org.hibernate.impl.SessionImpl.close(SessionImpl.java:343)
        at net.socialgamer.cah.data.Game.removePlayer(Game.java:301)
        at net.socialgamer.cah.data.User.noLongerVaild(User.java:202)
        at net.socialgamer.cah.data.ConnectedUsers.checkForPingAndIdleTimeouts(ConnectedUsers.java:190)
        at net.socialgamer.cah.UserPing.process(UserPing.java:51)
        at net.socialgamer.cah.SafeTimerTask.run(SafeTimerTask.java:15)
        at java.util.TimerThread.mainLoop(Timer.java:534)
        at java.util.TimerThread.run(Timer.java:484)

This does not seem to be changing between multiple dumps.

I suppose when I have time I can try to figure out why this could be happening, but considering how rare it is (once every few weeks across the 3 servers), I'm not moving it to the top of the priority list, especially since there are other servers which still work.

uecasm commented 10 years ago

Could be a thread-safety thing. I don't think Hibernate sessions are safe to pass between threads, so if the timer is reusing an externally provided session instead of making its own one, it could be breaking it. (I haven't checked the code to verify this.)
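
To make the suggestion concrete, here is a minimal sketch of a timer task that opens and closes its own session instead of reusing one handed in from another thread. Everything in it is hypothetical: `FakeSession` is a stand-in, not the Hibernate API, and its owner-thread check exists only to make cross-thread use visible (a real Hibernate session does not check this for you). It is not taken from the PretendYoureXyzzy code.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class SessionPerTaskSketch {
    /** Hypothetical stand-in for a session object; NOT the real Hibernate API. */
    public static final class FakeSession implements AutoCloseable {
        public static final AtomicInteger open = new AtomicInteger();
        private final Thread owner = Thread.currentThread();

        public FakeSession() {
            open.incrementAndGet();
        }

        /** Fails loudly if called from a thread other than the one that created it. */
        public void use() {
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException("session used off its owning thread");
            }
        }

        @Override
        public void close() {
            open.decrementAndGet();
        }
    }

    /** Builds a timer task that gets its session from the factory when it runs and closes it afterwards. */
    public static Runnable timerTask(Supplier<FakeSession> factory) {
        return () -> {
            try (FakeSession session = factory.get()) {
                session.use();
            }
        };
    }

    /** Runs the task on a separate thread and reports whether it finished without an exception. */
    public static boolean runsCleanly(Runnable task) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread t = new Thread(() -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                failure.set(e);
            }
        }, "timer-task");
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return failure.get() == null;
    }
}
```

Passing `FakeSession::new` as the factory means the session is created on the timer thread itself, so the task runs cleanly. Passing a factory that returns a session created on the calling thread reproduces the kind of sharing described above, and the owner check trips.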

ajanata commented 10 years ago

Maybe. I'm working on refactoring Hibernate sessions a bit to reduce their lifetime, and we'll see if that helps with this. I'll also see about switching to a ScheduledThreadPoolExecutor so there are multiple threads capable of running timer events. That won't fix the problem, but it could delay the symptoms until all threads become locked.
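
A rough sketch of why a `ScheduledThreadPoolExecutor` only delays the symptoms: with more than one worker, a second timer event still runs while the first is stuck, but with a single worker (comparable to the lone `java.util.Timer` thread in the dump above) it does not. The class and method names are made up for illustration, the "stuck" task is simulated with a latch, and the 10 ms delay and 500 ms wait are arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TimerPoolSketch {
    /**
     * Schedules one task that blocks (standing in for the stuck timer task)
     * and then a second task. Returns true if the second task ran while the
     * first was still blocked.
     */
    public static boolean secondTaskRunsWhileFirstIsStuck(int poolSize) {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(poolSize);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch firstStarted = new CountDownLatch(1);
        CountDownLatch secondRan = new CountDownLatch(1);
        try {
            // First task: occupies a worker thread until released.
            pool.schedule(() -> {
                firstStarted.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, 0, TimeUnit.MILLISECONDS);
            firstStarted.await();

            // Second task: only runs if another worker thread is free.
            pool.schedule(() -> {
                secondRan.countDown();
            }, 10, TimeUnit.MILLISECONDS);
            return secondRan.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }
}
```

With a pool size of 2 the second task gets through; with a pool size of 1 it waits behind the blocked one. Once every worker is blocked the same way, the pool stalls just like the single timer thread did.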

ajanata commented 6 years ago

This doesn't seem to be happening anymore.

ajanata / PretendYoureXyzzy

Sometimes when the Czar confirms the winner the next round never starts. #89