ether / etherpad-lite

Etherpad: A modern really-real-time collaborative document editor.
http://docs.etherpad.org/
Apache License 2.0

Loss of Sync Between Whiteboards, Forcing Service Restart #1895

Closed rchllc closed 11 years ago

rchllc commented 11 years ago

Hi.

We've got a pretty substantial problem that we've been unable to diagnose. We're using etherpad-lite as a platform to allow tutor and student to connect in a virtual tutoring environment. We've been using etherpad for about a year, migrating up in version with each new release, but keeping the same database.

The issue is that, randomly, but several times a day, the whiteboard will lose its "sync." Etherpad thinks that it's up, doesn't throw any errors, and doesn't restart. But if you type in one whiteboard, that text doesn't show up in the other, connected whiteboard. And if this happens to one pad, all connected pads exhibit the same behavior.

When this happens, the only solution is to restart.

We've ensured that we're running the latest version of etherpad, node.js, and we've checked all dependencies. We've also checked everything that I can figure out how to check, and have confirmed that the server has more than enough memory and disk space. Honestly, we're completely out of ideas.

I'm happy to provide any diagnostic information, to try anything, or to do anything to get this working. This only started about a month ago (I believe with an upgrade to etherpad), but I can't figure out where to go from here.

Thanks for any help that can be provided.

Richard

JohnMcLear commented 11 years ago

I can confirm this does happen; I have witnessed it before. It's a horrible bug and needs resolving ASAP.


marcelklehr commented 11 years ago

So, I deduce the issue is on the server side. Could you provide some logs that show what happened when the connectivity died?

rchllc commented 11 years ago

Hi John and Marcel.

John, it's good to hear (in a really bad way, of course) that you've seen this issue, as it makes us seem less "alone" in this problem.

Marcel, the issue is server side. But, one of the things that is making this so hard to figure out is that etherpad "thinks" that everything is okay even after the loss of sync. Additionally, the logs aren't showing what is causing the disconnect.

For example, here is the log from when one disconnect occurred. All of these errors are client side, but during this period, the whiteboard for all pads became "unsynced."

[2013-09-04 20:05:16.188] [ERROR] console - CLIENT SIDE JAVASCRIPT ERROR: {"errorId":"Tm3vUq5laihE0cf3y0ix","msg":"TypeError: 'null' is not an object","url":"http://XXXXXXX/p/XXXPAD","linenumber":0,"userAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22"}
<te_export_918718535.html /tmp/eplite_export_918718535.doc doc               rror: convert /tmp/eplite_export_918718535.html /tmp/eplite_export_918718
[2013-09-04 21:16:33.960] [ERROR] console - CLIENT SIDE JAVASCRIPT ERROR: {"errorId":"VDkvUv9k9x5tZM0koJKY","msg":"ReferenceError: Can't find variable: chat","url":"http://XXXXXXX/p/XXXPAD","linenumber":1,"userAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.28.10 (KHTML, like Gecko) Version/6.0.3 Safari/536.28.10"}
[2013-09-04 21:21:34.853] [ERROR] console - CLIENT SIDE JAVASCRIPT ERROR: {"errorId":"fZFGpcidx71RPPYbJnYo","msg":"ReferenceError: Can't find variable: chat","url":"http://XXXXXXX/p/XXXPAD","linenumber":1,"userAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1"}

A day later, after another disconnect, we found the following log entry:

[2013-09-05 13:50:23.738] [INFO] console - DIAGNOSTIC-INFO: {"disconnectedMessage":"slowcommit","padId":"XXXPAD","socket":{"sessionid":"I7g6DzNknjTD5ALZGDOV","closeTimeout":60000,"heartbeatTimeout":60000,"heartbeatTimeoutTimer":5,"connectTimeoutTimer":8}}

Unfortunately, while the pads all disconnect several times a day, we can't find anything consistent in the logs. I am willing to provide, of course, any information that you need... and I do truly appreciate your help. Just let me know what to do to help you!
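One way to look for a pattern across many such entries is to group them by message. The following is a minimal sketch, assuming every entry follows the `[timestamp] [LEVEL] console - KIND: {json}` shape of the lines quoted above; the function names and the choice of Python are illustrative, not part of Etherpad. The sample lines are the ones from this comment with the `userAgent` field omitted for brevity.

```python
import json
import re
from collections import Counter

# Matches entries shaped like the ones quoted in this thread, e.g.
# [2013-09-04 20:05:16.188] [ERROR] console - CLIENT SIDE JAVASCRIPT ERROR: {...}
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] console - "
    r"(?P<kind>[A-Z -]+): (?P<payload>\{.*\})\s*$"
)


def parse_line(line):
    """Return (timestamp, level, kind, payload dict), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    try:
        payload = json.loads(match.group("payload"))
    except json.JSONDecodeError:
        return None
    return match.group("ts"), match.group("level"), match.group("kind"), payload


def summarize(lines):
    """Count entries per (kind, message) so that recurring messages stand out."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # truncated or unrelated lines are skipped
        _, _, kind, payload = parsed
        detail = payload.get("msg") or payload.get("disconnectedMessage") or "?"
        counts[(kind, detail)] += 1
    return counts


# Sample taken from the log excerpts above (userAgent omitted for brevity).
SAMPLE_LOG = """\
[2013-09-04 20:05:16.188] [ERROR] console - CLIENT SIDE JAVASCRIPT ERROR: {"errorId":"Tm3vUq5laihE0cf3y0ix","msg":"TypeError: 'null' is not an object","url":"http://XXXXXXX/p/XXXPAD","linenumber":0}
<te_export_918718535.html /tmp/eplite_export_918718535.doc doc
[2013-09-04 21:16:33.960] [ERROR] console - CLIENT SIDE JAVASCRIPT ERROR: {"errorId":"VDkvUv9k9x5tZM0koJKY","msg":"ReferenceError: Can't find variable: chat","url":"http://XXXXXXX/p/XXXPAD","linenumber":1}
[2013-09-04 21:21:34.853] [ERROR] console - CLIENT SIDE JAVASCRIPT ERROR: {"errorId":"fZFGpcidx71RPPYbJnYo","msg":"ReferenceError: Can't find variable: chat","url":"http://XXXXXXX/p/XXXPAD","linenumber":1}
[2013-09-05 13:50:23.738] [INFO] console - DIAGNOSTIC-INFO: {"disconnectedMessage":"slowcommit","padId":"XXXPAD","socket":{"sessionid":"I7g6DzNknjTD5ALZGDOV","closeTimeout":60000,"heartbeatTimeout":60000,"heartbeatTimeoutTimer":5,"connectTimeoutTimer":8}}
""".splitlines()

if __name__ == "__main__":
    for (kind, detail), count in summarize(SAMPLE_LOG).most_common():
        print(count, kind, detail)
```

Running the same grouping over the full log around each loss of sync would show whether any message, such as the `slowcommit` disconnect, recurs near every incident.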

Richard

ldidry commented 11 years ago

Hi. We, at the Framasoft association, have exactly the same problem!

We used to run an old version of etherpad and node.js and wanted to update. A server crash made us reinstall everything in a hurry, and we did the update at the same time (because we wanted to take advantage of the forced downtime).

So, we got the problem and thought it was due to our large number of pads (more than 55,000 at http://lite.framapad.org) and our MySQL database (we encountered some scalability problems with MySQL on other apps).

We decided to create other instances of etherpad, with MongoDB instead. It seemed to work, but we encountered the problem again later. We thought that the problem was due to Apache as reverse proxy and switched to Nginx instead. Again, it looked good for a few hours/days (don't remember exactly). Then we switched to Varnish; it looked good again for a few days, and today: the bug again.

I was planning to open this issue in the evening (didn't see this one, although I looked over and over).

So, if you need any help, testing, logs, anything that can help you troubleshoot this damn bug, I'll be happy to contribute (unfortunately, I'm not very comfortable with node.js programming).

marcelklehr commented 11 years ago

It would probably help if you could do a git bisect.

rchllc commented 11 years ago

Hi Marcel.

I have no idea how to do a "git bisect." I'm happy to do so, though, so if you can point me in the right direction, I'll get to work. I'm willing to do about anything to resolve this issue!

Richard

marcelklehr commented 11 years ago

git bisect is a tool provided by git that helps you find the foul commits in your history using binary search.

You just type in git bisect and it'll tell you what to do. The general idea is that you specify a commit that doesn't have the bug in it and a commit that definitely has the bug in it. It will then check out various commits between those two, and ask you whether each is a 'good one' or a 'bad one'. Eventually it'll figure out the commit introducing the bug for you. (If you have problems, just ask or google it.)
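The binary search described above can be sketched in a few lines. This is only an illustration over a linear list of made-up commit names, assuming the oldest commit is known good, the newest is known bad, and the bug stays once introduced. It is not how git implements bisect (real histories are graphs, and git chooses which commit to test). In a real session the subcommands are `git bisect start`, `git bisect good`, `git bisect bad`, and `git bisect reset`.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first bad commit, number of commits tested).

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    good, bad = 0, len(commits) - 1  # indices of a known-good and a known-bad commit
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):
            bad = mid   # the bug was introduced at mid or earlier
        else:
            good = mid  # the bug was introduced after mid
    return commits[bad], steps


# Hypothetical history of 32 commits in which the bug first appears in c21.
commits = [f"c{i:02d}" for i in range(32)]
first_bad, steps = bisect_first_bad(commits, lambda c: c >= "c21")

if __name__ == "__main__":
    print(first_bad, steps)
```

Each answer halves the remaining range, so the number of commits to test grows with the logarithm of the history length rather than with the length itself.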

JohnMcLear commented 11 years ago

I think I did a video tutorial on how to use git bisect on etherpad; search online.

ldidry commented 11 years ago

Well, the problem is that the bug doesn't appear all the time.

I'm trying to git bisect, but the bug occurred only… on a commit that updated the README.md :(
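A result like this is what an intermittent bug tends to produce: a single run without the bug does not show that a commit is good, so one false "good" answer sends the search to an unrelated commit. A rough sketch of how many runs a commit would need before being marked good is shown below. It assumes each run reproduces the bug independently with the same probability; the 20% figure is purely illustrative and was not measured in this thread.

```python
def trials_needed(p_repro, max_miss):
    """Smallest n such that a buggy commit passes all n runs with probability <= max_miss.

    p_repro is the assumed chance that a single run shows the bug on a buggy commit.
    """
    if not 0 < p_repro < 1:
        raise ValueError("p_repro must be strictly between 0 and 1")
    if not 0 < max_miss < 1:
        raise ValueError("max_miss must be strictly between 0 and 1")
    n, miss = 0, 1.0
    while miss > max_miss:
        miss *= 1 - p_repro  # probability that the bug stayed hidden in every run so far
        n += 1
    return n


def is_bad_with_retries(run_once, attempts):
    """Mark a commit bad as soon as any run shows the bug; good only if every run is clean."""
    return any(run_once() for _ in range(attempts))


if __name__ == "__main__":
    # Illustrative numbers only: 20% reproduction rate, at most 5% chance of a wrong "good".
    print(trials_needed(0.2, 0.05))
```

The lower the reproduction rate, the more runs each bisect step needs, which is why a reliable way to trigger the bug matters before bisecting.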

JohnMcLear commented 11 years ago

Quantum bugs are the worst...

marcelklehr commented 11 years ago

OK, so we first need to find out how to consistently reproduce the bug.

marcelklehr commented 11 years ago

What happens when you reload the pad after noticing that the two pads are out of sync?

JohnMcLear commented 11 years ago

AFAIK: the default pad text, if it's a new pad.

So if you create a new pad, start typing, and press F5, you see the default pad text.