ether / etherpad-lite

Etherpad: A modern really-real-time collaborative document editor.
http://docs.etherpad.org/
Apache License 2.0
16.72k stars 2.85k forks source link

Uncaught Error: Failed assertion: Invalid changeset (checkRep failed) #2107

Closed SkoricIT closed 4 years ago

SkoricIT commented 10 years ago

Hey guys. We are using stable and have the problem that some pads randomly stop working and throw an uncaught error in the console.

Uncaught Error: Failed assertion: Invalid changeset (checkRep failed) 

Example:

https://etherpad.tugraz.at/p/l3tsbet

When this happens, the "loading" overlay blocks any action. It's unlikely to be a copy&paste issue because it sometimes happens to entirely handwritten pads.

An interesting thing is, that the timeslider (opened by appending /timeslider to the url) always works without problems.

https://etherpad.tugraz.at/p/l3tsbet/timeslider

Right now we are manually fixing the pads by exporting+importing with HTML (losing all changesets). Any idea whats wrong?

JohnMcLear commented 10 years ago

Try the latest Develop branch and let us know if it still occurs

SkoricIT commented 10 years ago

This is not trivial as it occurs by chance on the prod server with many users and I cannot reproduce it. I can only test if broken pads stay broken on develop, if that helps.

JohnMcLear commented 10 years ago

yes please

SkoricIT commented 10 years ago

I will get the db copied to dev next week. Will report as soon as possible. Thank you for the fast response.

SkoricIT commented 10 years ago

Unfortunately moving to develop has not repaired the damaged pads. Interestingly, moving servers (simple sql export/import) has "repaired" one of the pads. It works on the new server (even on 1.3.0) but other damages pads don't. Still the same error.

This is a really strange bug and pads seem to sometimes just "self-repair" even on PROD and with no change whatsoever from us.

JohnMcLear commented 10 years ago

In theory this problem shouldn't occur at all as the bad data should never find it's way in now..

JohnMcLear commented 10 years ago

I can leave this open and if it occurs on fresh data we can try to resolve it..

SkoricIT commented 10 years ago

What do you mean with "fresh data" ? Do I need to put PROD on develop and try to get new broken pads?

JohnMcLear commented 10 years ago

Hrm, that'd be painful potentially.. Maybe wait for 1.4 which should be out in a few weeks max.

SkoricIT commented 10 years ago

We might do that. Thanks a lot.

usabilidoido commented 10 years ago

I also have that problem, with the same strange randomness. I updated to etherpad-lite 1.4 and it is still there. URL for one of the pads http://etherpad.usabilidoido.com.br:8080/p/07318a9b2b5666637d870fc50656323620af4df4

SkoricIT commented 10 years ago

I am right now in the process of upgrading to 1.4 and hopefully going live with the new version, so I can check if new defects come up in fresh pads after running on the new version.

@usabilidoido you can tell your users to export-import the pad. You can access controls by adding /timeslider to the url like this: http://etherpad.usabilidoido.com.br:8080/p/07318a9b2b5666637d870fc50656323620af4df4/timeslider

Export as html and then import into a new pad.

ericpedia commented 10 years ago

I am experiencing this issue in 1.4. On broken pads, command line shows [WARN] client - Uncaught TypeError: Cannot read property 'length' of undefined when browser shows the Failed assertion error.

Hat tip to @Ra1d3n for the /timeslider workaround. Glad to see pad content is still accessible there.

marcelklehr commented 10 years ago

This is just a hunch, but if you like, try with the experimental try/client-init-remove-checkRep branch I just created. (This is generally a dangerous thing to do, but it's worth a try, I think.)

SkoricIT commented 10 years ago

I have upgraded everything to 1.4. and yesterday we had a broken pad again. Might still be one that broke before update, but I am not sure. I will keep on looking.

@marcelklehr Unfortunately, I can't move production server to a dangerous version. And I can't mirror requests to a secondary server because I don't have the resources. :-/

marcelklehr commented 10 years ago

Sorry, I wasn't clear: Don't run try/client-init-remove-checkRep in production, but try to access the broken pads with etherpad running on that branch.

I removed checkRep in that branch, because I suspect that normalization may not work correctly in some cases. So, when a broken pad works on that branch without any problems at all, we've got to revisit this method.

SkoricIT commented 10 years ago

@marcelklehr I just did, and unfortunately it did not help.

Process:

git fetch
git checkout try/client-init-remove-checkRep
git status
On branch try/client-init-remove-checkRep
Your branch is up-to-date with 'origin/try/client-init-remove-checkRep'.

I confirmed the change is actually in the file system. (comment and new line are there) Error is still the same.

asd

I did get the error that @ericpedia mentioned, and it did not occur on pads other than the corrupted one.

console on server:

[2014-05-09 16:55:39.152] [WARN] client - Uncaught TypeError: Cannot read property 'length' of undefined -- { errorId: 'dTtndCRA5gonLZyvMlqw',
  msg: 'Uncaught TypeError: Cannot read property \'length\' of undefined',
  url: 'http://localhost:9001/p/OkTJWMYVNs',
  linenumber: 15,
  userAgent: 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36' }

When I kill the server the message is displayed in the client, also:

asd

I might be able to give you an SQL import of the broken pad, but you would need to tell me what exactly you need extracted from the database first. :-)

marcelklehr commented 10 years ago

The actual pad content would be very helpful, so that'd be db key pad:<PADID> (http://etherpad.org/doc/v1.4.0/#index_pad_padid) and perhaps a few of the last revisions.

marcelklehr commented 10 years ago

I think I'm getting closer:

checkPad says:

$ bin/checkPad <padID>
[WARN] console - DirtyDB is used. This is fine for testing but not recommended for production.
[ERROR] console - Bad changeset at revision 4901 - Failed assertion: mismatched apply: 11636 / 11635
[ERROR] console - Bad changeset at revision 11401 - Failed assertion: mismatched apply: 42094 / 42093
[ERROR] console - Bad changeset at revision 12301 - Failed assertion: mismatched apply: 48875 / 48874
[ERROR] console - Bad changeset at revision 13601 - Failed assertion: mismatched apply: 60227 / 60226
[INFO] console - finished

Mismatched apply in all cases means that the changeset in question expected one character less than actually present (so, one unexpected character). After examining the db dump, I found out that all of these changesets follow a revision with an atext attached to it (not all revisions include the full pad contents, but instead just the delta, however, every so often for performance reasons the full contents are attached).

Presumably, something went awry with this meta property either upon storing the revision or upon creating the current atext for a new pad client.

marcelklehr commented 10 years ago

I just rewrote checkPad to use the computed atext instead of the cached one and the pad passes. Even the cached atexts are correct! This makes me think there's a bug in some algorithm that computes the full atext for the client_vars!

SkoricIT commented 10 years ago

Nice work so far Marcel. Sounds like you are close to solving this.

marcelklehr commented 10 years ago

Ah, it's not the client_vars generator. The algorithm responsible for creating the cached atext is broken.

@Ra1d3n As a quickfix, you can revive your broken pads by deleting the atext meta property from all key revisions (where revNo % 100 == 0). "Re-inserting" all records related to a broken pad was reported as a fix for similar problems a long time ago -- now we know why it works :)

SkoricIT commented 10 years ago

@marcelklehr Will that make EP rebuild the pad from revision 0 on load? I guess I should not go around deleting all atext properties from all revisions then... :-) Any hope for a "regenerate Pad" script that rebuilds key revisions with a correct algorithm?

Gared commented 10 years ago

I've made some changes to the checkPad-script. See here: #1653 But this is more a dirty hack. If my script will resolve this problem I will write a script to fix pads

SkoricIT commented 10 years ago

Cool, I look forward to it. Right now we get only about one broken pad a week, so I am inclined to wait for a proper fix. :-) Thanks.

marcelklehr commented 10 years ago

Yea, I was thinking of a script.

marcelklehr commented 10 years ago

So, the diff between the computed atext and the cached one shows that: "концертов" somehow got turned into "ко��цертов". In another revision, "з" was turned into "��" as well...

I'm not sure what this is about. These chars are in the Unicode BMP, afaik, so I don't know what happened to them.

marcelklehr commented 10 years ago

The mutations in @Ra1d3n's pad occur in key revisions 4900, 11400, 12300 and 13600 (every 100th rev is a key rev, meaning it'll cache the full pad contents). Also, the AttributedText stored in the pad record is corrupt, too. All other key revisions are fine. (analyzed with this script)

The chars seem offly random. Nothing sticks out, really. I don't see a pattern.

I suspect there's something going wrong when storing the AttributedText in the database. Since sometimes the pad recovers in between broken key revisions, I'm guessing that the atext stored in memory is good. Sometimes, when it is stored in the db it somehow gets corrupted, though. If authors continue to edit the document until the next key revision is created and that revision doesn't end up getting corrupted, then nobody will notice anything.
However, if the server shuts down before managing to successfully save the valid AText held in memory to the database, the pad will be broken on the next server start.

SkoricIT commented 10 years ago

@marcelklehr We had etherpad crash and recover a few times because of a plugin that was not well coded, could this be the cause? Strange that it's just one letter that got corrupted every time.

Gared commented 10 years ago

I created a script to reinsert all database value of a pad. On my broken pads the script hasn't worked. @Ra1d3n You can download the script here: https://github.com/Gared/etherpad-lite/blob/pad-repair-script/bin/repairPad.js Make sure your etherpad instance isn't running while you're executing this script.

marcelklehr commented 10 years ago

@Gared Where do you get the values from in your script? I recommend working off my checkPad fork, tweaking it to insert the re-computed AttributedText values into the database.

@Ra1d3n Etherpad crashes are unlikely to cause data corruption, I'd say.

SkoricIT commented 10 years ago

@marcelklehr It probably just made the problem more visible, because the correct memory value was lost for us in those cases.

Gared commented 10 years ago

@marcelklehr This script is a fork from the extractPadData script. The values and keys are loaded in the function above. These are all keys of a pad.

usabilidoido commented 10 years ago

The repairPad.js script from @Gared is fixing my broken pads. Any chance of this fix being incorporated in the next etherpad-lite version?

JohnMcLear commented 10 years ago

Sure just send a pull request with it in or ask @Gared to

Gared commented 10 years ago

Opened pull request #2210

vieville commented 10 years ago

My etherpad is running on 1.4.1 and I have got 3 times the same problem described above : the pad is not able to load but /timeslider works well. 2 of them are now rerunning without doing nothing. On the third pad, I tried unsuccessfully the repairPad.js. Its url is :+1: http://portail.univ-lille1.fr/etherpad/link.jsp?groupID=g.jfobkKVrkydeTwLY&padID=SES_Grp8_Macroeconomie (you ave to click on "utilisez le pad avec l'invitexy" to access to the paditself.

Maybe there is a changeset with an anormal value which is not taken into account by repairPad or is there any new release of reparPad.js I have not seen on git ?

JohnMcLear commented 9 years ago

We think we have a fix for this now. Please pull develop and test :) Thanks!

akosiaris commented 8 years ago

Hello, we 've just had this occur. We run 1.5.7, that is the last released version. Backend is a MySQL database. I have no indications as to user actions that might have caused this.

Pad in question: https://etherpad.wikimedia.org/p/iOS-iteration-planning

Getting the pad data out via the timeslider trick works fine. The pad however will never load.

Anything in the database can be dumped and provided for debugging if it helps.

webzwo0i commented 8 years ago

hi, I had this discussion in the morning. 08:47 < webzwo0i> mutante: i debugged the pad. you normally should not do this, but if you have a database backup (and after making export/etherpad to have a backup of the pad) you can change three database entries and the pad should be working fine again. the three mysql commands are

mysql> update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS-iteration-planning";
mysql> update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS-iteration-planning:revs:7200";
mysql> update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS-iteration-planning:revs:7300";

08:49 < webzwo0i> can you check if your database is running utf8mb4 charset or utf8? 08:51 < webzwo0i> oha ups please dont apply the mysql-commands. maybe i was a little bit to fast :-) need to check something first 09:08 < webzwo0i> mh nope should be fine, please test... it would be good to know if you ran latest release or something else and which plugins you have enabled 09:47 < webzwo0i> I don't know if you ppl at wikimedia know each other but if you can find out who the user "Brian" is could you ask him what browser he is using? the reason is I can see what the bug is, but I cannot trigger it in my browser (only manually, but because you ppl are not hostile it was probably not on purpose) 09:49 < webzwo0i> (so we probably have two bugs, one server-side and one client-side)

webzwo0i commented 8 years ago

the info I need is if your database is utf8 or utf8mb4 I have extracted the offending changeset, the timeslider will not work around these revisions if you don't also apply

mysql> update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS-iteration-planning:revs:7105";

together with the updates from above this should make your pad reusable again

akosiaris commented 8 years ago

@webzwo0i

Yeah, I noticed that mutante talked to you. I just decided to follow up on the github issue as well. I 'll track down who "Brian" is and what browser they are using for you.

So, the store table is utf8mb4 for quite some time now. It was utf8 but we converted to utf8mb4 due to a different set of problems back in Jun. Specifically on Jun 23, 2015

https://github.com/ether/etherpad-lite/issues/2522#issuecomment-114441189 ;-)

webzwo0i commented 8 years ago

no need to find out the user/browser, I can reproduce the bug now. thank you!

Because you are on latest release you need to insert "charset":"utf8mb4" into settings.json inside dbSettings. This is now in settings.json.template. You can check if this is necessary with

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

client (and maybe connection?) should indicate utf8mb4, if not your database-tables itself are capable of storing 4 byte utf8 but the server does not expect 4byte utf8 an the client-connection. This does not repair your old pads. You can however iterate over all your pads and use bin/checkPad.js to get an idea how many and which pads might have similar problems. Depending on the circumstances it can be very easy to repair (although some characters will be broken) like in the above case. If there are lots of broken pads it might make sense to automate this.

The reason these problems are not seen when people actually are writing is that most sites use ueberDB's internal cache for performance. This pure-javascript cache is totally aware of Unicode. As soon as the cache is flushed or etherpad is restarted it needs to get the entries from the database.

akosiaris commented 8 years ago

I 've repaired the pad as you 've instructed. Interesting to find out that 4 question marks in a row would corrupt a PAD in such a way. And that the corrupting changeset would be so old compared to the tip of the pad. But your explanation makes sense, thanks for that.

I 've updated the configuration with "charset": "utf8mb4" as well.

webzwo0i commented 8 years ago

I am following the phabricator ticket at wikimedia but don't have an account there so I post it here.

Your second broken pad can also be repaired using:

update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1120";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1254";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1216";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1108";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1106";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1200";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1300";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1400";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1500";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1600";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1700";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1800";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:1900";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:2000";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:2100";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:2200";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:2300";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives:revs:2400";
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:iOS_Retrospectives";

The 4 question marks are the symptom because, iirc, the single bytes in four-byte UTF8 are not valid UTF8. (In UTF8 only the first 127 chars are represented as single bytes, multibyte UTF8 probably uses bytes above 0x7f). So 4 question marks actually represent one 4byte encoded UTF8 string, which represents an code point outside the Basic Multilingual Plane (most probably an emoji :-D). In Javascript those code points would be encoded using UTF16's surrogate pairs, which are 2 16bit-values.

The checkRep-problem is that in changesets we not only store the characters but also the length of the change. Javascript's length() function, however, counts surrogate pairs as 2, so e.g. an emoji has length 2. When mysql decodes the string of a changeset to question marks than our representation of a changeset is not valid anymore.

Replacing it with two question marks is a hack not a real solution because we have no idea which code point the user entered in the first place (but as long as the value is in ueberDB's cache we could get it out from there).

webzwo0i commented 8 years ago

it would produce wrong results if someone actually uses four question marks (our length value would indicate 4, if we replace it with 2 question marks we would get a checkRep error in return) so if we would automate a repair script we would need to check if the string length after applying our change is conforming to the "how many chars are added"-value of the changeset.

to circument the problem if someone actually used four question marks and additionally code points like emoji we also need to track the position inside the document where we replace the question marks.

also note that not every checkRep-error is caused by broken encoding

akosiaris commented 8 years ago

And of course the above worked. AWESOME! I can not thank you enough. I am hoping that with the configuration fix in place this wont happen again.

I was wondering if the debugging you are doing is manually btw. Getting and comparing the changesets and their lengths one by one manually or if you have some more automated way of doing that. I suppose I can create one of my own anyway, just being curious

JohnMcLear commented 8 years ago

<3 @webzwo0i Thanks man you are awesome

@webzwo0i I have been doing a fair amount w/ getting the char rep from a X co-ord on an element lately (for drag and drop) and I have a feeling I'm going to hit this problem where certain characters are improperly wrapped. It will be interesting to test that at some point

caugner commented 6 years ago

The error can easily be reproduced by creating a new pad with a single emoji (e.g. 🐼) and restarting etherpad, see also #3340.

Update: As of April 2019, this single emoji itself doesn't break a pad, even after restarting.

caugner commented 6 years ago

I wanted to check all pads, so I added a checkAllPads tool (see PR #3342).