Closed SkoricIT closed 4 years ago
The error can easily be reproduced by creating a new pad with a single emoji (e.g. panda_face) and restarting etherpad, see also #3340.
I cannot reproduce this, doing exactly what is described above. See https://etherpad.wikimedia.org/p/ohmy for an example (yes, I've restarted etherpad multiple times already).
We just had a pad break with this error as well. Curiously, checkPad.js does not find any problem, and repairPad.js runs to completion without fixing it. Is there any way to determine which revision is at fault?
EDIT: Ah, I found https://gist.github.com/marcelklehr/a78d293571e7f06e3cf9 which pointed me the right way. Any chance this could be included in etherpad itself? It has been infinitely helpful right now, thanks a ton! (However, I had to replace console.log by console.error to even see any revision numbers. I have no nodeJS experience whatsoever, but I couldn't figure out another way to actually see all the logging.)
Indeed doing the "replace ???? by ??" helped here as well. :) Seems like the last changeset was someone inserting an emoji (it ended in $????).
However, I do not understand why this is classified as a "minor bug". This bug leads to total loss of a pad (until someone notices the /timeslider thing, which took a week in our case, and even then history is lost).
Unassigned myself, as it's unlikely I'll get to fixing this. FWIW, this bug appears to be due to a limitation of the easysync library, which I'm speculating does not support all of UTF-8. (UTF-8 may encode one character as multiple bytes, which each add to the length of a string in JavaScript, even though it's just one character.)
-- nevermind -- :D
Actually we have umlauts (äöü) in our pads all the time, which are also multi-byte in UTF-8. Based on what has been said above, I think the issue is actually about UTF-16 -- which, when originally designed, was intended to have exactly 2 bytes per character (codepoint, really), but now that we have more than 2^16 codepoints there are some that need 4 bytes, like emojis. And now length() no longer matches the number of codepoints, and everything gets confused.
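For anyone who wants to see the mismatch described above, here is a minimal Node.js sketch. The helper name measure is made up for illustration and is not part of Etherpad or easysync; it only uses standard JavaScript and Node's Buffer.

```javascript
// Compare three ways of measuring the "length" of the same string.
function measure(text) {
  return {
    utf16Units: text.length,                     // what JavaScript's .length reports
    codePoints: Array.from(text).length,         // iterating a string yields code points
    utf8Bytes: Buffer.byteLength(text, 'utf8'),  // size once encoded as UTF-8
  };
}

const umlaut = '\u00e4';    // ä, inside the BMP
const panda = '\u{1F43C}';  // 🐼, outside the BMP (needs a surrogate pair)

console.log(measure(umlaut)); // { utf16Units: 1, codePoints: 1, utf8Bytes: 2 }
console.log(measure(panda));  // { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```

The umlaut is multi-byte in UTF-8 but still a single UTF-16 unit, which would fit the observation that umlauts work while emojis cause trouble.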
So maybe a better fix is to outright reject any surrogate pairs (4-byte codepoints)? That would make it impossible to use etherpad with characters from the supplementary plane, but that's likely broken anyways it seems? And it should protect the DB. There seem to be ways to test for surrogate pairs in JS (but I have zero experience in modern JavaScript).
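A rough sketch of what such a test could look like in JavaScript, assuming the goal is simply to detect any code point above U+FFFF. The function name hasSupplementaryChars is hypothetical, and where such a check would be hooked into Etherpad is not something this thread settles.

```javascript
// Returns true if the text contains at least one code point outside the BMP,
// i.e. one that JavaScript stores as a surrogate pair.
// The "u" flag makes the regex operate on code points instead of UTF-16 units.
function hasSupplementaryChars(text) {
  return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

console.log(hasSupplementaryChars('plain text with \u00e4\u00f6\u00fc')); // false
console.log(hasSupplementaryChars('cake: \u{1F370}'));                    // true
```

Rejecting input on this basis would block all emoji, so it is a protective stopgap rather than a fix.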
Why did this get closed? To my knowledge, Etherpad still chokes on characters outside the BMP. I recently again had to manually repair a pad that got broken this way.
I closed it because I opened the issue in 2014 and was not interested in it anymore.
Well, it is still an open problem for others, so I'd appreciate if you could reopen.
Thanks! :)
Does anybody have any example for a character (sequence) that breaks a pad reliably? This would facilitate debugging I guess.
The Easysync library describes text (and its length) in terms of "characters", but it was a minimum viable product from 10 years ago. Nowadays we should probably think in terms of NFC-normalized UTF-8 code points.
Just wondering, might we be able to solve the problem by storing the ueberdb values as binary blobs rather than in a collated text column?
Currently, if we try to put a byte sequence that is not valid utf8mb4 (think: a changeset that contains part of a multibyte character) into a utf8mb4 column, there are only two possible outcomes: either the database refuses the input, or the client (or server) needs to remove (think: replace with "?") the invalid "characters" or bytes beforehand.
By using a binary blob column, the database would no longer care about the byte sequence being invalid utf8mb4, so we might avoid the character replacement. If easysync is as encoding agnostic as I understand, this could work (as long as two users don't insert multibyte characters AB and CD at the same position concurrently and these end up as individual changesets A, C, B, D, in this order, rendering the merged result invalid utf8mb4).
PS: I just tested that inserting a 4-byte UTF-8 character like 🍰 is not a problem in itself (although I didn't restart yet, which may be the explanation), so I assume the bug either requires concurrency (leading to the character being split up into two or more changesets that are invalid on their own) or it requires a client emitting a changeset that removes part of such a character.
Hi, we are also experiencing this problem on a lot of pads.
I'm trying everything and just can't replicate this with 🍰. I tried restarts and different database backends (that are properly configured).
Can anyone provide steps to replicate with our more modern code base?
Hitting backspace on 🍰 does replace the item with � which is obviously sucky.
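One plausible reading of the � result, sketched in Node.js: if only one of the two UTF-16 units of 🍰 is removed, the remaining half is a lone surrogate, which has no valid UTF-8 encoding and comes back as U+FFFD after a round trip. This is an illustration of the mechanism, not a confirmed trace of what the Etherpad client does; hasLoneSurrogate is a made-up helper name.

```javascript
// Detects surrogate code units that are not part of a valid pair.
// With the "u" flag, a well-formed pair is treated as one code point
// and therefore does not match this range.
function hasLoneSurrogate(text) {
  return /[\uD800-\uDFFF]/u.test(text);
}

const cake = '\u{1F370}';       // 🍰, stored as two UTF-16 units
const half = cake.slice(0, 1);  // what is left if only one unit is deleted

console.log(hasLoneSurrogate(cake)); // false
console.log(hasLoneSurrogate(half)); // true

// Encoding the broken half as UTF-8 substitutes U+FFFD (bytes ef bf bd).
const bytes = Buffer.from(half, 'utf8');
console.log(bytes.toString('hex'));               // efbfbd
console.log(bytes.toString('utf8') === '\uFFFD'); // true
```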
For me, replace(`value`,'????','??') has always worked so far. Hasn't happened for a few months though.
I included an updated version of Check Pad Deltas that works. If people can give that a try to see if it helps when experiencing this problem, I'd appreciate it.
I still think the basic problem is that Etherpad data model thinks in terms of "characters" and not normalized UTF-8 code points.
Unless we rework the core library this will never be really solved. Obviously, any mitigation is useful. Just saying that there are no easy solutions that are guaranteed to be 100% correct in my opinion.
You'd be surprised just how many editors (and very popular ones with developers) have a similar experience to Etherpad tho. Playing around today I had some crazy experiences.
The updated version of Check Pad Deltas mentioned above was pulled into the master branch with #3717 (14ae2ee95094).
Hi, we are having a similar issue with one of our Pads. @JohnMcLear unfortunately the latest version of checkPadDeltas did not help :/
@gnd do you have a public instance?
Can you hit the padId/export/etherpad url and get the .etherpad file?
Are you running latest develop?
What's your database backend?
So many questions, please provide as many details as possible.
@JohnMcLear Yes, it's a public instance: https://pad.xpub.nl/p/CareCircle. Unfortunately I get a 502 Bad Gateway error when trying to get the .etherpad file. We are running latest develop (git pull origin) on nodejs 12.16.3-1nodesource1, with the db backend being 10.3.22-MariaDB-0+deb10u1.
I'm available today to help you with any sort of debugging you might want to do. I have already tried the latest version of checkPadDeltas, however it just hangs for hours after start. This is the only output it gives:
All relative paths will be interpreted relative to the identified Etherpad base dir: /opt/etherpad
[2020-05-05 00:04:12.330] [DEBUG] AbsolutePaths - Relative path "settings.json" can be rewritten to "/opt/etherpad/settings.json"
[2020-05-05 00:04:12.346] [DEBUG] AbsolutePaths - Relative path "credentials.json" can be rewritten to "/opt/etherpad/credentials.json"
settings loaded from: /opt/etherpad/settings.json
No credentials file found in /opt/etherpad/credentials.json. Ignoring.
[2020-05-05 00:04:12.369] [INFO] console - Using skin "no-skin" in dir: /opt/etherpad/src/static/skins/no-skin
[2020-05-05 00:04:12.371] [INFO] console - Session key loaded from: /opt/etherpad/SESSIONKEY.txt
[2020-05-05 00:04:12.541] [ERROR] console - table is not configured with charset utf8 -- This may lead to crashes when certain characters are pasted in pads
[2020-05-05 00:04:12.543] [INFO] console - RowDataPacket { character_set_name: 'utf8mb4' } utf8
Dude, the error is in your log!
[2020-05-05 00:04:12.541] [ERROR] console - table is not configured with charset utf8 -- This may lead to crashes when certain characters are pasted in pads
[2020-05-05 00:04:12.543] [INFO] console - RowDataPacket { character_set_name: 'utf8mb4' } utf8
@JohnMcLear our db has
+----------------------------+------------------------+
| DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME |
+----------------------------+------------------------+
| utf8                       | utf8_general_ci        |
+----------------------------+------------------------+
While the store table has
+--------------------+
| character_set_name |
+--------------------+
| utf8mb4            |
+--------------------+
So should I convert using
ALTER DATABASE `etherpad_lite_db` CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
?
@JohnMcLear
The misconfiguration was twofold: the database was using utf8 and utf8_general_ci, and in settings.json the charset for the database was also set to "utf8". Having changed all of that to utf8mb4 still didn't help: the pad in question doesn't load, and checkPadDeltas still hangs:
All relative paths will be interpreted relative to the identified Etherpad base dir: /opt/etherpad
[2020-05-05 13:17:43.443] [DEBUG] AbsolutePaths - Relative path "settings.json" can be rewritten to "/opt/etherpad/settings.json"
[2020-05-05 13:17:43.444] [DEBUG] AbsolutePaths - Relative path "credentials.json" can be rewritten to "/opt/etherpad/credentials.json"
settings loaded from: /opt/etherpad/settings.json
No credentials file found in /opt/etherpad/credentials.json. Ignoring.
[2020-05-05 13:17:43.463] [INFO] console - Using skin "no-skin" in dir: /opt/etherpad/src/static/skins/no-skin
[2020-05-05 13:17:43.464] [INFO] console - Session key loaded from: /opt/etherpad/SESSIONKEY.txt
@gnd It's a GIGO problem. Once you have garbage in, it can't be changed. Now all you know is the problem won't appear in the future!
Wouldn't repairPad.js be able to fix these broken pads?
Oh hi @caugner - sadly no, repairPad.js generally sucks and doesn't really work. https://github.com/ether/etherpad-lite/blob/develop/bin/repairPad.js#L48
The best thing I can suggest is to pull the atext/text out of the pad and bring it into a new pad.
@gnd I can write you a script to test to try and get the text if you want?
bin/extractPadData.js with a change to output to stdout might be sufficient here.. 2mins I will create an extractPadText.js
@JohnMcLear that would be quite helpful indeed )
Use node bin/extractPadData.js $padid
Then cat $padid.db | grep \"text\" | grep revNum | tail -1
The text is the val.atext.text item; you could JSON-parse this at the CLI. I will do that next if you need it. For now, run these commands, making sure you replace $padid with your pad ID.
To install jq: sudo apt-get install jq
To see just the text: cat $padid.db | grep \"text\" | grep revNum | tail -1 | jq .val.atext.text
To write the pad text to a text file: cat $padid.db | grep \"text\" | grep revNum | tail -1 | jq .val.atext.text > $padid.txt
Now that you have the pad text, you can just put it in a text file and import it, or use the setText API or whatever...
Lemme know if extraction fails and I will consider another approach.
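If jq is not available, the grep/tail/jq pipeline above can be approximated in Node.js. This is only a sketch: it assumes each relevant line of $padid.db is a standalone JSON object with the text at val.atext.text (which is what the jq command implies), and lastPadText is a made-up helper name, not a script shipped with Etherpad.

```javascript
// Mirrors: grep "text" | grep revNum | tail -1 | jq .val.atext.text
// Takes the whole dump as a string and returns the text of the last matching line.
function lastPadText(dbDump) {
  let latest = null;
  for (const line of dbDump.split('\n')) {
    // Same filtering the two grep calls perform.
    if (!line.includes('"text"') || !line.includes('revNum')) continue;
    try {
      const record = JSON.parse(line);
      const text = record && record.val && record.val.atext && record.val.atext.text;
      if (typeof text === 'string') latest = text; // keep the last match, like tail -1
    } catch (err) {
      // Lines that are not valid JSON are skipped.
    }
  }
  return latest;
}

// Usage (assumption: the dump file is passed as the first argument):
//   const fs = require('fs');
//   process.stdout.write(lastPadText(fs.readFileSync(process.argv[2], 'utf8')) || '');
```

Unlike the jq call as written, this returns the decoded string rather than a JSON-quoted one, so newlines in the pad come out as real newlines.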
The extraction is running, however it is quite slow. In the file CareCircle.db I see the latest line at revs:80, while the script has already been running for 20 minutes. The pad in question has over 12k revisions..
Oh man, that sucks.. I guess it can't build the pad object after 80 revisions.. It should only take 30 seconds or so for the script to run.
The last suggestion would be a big one: dump the entire db and send it to me, and then I can write a script to parse out what you need. Alternatively I can try to write a script here, but there might be some back & forth to get it working that way.
Hi @JohnMcLear, the script has finally finished. I have no idea why it took so long (almost 40 hrs). Anyway, looking into it, it seems to me that the whole exercise can be done by selecting the highest revision which is divisible by 100 from the store table and extracting the text from it? In the future I'll do this by hand :) Thanks a lot for your help
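The "highest revision divisible by 100" selection could be sketched like this in Node.js, given a list of keys already fetched from the store table. The key layout pad:<padId>:revs:<n> is an assumption pieced together from the "revs:80" and "pad:my-broken-pad" mentions in this thread, and latestKeyRevisionKey is a hypothetical helper; check a few real keys before relying on it.

```javascript
// Given store-table keys, return the key of the highest revision of padId
// whose revision number is divisible by 100, or null if there is none.
function latestKeyRevisionKey(keys, padId) {
  const prefix = `pad:${padId}:revs:`; // assumed key layout, see note above
  let best = -1;
  for (const key of keys) {
    if (!key.startsWith(prefix)) continue;
    const suffix = key.slice(prefix.length);
    if (!/^\d+$/.test(suffix)) continue; // ignore anything that is not a plain number
    const rev = Number(suffix);
    if (rev % 100 === 0 && rev > best) best = rev;
  }
  return best < 0 ? null : prefix + best;
}

const exampleKeys = [
  'pad:CareCircle:revs:80',
  'pad:CareCircle:revs:12300',
  'pad:CareCircle:revs:12345',
  'pad:OtherPad:revs:20000',
];
console.log(latestKeyRevisionKey(exampleKeys, 'CareCircle')); // pad:CareCircle:revs:12300
```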
Exactly this, but I often get told off by our users when I make the assumption they can perform database queries so I try to avoid it. I think I know why it took so long btw, are you using MySQL @ Etherpad 1.8.3 ?
I'm using the latest master from git (not sure which version that is)
Assuming MySQL, it's a known bug, and the patch is due to land today.
Yes, sorry, it's the latest MariaDB - 10.3.22-MariaDB
@JohnMcLear I'm sorry to spam this ticket, but do you have an issue open for the MySQL patch you mentioned? I want to see if our performance troubles with etherpad might be resolved by it.. thanks
No but just do npm install ueberdb@0.4.9 to fix
Btw the new logic for storing additional atext is in so this should be closed but if people experience an issue please do create a new issue and refer to this one. I want to deal with each individual cause of problem case-by-case with the main goal to create automated logic to restore a pad upon detected corruption in real time. That's the dream as corruption is inevitable.
This is a message for people getting to this recently (when upgrading from older versions of etherpad).
Today I upgraded an etherpad service from 1.6.3 to 1.8.6 (what a change!!!!! congratulations to all developers)
I had problems with one pad, the checkers (checkPad, checkAllPads, etc.) failed to detect it (or I don't know how to run node fine, anyway).
I verified the charset is utf8mb4 in my settings.json (saw last version in settings.json.template).
"dbType" : "mysql",
"dbSettings" : {
"user": "etherpaduser",
"host": "localhost",
"port": 3306,
"password": "PASSWORD",
"database": "etherpad_lite_db",
"charset": "utf8mb4"
},
for case https://pad.example.com/p/my-broken-pad I did:
mysql
update `store` set `value` = replace(`value`,'????','??') where `key` like "pad:my-broken-pad"
and it worked again :tada: :unicorn: :sparkles:
this solution was above (I put a +1 on previous messages with the solution to help find it), but I wanted to have it more clear
I guess one thing we could do here is check for ???? in pad contents and provide a warning that includes a suggested solution. @pedro-nonfree please could you submit a patch to checkPad.js or something then I'd happily merge that :)
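A rough idea of what such a warning could look like, as a standalone Node.js sketch. This is not code from checkPad.js; the function name checkForReplacedCharacters and the message wording are invented for illustration, and the suggested SQL is simply the statement quoted earlier in this thread.

```javascript
// Looks for "????" in a stored pad value (as a string) and, if found,
// returns a warning with the workaround from this thread. Returns null otherwise.
function checkForReplacedCharacters(padId, storedValue) {
  const count = storedValue.split('????').length - 1; // non-overlapping occurrences
  if (count === 0) return null;
  return `Pad "${padId}" contains ${count} occurrence(s) of "????", which may be a ` +
    '4-byte character that was replaced on write. A workaround reported by users: ' +
    "update `store` set `value` = replace(`value`,'????','??') " +
    `where \`key\` like "pad:${padId}"`;
}

console.log(checkForReplacedCharacters('my-broken-pad', 'Z:1>4*0+4$????'));
console.log(checkForReplacedCharacters('healthy-pad', 'Z:1>5*0+5$hello')); // null
```

A pad whose text legitimately contains four question marks would trigger a false positive, so this should only ever warn, never rewrite anything on its own.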
This error occurred with one single pad on an instance that was never upgraded and has been pinned to version 1.8.6 since initial deployment. I fixed the issue today, however I don't know what actually helped. First I tried the SQL query, which did not seem to help. Then I set the charset as an env variable on my kubernetes deployment, which redeployed the pod. I can't say if it was the charset or the SQL query in combination with the redeploy, but it's fixed now.
Hey guys. We are using stable and have the problem that some pads randomly stop working and throw an uncaught error in the console.
Example:
When this happens, the "loading" overlay blocks any action. It's unlikely to be a copy&paste issue because it sometimes happens to entirely handwritten pads.
Interestingly, the timeslider (opened by appending /timeslider to the URL) always works without problems.
Right now we are manually fixing the pads by exporting and importing as HTML (losing all changesets). Any idea what's wrong?