ether / etherpad-lite

Etherpad: A modern really-real-time collaborative document editor.
http://docs.etherpad.org/
Apache License 2.0
16.73k stars 2.86k forks source link

Non english characters are replaced with question marks sometimes (mysql) #526

Closed Pita closed 11 years ago

Pita commented 12 years ago

We found out today, that some non english charcacter doesn't get saved properly to the database. Some of them get replaced by a questionmark, some by two, sometimes it works fine. This only happens if you use a mysql backend

I was able to figure out an example that always fails. Set up an etherpad lite instance with mysql, copy paste this string привет into a pad, restart etherpad lite and reload the pad. You will see that your string got replaced by questionmarks

This results in ueberDB in this sql command

REPLACE INTO `store` VALUES ('pad:test3', '{\"atext\":{\"text\":\"Welcome to Etherpad Lite!\\n\\nThis pad text is syncприветhronized as you type, приветso that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\\n\\nEtherpad Lite on Github: http://j.mp/ep-lite\\n\\n\",\"attribs\":\"|2+r+l*0+6+m*0+6|4+4e\"},\"pool\":{\"numToAttrib\":{\"0\":[\"author\",\"a.L7J3iR4yRjzJJJVc\"]},\"nextNum\":1},\"head\":2,\"chatHead\":-1,\"publicStatus\":false,\"passwordHash\":null}'),('pad:test3:revs:2', '{\"changeset\":\"Z:6i>6|2=r=l*0+6$привет\",\"meta\":{\"author\":\"a.L7J3iR4yRjzJJJVc\",\"timestamp\":1330620479659}}');

If I paste this sql command into phpmyadmin, the special chars gets replaced with question marks too. It looks like the node mysql driver doesn't escape this chars properly, or the database scheme can not work with such chars

Any help to solve that problem is appreciated

Pita commented 12 years ago

In the cases where these charachters gets replaced with two question marks, it breaks the pad :/

jhollinger commented 12 years ago

In mysql, each table has its own character set, and it usually defaults to latin1. This is probably causing the issue. (The charset can also be set for each connection. They should match.) The charset should probably be UTF-8. I think something like this should fix the db:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
JohnMcLear commented 12 years ago

Collation may need to be UTF8 not latin1_swedish

Try it on a dev box, this will require an ALTER *Glares at Peter..

-----Original Message----- From: Peter 'Pita' Martischka [mailto:reply@reply.github.com] Sent: 01 March 2012 17:00 To: John McLear Subject: [etherpad-lite] Non english characters are replaced with question marks sometimes (mysql) (#526)

We found out today, that some non english charcacter doesn't get saved properly to the database. Some of them get replaced by a questionmark, some by two, sometimes it works fine. This only happens if you use a mysql backend

I was able to figure out an example that always fails. Set up an etherpad lite instance with mysql, copy paste this string привет into a pad, restart etherpad lite and reload the pad. You will see that your string got replaced by questionmarks

This results in ueberDB in this sql command

REPLACE INTO `store` VALUES ('pad:test3', '{\"atext\":{\"text\":\"Welcome to Etherpad Lite!\\n\\nThis pad text is syncприветhronized as you type, приветso that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents!\\n\\nEtherpad Lite on Github: http://j.mp/ep-lite\\n\\n\",\"attribs\":\"|2+r+l*0+6+m*0+6|4+4e\"},\"pool\":{\"numToAttrib\":{\"0\":[\"author\",\"a.L7J3iR4yRjzJJJVc\"]},\"nextNum\":1},\"head\":2,\"chatHead\":-1,\"publicStatus\":false,\"passwordHash\":null}'),('pad:test3:revs:2', '{\"changeset\":\"Z:6i>6|2=r=l*0+6$привет\",\"meta\":{\"author\":\"a.L7J3iR4yRjzJJJVc\",\"timestamp\":1330620479659}}');

If I paste this sql command into phpmyadmin, the special chars gets replaced with question marks too. It looks like the node mysql driver doesn't escape this chars properly, or the database scheme can not work with such chars

Any help to solve that problem is appreciated


Reply to this email directly or view it on GitHub: https://github.com/Pita/etherpad-lite/issues/526 This email and its attachments may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of the organisation from which this email originated. If you are not the intended recipient of this email and its attachments, you must take no action based upon them, nor must you copy or show them to anyone. Please contact the sender if you believe you have received this email in error. This email was sent by School Email - Safe Webmail and Hosted Email for Schools

Pita commented 12 years ago

Maybe we can solve that without an alter. Its fascinating how much you can do wrong at a 2 column table. I should get an award for that ;)

JohnMcLear commented 12 years ago

Hah, I think I would of got more things wrong ;)

-----Original Message----- From: Peter 'Pita' Martischka [mailto:reply@reply.github.com] Sent: 01 March 2012 17:52 To: John McLear Subject: Re: [etherpad-lite] Non english characters are replaced with question marks sometimes (mysql) (#526)

Maybe we can solve that without an alter. Its fascinating how much you can do wrong at a 2 column table. I should get an award for that ;)


Reply to this email directly or view it on GitHub: https://github.com/Pita/etherpad-lite/issues/526#issuecomment-4263155 This email and its attachments may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of the organisation from which this email originated. If you are not the intended recipient of this email and its attachments, you must take no action based upon them, nor must you copy or show them to anyone. Please contact the sender if you believe you have received this email in error. This email was sent by School Email - Safe Webmail and Hosted Email for Schools

Pita commented 12 years ago

The most interesting thing is that it sets the value correct in pad:$name:revs:$number, but not in pad:$name ... that why I'm not sure if an alter helps at all

0ip commented 12 years ago

@Pita , can you set UTF-8 as default collation in ueberDB and while you're at it, pull the redis handler (really useful)? here probably :)

This would solve this issue.

JohnMcLear commented 12 years ago

@pita can we get an update on this please?

brainysmurf commented 12 years ago

I'm using etherpad lite (thank you guys!) in China with users who write in Korean, Chinese, and even Japanese. I'm just looking for a command to execute in mysql and any other related procedures that will solve this for my users. Could someone be any assistance?

JohnMcLear commented 12 years ago
ALTER TABLE store CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

replace store with whatever your table name is but afaik this problem is a legacy problem and shouldn't exist.

Are you running latest?

brainysmurf commented 12 years ago

Changelog says I'm on v1.1.5. I also see in the changelog that it supports postgres now.

I get the following whenever a user types in a Korean, Chinese, or Japanese character:

[2012-11-09 10:54:17.702] [ERROR] console - TypeError: Cannot read property 'rev' of undefined at /home/etherpad/etherpad-lite/src/node/handler/PadMessageHandler.js:622:48 at Object.whilst (/home/etherpad/etherpad-lite/src/node_modules/async/lib/async.js:553:13) at /home/etherpad/etherpad-lite/src/node/handler/PadMessageHandler.js:621:11 at /home/etherpad/etherpad-lite/src/node_modules/async/lib/async.js:86:13 at Array.forEach (native) at /home/etherpad/etherpad-lite/src/node_modules/async/lib/async.js:26:24 at Object.forEach (/home/etherpad/etherpad-lite/src/node_modules/async/lib/async.js:85:9) at Object.updatePadClients (/home/etherpad/etherpad-lite/src/node/handler/PadMessageHandler.js:616:9) at /home/etherpad/etherpad-lite/src/node/handler/PadMessageHandler.js:598:15

Any thoughts?

marcelklehr commented 12 years ago

@brainysmurf pls open a new issue for this ;)

JohnMcLear commented 11 years ago

Related to https://github.com/ether/etherpad-lite/issues/842 ?

JohnMcLear commented 11 years ago

Get latest (using dirty), create new pad, paste in

百度一下,你就知道

hit timeslider, click play hit back

All fine.

Change database to MySQL

create new pad, paste in

百度一下,你就知道

hit timeslider, click play hit back

All fine.

gedion commented 11 years ago

Hello. Oracle database has the same issue. From my understanding, however oracle only allows setting of such encoding only at the database level. This above solution is not feasible for us. We came up with the below solution which seems to work but we are not sure of the consequences.

in our oracle_db.js, before storing to db, we escape and encode all data as bellow.

exports.database.prototype.strencode = function(data){
  return unescape( encodeURIComponent(  data ) );
};

when getting it, we decode it

exports.database.prototype.strdecode = function(data){      
    return  decodeURIComponent( escape ( data ) ) ; 
};

What do you guys think?
marcelklehr commented 11 years ago

Not taking into account the possible additional magic that encodeURIComponent performs, this might be a solution.