embulk / embulk-output-jdbc

MySQL, PostgreSQL, Redshift and generic JDBC output plugins for Embulk
Other
88 stars 86 forks source link

Failed to output chinese emoji to mysql #306

Open YunLongND opened 2 years ago

YunLongND commented 2 years ago

My mysql uses code utf8mb4, and mysq's temporary table code is utf8mb4, When I insert data into mysql temporary table, it has the following error: org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......

Can someone help me solve this problem? thanks.

hiroyuki-sato commented 2 years ago

Hello, @YunLongND

I could insert the following emoji into the MySQL database without any special configuration. (I created the table using mysql command) I also tested without a creation table.

Could you tell me more detail about your environment (configuration, database version, and so on)?

CREATE TABLE `example` (
  `id` bigint DEFAULT NULL,
  `account` bigint DEFAULT NULL,
  `time` timestamp NULL DEFAULT NULL,
  `purchase` timestamp NULL DEFAULT NULL,
  `comment` text COLLATE utf8mb4_general_ci
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,🍺🍻🍣
in:
  type: file
  path_prefix: sample_
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
  type: mysql
  host: localhost
  mode: insert
  user: user
  password: password
  database: embulk_test
  table: example
mysql> show variables like '%char%';
+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8mb4                                                |
| character_set_connection | utf8mb4                                                |
| character_set_database   | utf8mb4                                                |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8mb4                                                |
| character_set_server     | utf8mb4                                                |
| character_set_system     | utf8mb3                                                |
| character_sets_dir       | /usr/local/Cellar/mysql/8.0.28_1/share/mysql/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)
YunLongND commented 2 years ago

@hiroyuki-sato Thank you very much for your attention to my question. My configuration information is as follows:

I noticed that when I use mode truncate_insert to transfer data from Redshift to MySQL, embed first creates a temporary table in mysql, then stores the data in the temporary table, and then transfers the data in the temporary table to the target table. My chinese emoji have been successfully stored in redshift, and the temporary table has been successfully created, and the code of the temporary table is utf8mb4. However, when I insert the emoji data in redshift into the temporary table of MySQL, the above error occurs. If you need more information, please let me know.

hiroyuki-sato commented 2 years ago

Hello, @YunLongND

I got the same error message if the target table does not support utf8mb4. In my case, I altered changed the database character set and recreate the table. After that, I succeed insert data.

truncate_insert does not recreate the target table. Did you check the table schema using show create table mysql_table?

I inserted the same data using mode: truncate_insert.

YunLongND commented 2 years ago

@hiroyuki-sato I have checked that both my temporary table and my target table are coded utf8mb4. And using SQL directly to insert emoji into the table can be successful. Based on the above communication, we cannot confirm where the problem is, so if there is no better suggestion on this issue, let's leave it here for now. I found a little bit of a similar problem in other embulk projects, I don't know if it's related. https://github.com/embulk/embulk-input-s3/issues/75 Thank you again for your efforts.

hiroyuki-sato commented 2 years ago

Hello, @YunLongND

I succeed insert emoji in the following environment and configuration.

[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

[client]
default-character-set=utf8mb4
id,comment
2,embulk jruby
4,🍺🍻🍣
in:
  type: file
  path_prefix: sample_2
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: comment, type: string}
out:
  type: mysql
  host: 127.0.0.1
  mode: truncate_insert
  user: user
  password: password
  database: embulk_test
  table: example

docker-compose.yml

   version: '3'

   services:
     db:
       image: mysql:5.7
       container_name: mysql_container
       environment:
         MYSQL_ROOT_PASSWORD: password
         MYSQL_DATABASE: embulk_test
         MYSQL_USER: user
         MYSQL_PASSWORD: password
       volumes:
       - ./docker/db/data:/var/lib/mysql
       - ./docker/db/my.cnf:/etc/mysql/conf.d/my.cnf
       restart: always
       ports:
       - 3306:3306
rajyan commented 2 years ago

I had the exact same problem, and reading the documents, it seems not possible to solve the problem only by setting connector options.

https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html

For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.

setting

{ useUnicode: true, characterEncoding: UTF-8 }

using mysql-connector-java:5.1.44 would use utf8mb3.

https://stackoverflow.com/questions/44591895/utf8mb4-in-mysql-workbench-and-jdbc

Starting from MySQL Connector/J 5.1.47,

When UTF-8 is used for characterEncoding in the connection string, it maps to the MySQL character set name utf8mb4.

You can check docs here

I think upgrading the driver version to 5.1.47+ or 8.0.13+ can solve the issue.

Current solution is to set the my.cnf correctly as the document says.

hiroyuki-sato commented 2 years ago

Hello, @rajyan

You can change Connecotor/J to 5.1.47+ (not Connector/J 8.x) using driver_path option. Could you try it? Could you provide the reproduce configuration/steps like this?

I think 🍣 is the four-byte UTF-8 characters.

echo -n '🍣' | od -t x1
0000000    f0  9f  8d  a3
0000004

About updating the driver.

https://github.com/embulk/embulk-input-jdbc/pull/237#issuecomment-1202003618

We'll eventually want to update JDBC drivers, but actually, JDBC drivers often have silent incompatibility between versions. We have hesitated to update the "default" JDBC driver there, then. (E.g. default options, ...) Instead, you should be able to switch the JDBC driver version with the driver_path option by yourself, without rebuilding the plugin. Please try that for a while.

See also: https://github.com/embulk/embulk-input-jdbc/issues/238

rajyan commented 2 years ago

Hi, @hiroyuki-sato

Thank you for noticing driver_path option!

I created a reproducible repo for the error. https://github.com/rajyan/embulk-mysql-utf8

The error occurred by setting options: { characterEncoding: UTF-8 }, and using Connecotor/J to 5.1.49 didn't help. (maybe we can't change the charset to utf8mb4 by connector option?) Only setting mysqld --character-set-server=utf8mb4 --collation-server=utf8mb4_general_ci solved the issue. edit: updated the repro correctly

rajyan commented 2 years ago

Sorry, I understood my problem and maybe it was not related. The unicode error happens when inserting to the intermediate table (because the default charset is not utf8mb4 the columns of the intermediate tables are not utf8mb4)

rajyan commented 2 years ago

Updated the repro by adding create_table_option: DEFAULT CHARSET=utf8mb4 https://github.com/rajyan/embulk-mysql-utf8/commit/80880d887fc6dd7b9a613e2645289dc03d30eacd

I could confirm that the error only happens in the current connector, and upgrading Connecotor/J to 5.1.49 can solve the error. Although the inserted 4 byte chars are broken even using 5.1.49 ...

rajyan commented 2 years ago

Although the inserted 4 byte chars are broken even using 5.1.49 ...

Sorry never mind. It was just my mysql client.

I can confirm that upgrading to Connecotor/J to 5.1.49 can solve the error! πŸ˜„ https://github.com/rajyan/embulk-mysql-utf8/commit/0b32bcf8bfee08249b6c2433870ed626fa794854

rajyan commented 2 years ago

Thank you for https://github.com/embulk/embulk-input-jdbc/issues/238 and https://github.com/embulk/embulk-input-jdbc/pull/237#issuecomment-1202003618

because there are several simple work arounds/solutions like using driver_path or setting server/default charsets, I’m not that in trouble with this issue now:+1:

rajyan commented 2 years ago

Summary of the problem You'll get

org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......

error if you are not setting utf8mb4 in the mysql database default charset.

This error occurs even you are setting

because Connector/J before version 5.1.47 uses utf8 (utf8mb3) as a default for the connection with options : { useUnicode: true, characterEncoding: UTF-8 }.

The solution now is