Open YunLongND opened 2 years ago
Hello, @YunLongND
I could insert the following emoji into the MySQL database without any special configuration.
(I created the table using I also tested without a creation table.mysql
command)
Could you tell me more detail about your environment (configuration, database version, and so on)?
CREATE TABLE `example` (
`id` bigint DEFAULT NULL,
`account` bigint DEFAULT NULL,
`time` timestamp NULL DEFAULT NULL,
`purchase` timestamp NULL DEFAULT NULL,
`comment` text COLLATE utf8mb4_general_ci
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,πΊπ»π£
in:
type: file
path_prefix: sample_
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
null_string: 'NULL'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
out:
type: mysql
host: localhost
mode: insert
user: user
password: password
database: embulk_test
table: example
mysql> show variables like '%char%';
+--------------------------+--------------------------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /usr/local/Cellar/mysql/8.0.28_1/share/mysql/charsets/ |
+--------------------------+--------------------------------------------------------+
8 rows in set (0.00 sec)
@hiroyuki-sato Thank you very much for your attention to my question. My configuration information is as follows:
in: type: redshift host: redshift_host user: redshift_user password: redshift_password database: redshift_database port: 5439 fetch_rows: 1000 query: select * from schema.table options: {useUnicode: true, characterEncoding: utf8} out: type: mysql host: mysql_host port: 3306 user: mysql_user password: mysql_password database: mysql_database schema: mysql_schema table: mysql_table create_table_option: DEFAULT CHARSET=utf8mb4 options: {loglevel: 2, useUnicode: true, characterEncoding: utf8} mode: truncate_insert node_count: 16
I noticed that when I use mode truncate_insert to transfer data from Redshift to MySQL, embed first creates a temporary table in mysql, then stores the data in the temporary table, and then transfers the data in the temporary table to the target table. My chinese emoji have been successfully stored in redshift, and the temporary table has been successfully created, and the code of the temporary table is utf8mb4. However, when I insert the emoji data in redshift into the temporary table of MySQL, the above error occurs. If you need more information, please let me know.
Hello, @YunLongND
mysql_table
using mysql
command? (ex. 1insert into mysql_table values(1,"π»π£πΊ") ;`mode: insert
?I got the same error message if the target table does not support utf8mb4
.
In my case, I altered changed the database character set and recreate the table.
After that, I succeed insert data.
truncate_insert
does not recreate the target table.
Did you check the table schema using show create table mysql_table
?
I inserted the same data using mode: truncate_insert
.
@hiroyuki-sato I have checked that both my temporary table and my target table are coded utf8mb4. And using SQL directly to insert emoji into the table can be successful. Based on the above communication, we cannot confirm where the problem is, so if there is no better suggestion on this issue, let's leave it here for now. I found a little bit of a similar problem in other embulk projects, I don't know if it's related. https://github.com/embulk/embulk-input-s3/issues/75 Thank you again for your efforts.
Hello, @YunLongND
I succeed insert emoji in the following environment and configuration.
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
[client]
default-character-set=utf8mb4
id,comment
2,embulk jruby
4,πΊπ»π£
in:
type: file
path_prefix: sample_2
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
null_string: 'NULL'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: comment, type: string}
out:
type: mysql
host: 127.0.0.1
mode: truncate_insert
user: user
password: password
database: embulk_test
table: example
docker-compose.yml
version: '3'
services:
db:
image: mysql:5.7
container_name: mysql_container
environment:
MYSQL_ROOT_PASSWORD: password
MYSQL_DATABASE: embulk_test
MYSQL_USER: user
MYSQL_PASSWORD: password
volumes:
- ./docker/db/data:/var/lib/mysql
- ./docker/db/my.cnf:/etc/mysql/conf.d/my.cnf
restart: always
ports:
- 3306:3306
I had the exact same problem, and reading the documents, it seems not possible to solve the problem only by setting connector options.
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html
For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.
setting
{ useUnicode: true, characterEncoding: UTF-8 }
using mysql-connector-java:5.1.44
would use utf8mb3.
https://stackoverflow.com/questions/44591895/utf8mb4-in-mysql-workbench-and-jdbc
Starting from MySQL Connector/J 5.1.47,
When UTF-8 is used for characterEncoding in the connection string, it maps to the MySQL character set name utf8mb4.
You can check docs here
I think upgrading the driver version to 5.1.47+ or 8.0.13+ can solve the issue.
Current solution is to set the my.cnf
correctly as the document says.
Hello, @rajyan
You can change Connecotor/J to 5.1.47+ (not Connector/J 8.x) using driver_path
option. Could you try it?
Could you provide the reproduce configuration/steps like this?
I think π£ is the four-byte UTF-8 characters.
echo -n 'π£' | od -t x1
0000000 f0 9f 8d a3
0000004
About updating the driver.
https://github.com/embulk/embulk-input-jdbc/pull/237#issuecomment-1202003618
We'll eventually want to update JDBC drivers, but actually, JDBC drivers often have silent incompatibility between versions. We have hesitated to update the "default" JDBC driver there, then. (E.g. default options, ...) Instead, you should be able to switch the JDBC driver version with the driver_path option by yourself, without rebuilding the plugin. Please try that for a while.
See also: https://github.com/embulk/embulk-input-jdbc/issues/238
Hi, @hiroyuki-sato
Thank you for noticing driver_path
option!
I created a reproducible repo for the error. https://github.com/rajyan/embulk-mysql-utf8
The error occurred by setting options: { characterEncoding: UTF-8 }
, and using Connecotor/J to 5.1.49 didn't help. (maybe we can't change the charset to utf8mb4 by connector option?)
Only setting
edit: updated the repro correctlymysqld --character-set-server=utf8mb4 --collation-server=utf8mb4_general_ci
solved the issue.
Sorry, I understood my problem and maybe it was not related. The unicode error happens when inserting to the intermediate table (because the default charset is not utf8mb4 the columns of the intermediate tables are not utf8mb4)
Updated the repro by adding create_table_option: DEFAULT CHARSET=utf8mb4
https://github.com/rajyan/embulk-mysql-utf8/commit/80880d887fc6dd7b9a613e2645289dc03d30eacd
I could confirm that the error only happens in the current connector, and upgrading Connecotor/J to 5.1.49 can solve the error.
Although the inserted 4 byte chars are broken even using 5.1.49 ...
Although the inserted 4 byte chars are broken even using 5.1.49 ...
Sorry never mind. It was just my mysql client.
I can confirm that upgrading to Connecotor/J to 5.1.49 can solve the error! π https://github.com/rajyan/embulk-mysql-utf8/commit/0b32bcf8bfee08249b6c2433870ed626fa794854
Thank you for https://github.com/embulk/embulk-input-jdbc/issues/238 and https://github.com/embulk/embulk-input-jdbc/pull/237#issuecomment-1202003618
because there are several simple work arounds/solutions like using driver_path
or setting server/default charsets, Iβm not that in trouble with this issue now:+1:
Summary of the problem You'll get
org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......
error if you are not setting utf8mb4
in the mysql database default charset.
This error occurs even you are setting
utf8mb4
options : { useUnicode: true, characterEncoding: UTF-8 }
because Connector/J before version 5.1.47 uses utf8 (utf8mb3) as a default for the connection with options : { useUnicode: true, characterEncoding: UTF-8 }
.
The solution now is
utf8mb4
ordriver_path
utf8
in the connector has changed to utf8mb4
as explained above
My mysql uses code utf8mb4, and mysq's temporary table code is utf8mb4, When I insert data into mysql temporary table, it has the following error:
org.embulk.exec.PartialExecutionException: java.lang.RuntimeException: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\xA4\xA3' for column ......
Can someone help me solve this problem? thanks.