exasol / mysql-virtual-schema

Virtual Schema for connecting MySQL as a data source to Exasol
MIT License
2 stars 0 forks source link

Charset conversion mysql-connector-j-8.0.31.jar #26

Closed djamadeus closed 1 year ago

djamadeus commented 1 year ago

When reading data from virtual schema I get a converison-error from UTF8 to ASCII,. Reading directly from the jdbc-connection works though.

EXA_DB.VIRTUAL_JOINTDB_OFFICE> Select from( IMPORT FROM JDBC AT MYSQL_JOINTDB_JDBC_CONNECTION STATEMENT 'SELECT from office.erstattungen_positionen limit 100') [2022-11-16 10:13:59] 100 rows retrieved starting from 1 in 1 s 85 ms (execution: 1 s 53 ms, fetching: 32 ms) EXA_DB.VIRTUAL_JOINTDB_OFFICE> Select * from virtual_jointDB_office."erstattungen_positionen" limit 100 [2022-11-16 10:21:58] [42636] ETL-3009: [Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1749641517751205888).

On the Source-DB the charset is latin1 (yes i know :(, not my fault^^).

djamadeus commented 1 year ago

the column contains german letters like "Ü". The Problem here seems to be, that the virtual schema adapter gets latin1 fromt he column definition so it expects latin1, the mysql JDBC driver already converts this to utf-8 though.

Relativity74205 commented 1 year ago

We have a similar error message, however it is independent of the mysql driver version. However, going back to the previous virtual schema version (virtual-schema-dist-9.0.5-mysql-3.0.0.jar) solves the issue.

djamadeus commented 1 year ago

Jep switching back to 9.0.5 of the adapter script solved the issue for me as well

ckunki commented 1 year ago

Many thanks for reporting this issue. Please excuse that I answer only now. We will have a look at it.

ckunki commented 1 year ago

To summarize: version virtual-schema-dist-9.0.5-mysql-3.0.0.jar did not show the behavior described above while virtual-schema-dist-10.0.1-mysql-4.0.0.jar does so, right?

Hi @djamadeus could you please add the result of describing the table? (at least for one of the columns in question of being either ASCII / latin1 / or UTF-8)

In Exasol DB SQL you get this from DESCRIBE <schema>.<table-name>; with <schema> being the name of the virtual schema you created and <table-name> replaced by the name of your table, in total probably office.erstattungen_positionen, then.

After that I would like to exclude potential effects due to caching the schema description. We can do so by refreshing the virtual schema with ALTER VIRTUAL SCHEMA <schema> REFRESH; followed by an additional DESCRIBE <schema>.<table-name>;, see also https://docs.exasol.com/db/latest/sql/alter_schema.htm.

Based on that I will add more information.

djamadeus commented 1 year ago

@ckunki yes that only happens with version 10.0.1 Descripttion-result with 9.0.5-adapter:

COLUMN_NAME SQL_TYPE
id DECIMAL(10,0)
datum TIMESTAMP
datum_ursprung TIMESTAMP
userid VARCHAR(16) ASCII
orderid VARCHAR(32) ASCII
rekla_id VARCHAR(10) ASCII
quelle CHAR(15) ASCII
quelle_id VARCHAR(15) ASCII
art CHAR(19) ASCII
grund VARCHAR(25) ASCII
mitarbeiter VARCHAR(50) ASCII
bemerkung VARCHAR(160) ASCII
artnr VARCHAR(25) ASCII
sellid VARCHAR(15) ASCII
bezeichnung VARCHAR(50) ASCII
bezeichnung_zusatz VARCHAR(50) ASCII
anzahl DECIMAL(10,0)
einzelpreis DECIMAL(8,2)
brutto_netto CHAR(6) ASCII
steuersatz DECIMAL(4,1)
datum_bestellung TIMESTAMP
zahlungsart_bestellung VARCHAR(32) ASCII
prioritaet CHAR(6) ASCII
status CHAR(11) ASCII
status_bemerkung VARCHAR(160) ASCII
id_gutschrift DECIMAL(10,0)
djamadeus commented 1 year ago

do you want me to test the refresh-flag with the 10 or 9.x adapter?

ckunki commented 1 year ago

Thank you very much for the additional information. Please apply the refresh for the new version 10. Could please also tell me the version of the Exasol database you are using?

ckunki commented 1 year ago

Command DESCRIBE <schema>.<table-name>; shows that Exasol DB expects ASCII, which matches the error message above: ETL-3009: [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]]. The prefix ETL- of the error code indicates the message being generated by the ExaLoader which is a core component of the database.

Could you help me to get insight into the perspective of the virtual schema adapter (another component different from ExaLoader)?

You can do so by using EXPLAIN VIRTUAL <sql command>;, see https://docs.exasol.com/db/latest/sql/explain_virtual.htm. Please use the same SQL command that created the initial error message.

EXPLAIN VIRTUAL returns a table and if you could attach column "PUSHDOWN_SQL" to the current ticket, that will hopefully show which component indicated a UTF-8 encoding, which probably is incorrect and therefore cannot be converted to ASCII.

djamadeus commented 1 year ago

` CREATE VIRTUAL SCHEMA virtual_jointDB_office_test USING system_adapter_scripts.ADAPTER_SCRIPT_MYSQL WITH CONNECTION_NAME = 'MYSQL_JOINTDB_JDBC_CONNECTION' CATALOG_NAME = 'office';

Select * from virtual_jointDB_office_test."erstattungen_positionen"; [42636] ETL-3009: [Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1750109444829478912)

ALTER VIRTUAL SCHEMA virtual_jointDB_office_test REFRESH;

Select * from virtual_jointDB_office_test."erstattungen_positionen"; [Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1750109444829478912)

explain virtual Select * from virtual_jointDB_office_test."erstattungen_positionen"; Pushdown_SQL: IMPORT INTO (c1 DECIMAL(10, 0), c2 TIMESTAMP, c3 TIMESTAMP, c4 VARCHAR(16) ASCII, c5 VARCHAR(32) ASCII, c6 VARCHAR(10) ASCII, c7 CHAR(15) ASCII, c8 VARCHAR(15) ASCII, c9 CHAR(19) ASCII, c10 VARCHAR(25) ASCII, c11 VARCHAR(50) ASCII, c12 VARCHAR(160) ASCII, c13 VARCHAR(25) ASCII, c14 VARCHAR(15) ASCII, c15 VARCHAR(50) ASCII, c16 VARCHAR(50) ASCII, c17 DECIMAL(10, 0), c18 DECIMAL(8, 2), c19 CHAR(6) ASCII, c20 DECIMAL(4, 1), c21 TIMESTAMP, c22 VARCHAR(32) ASCII, c23 CHAR(6) ASCII, c24 CHAR(11) ASCII, c25 VARCHAR(160) ASCII, c26 DECIMAL(10, 0)) FROM JDBC AT MYSQL_JOINTDB_JDBC_CONNECTION STATEMENT 'SELECT erstattungen_positionen.id, erstattungen_positionen.datum, erstattungen_positionen.datum_ursprung, erstattungen_positionen.userid, erstattungen_positionen.orderid, erstattungen_positionen.rekla_id, erstattungen_positionen.quelle, erstattungen_positionen.quelle_id, erstattungen_positionen.art, erstattungen_positionen.grund, erstattungen_positionen.mitarbeiter, erstattungen_positionen.bemerkung, erstattungen_positionen.artnr, erstattungen_positionen.sellid, erstattungen_positionen.bezeichnung, erstattungen_positionen.bezeichnung_zusatz, erstattungen_positionen.anzahl, erstattungen_positionen.einzelpreis, erstattungen_positionen.brutto_netto, erstattungen_positionen.steuersatz, erstattungen_positionen.datum_bestellung, erstattungen_positionen.zahlungsart_bestellung, erstattungen_positionen.prioritaet, erstattungen_positionen.status, erstattungen_positionen.status_bemerkung, erstattungen_positionen.id_gutschrift FROM office.erstattungen_positionen'`

djamadeus commented 1 year ago

when I run the pushdown command: IMPORT INTO (c1 DECIMAL(10, 0), c2 TIMESTAMP, c3 TIMESTAMP, c4 VARCHAR(16) ASCII, c5 VARCHAR(32) ASCII, c6 VARCHAR(10) ASCII, c7 CHAR(15) ASCII, c8 VARCHAR(15) ASCII, c9 CHAR(19) ASCII, c10 VARCHAR(25) ASCII, c11 VARCHAR(50) ASCII, c12 VARCHAR(160) ASCII, c13 VARCHAR(25) ASCII, c14 VARCHAR(15) ASCII, c15 VARCHAR(50) ASCII, c16 VARCHAR(50) ASCII, c17 DECIMAL(10, 0), c18 DECIMAL(8, 2), c19 CHAR(6) ASCII, c20 DECIMAL(4, 1), c21 TIMESTAMP, c22 VARCHAR(32) ASCII, c23 CHAR(6) ASCII, c24 CHAR(11) ASCII, c25 VARCHAR(160) ASCII, c26 DECIMAL(10, 0)) FROM JDBC AT MYSQL_JOINTDB_JDBC_CONNECTION STATEMENT 'SELECT erstattungen_positionen.id, erstattungen_positionen.datum, erstattungen_positionen.datum_ursprung, erstattungen_positionen.userid, erstattungen_positionen.orderid, erstattungen_positionen.rekla_id, erstattungen_positionen.quelle, erstattungen_positionen.quelle_id, erstattungen_positionen.art, erstattungen_positionen.grund, erstattungen_positionen.mitarbeiter, erstattungen_positionen.bemerkung, erstattungen_positionen.artnr, erstattungen_positionen.sellid, erstattungen_positionen.bezeichnung, erstattungen_positionen.bezeichnung_zusatz, erstattungen_positionen.anzahl, erstattungen_positionen.einzelpreis, erstattungen_positionen.brutto_netto, erstattungen_positionen.steuersatz, erstattungen_positionen.datum_bestellung, erstattungen_positionen.zahlungsart_bestellung, erstattungen_positionen.prioritaet, erstattungen_positionen.status, erstattungen_positionen.status_bemerkung, erstattungen_positionen.id_gutschrift FROM office.erstattungen_positionen limit 100' [2022-11-23 12:30:26] [42636] ETL-3009: [Column=23 Row=0] [Charset conversion from 'UTF-8' to 'ASCII' failed [Invalid or incomplete multibyte or wide character]] (Session: 1750109444829478912)

djamadeus commented 1 year ago

If I counted right, the problem is erstattungen_positionen.prioritaet wich is an enum. Probably always using UTF8 for enums in the newer adapter?

djamadeus commented 1 year ago

descriptiontable with vs-10.0.1-adapter:

COLUMN_NAME SQL_TYPE NULLABLE DISTRIBUTION_KEY PARTITION_KEY
id DECIMAL(10,0) NULL NULL NULL
datum TIMESTAMP NULL NULL NULL
datum_ursprung TIMESTAMP NULL NULL NULL
userid VARCHAR(16) ASCII NULL NULL NULL
orderid VARCHAR(32) ASCII NULL NULL NULL
rekla_id VARCHAR(10) ASCII NULL NULL NULL
quelle CHAR(15) ASCII NULL NULL NULL
quelle_id VARCHAR(15) ASCII NULL NULL NULL
art CHAR(19) ASCII NULL NULL NULL
grund VARCHAR(25) ASCII NULL NULL NULL
mitarbeiter VARCHAR(50) ASCII NULL NULL NULL
bemerkung VARCHAR(160) ASCII NULL NULL NULL
artnr VARCHAR(25) ASCII NULL NULL NULL
sellid VARCHAR(15) ASCII NULL NULL NULL
bezeichnung VARCHAR(50) ASCII NULL NULL NULL
bezeichnung_zusatz VARCHAR(50) ASCII NULL NULL NULL
anzahl DECIMAL(10,0) NULL NULL NULL
einzelpreis DECIMAL(8,2) NULL NULL NULL
brutto_netto CHAR(6) ASCII NULL NULL NULL
steuersatz DECIMAL(4,1) NULL NULL NULL
datum_bestellung TIMESTAMP NULL NULL NULL
zahlungsart_bestellung VARCHAR(32) ASCII NULL NULL NULL
prioritaet CHAR(6) ASCII NULL NULL NULL
status CHAR(11) ASCII NULL NULL NULL
status_bemerkung VARCHAR(160) ASCII NULL NULL NULL
id_gutschrift DECIMAL(10,0) NULL NULL NULL
djamadeus commented 1 year ago

Exasol version is 7.1.15 btw

ckunki commented 1 year ago

Thank you very much for all the information. I am confident that this will enable us to find the correct answer and will come back in short.

djamadeus commented 1 year ago

My pleasure, Good Luck! :)

ckunki commented 1 year ago

If I counted right, the problem is erstattungen_positionen.prioritaet wich is an enum.

Thanks for this information. I counted the same and agree to your suspicion. Enum is not among the generic JDBC base types but a proprietary extension of MySql. Looking at the output of EXPLAIN VIRTUAL and DESCRIBE TABLE JDBC seems to map this to CHAR(6).

As EXPLAIN VIRTUAL and DESCRIBE TABLE identically show datatype CHAR(6) ASCII and in combination with the prefix ETL-3009 of the error message we now assume ExaLoader to not only generate the error message but also to trigger the failed conversion.

If this holds true then the virtual schema adapter cannot influence this behavior.

The only thing that still makes me frown is that the current ticket reports a change of behavior from version virtual-schema-dist-9.0.5-mysql-3.0.0.jar to virtual-schema-dist-10.0.1-mysql-4.0.0.jar. Did you make the observations all with the same database version Exasol 7.1.15?

djamadeus commented 1 year ago

jep. Using this as adapter on same DB works without problems: CREATE OR REPLACE JAVA ADAPTER SCRIPT system_adapter_scripts.ADAPTER_SCRIPT_MYSQL_ASCII AS %scriptclass com.exasol.adapter.RequestDispatcher; %jar /buckets/bucketfs1/bucket1/virtual-schema-dist-9.0.5-mysql-3.0.0.jar; %jar /buckets/bucketfs1/bucket1/mysql-connector-j-8.0.31.jar; / ; while this creates the reported behavior: CREATE OR REPLACE JAVA ADAPTER SCRIPT system_adapter_scripts.ADAPTER_SCRIPT_MYSQL AS %scriptclass com.exasol.adapter.RequestDispatcher; %jar /buckets/bucketfs1/bucket1/virtual-schema-dist-10.0.1-mysql-4.0.0.jar; %jar /buckets/bucketfs1/bucket1/mysql-connector-j-8.0.31.jar; / ; IN the changes to version 10 it says something about, not relying on analyzing the received data anymore but using the table meta-information instead... That sounds a lot like the actual culprit to me ;)

ckunki commented 1 year ago

changes to version 10 ... using the table meta-information instead... That sounds a lot like the actual culprit to me ;)

Thanks for mentioning that as this constantly goes through my mind, too. I only now found out that this already has been backported to Exasol version 7.1.14. So yes, this is still a candidate.

What I am still wondering about: How could the information be different from what we get from DESCRIBE TABLE? And actually we must assume that the meta-information includes an indication for UTF-8 while all the other places don't. Not mentioning any encoding would mean the same as UTF-8 is the default.

Would it be OK to ask for two additional log samples?

  1. from the data base / ExaLoader
  2. from the virtual schema and the meta-information

For (1) the relevant log files have names like *EtlJdbc_<etl_proc_num>_<job_num>*. Here is an example for mapping a JDBC data type to Exasol:

2022-11-22 23:07:42.438 debu: Mapped Column: types[0]=93, internalTypes[0]=INTERNALTYPES_TIMESTAMP

For (2) see https://docs.exasol.com/db/latest/database_concepts/virtual_schema/logging.htm

djamadeus commented 1 year ago

How would I access the file from (1) ? running a community version instance here. no root access to that blackbox as far as I know

ckunki commented 1 year ago

Sorry for assuming you might be able to access these logs. In the meantime I could do some deeper experiments and I am close to a proof of our current assumption which is matching your initial diagnosis.

VSMYSQL version 3.x inspected only the data to detect the data type and got UTF-8 which is accepted by Exasol DB when importing UTF-8 encoded values.

VSMYSQL version 4.x inspects the metadata of the JDBC connection to detect the data type of each database column. In our case this is probably CHAR with octet length 1, i.e. 1 byte per character which is more or less correct for encoding latin1. VSMYSQL maps this to Exasol data type ASCII.

When retrieving actual values in a select statement the JDBC driver converts these character to UTF-8 and Exasol database reports an error as it cannot convert these data to data type ASCII.

image

The question is now how to overcome this. I see the following options

I will ponder these options and discuss with my colleagues. Please tell me if you have a preference or if you see additional options.

djamadeus commented 1 year ago

Since detecting the type has worked flawlessly in the past, maybe having a parameter to fall back to the 3.x behaviour is best. I don't know the background why this was changed. Were there other issues with the old way, or is it "just" performance related?

Thx for your efforts btw!

ckunki commented 1 year ago

Thanks for contact information and feeback. BTW. the image was wrong - I replaced it 🥲

There were a number of reasons for the change in VSMYSQL:

Also guessing the data types from the values in the result set appeared to be inappropriate and risky. Although, we must must admit, that in your special case the old behavior seems to be preferable. 😃

In the meanwhile we did some experiments with settings of the connection which did not change the behavior.

The bug filed for MySQL Connector J: https://bugs.mysql.com/bug.php?id=69328, submitted: 27 May 2013, status Verified i.e. still open further narrows down the options ☹️.

In summary we are currently focusing option (O4) as proposed by you.

ckunki commented 1 year ago

To ensure correct understanding of the problem I investigated the behavior of the MySQL JDBC driver:

CREATE TABLE T (c1 CHAR(1), c2  ENUM('A', 'Ü')) CHARACTER SET latin1;
INSERT INTO T VALUES ('Ü', 'Ü');

The following results have been retrieved by using the MySQL JDBC driver com.mysql:mysql-connector-j:8.0.31.

Column Metadata for MySQL table 'T' provided by JDBC driver:

Column descriptions from query select * from T, as provided by JDBC driver: c1 CHAR(1) UTF8, c2 CHAR(1) UTF8.

I will also enhance the integration tests of VSMYSQL to reproduce this output.

djamadeus commented 1 year ago

glad to see this fixed. thanks alot!

djamadeus commented 1 year ago

btw same some problem occurs with postgres adapter...