embulk / embulk-output-jdbc

MySQL, PostgreSQL, Redshift and generic JDBC output plugins for Embulk
Other
88 stars 86 forks source link

Fix NString value to JSON #341

Closed t3t5u closed 4 months ago

t3t5u commented 5 months ago

Fix NString value to JSON same as StringColumnSetter.

t3t5u commented 5 months ago

@dmikurube @hiroyuki-sato Please review. 🙇‍♂️

hiroyuki-sato commented 5 months ago

Hello, @t3t5u

Thank you for creating this PR. Could you explain why this PR is needed? Did #320 cause a problem?

t3t5u commented 5 months ago

@hiroyuki-sato

As follows, TEXT is mapped to java.sql.Types.LONGVARCHAR and NVARCHAR is mapped to java.sql.Types.NVARCHAR. https://learn.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver16#data-type-mappings

Also, java.sql.Types.LONGVARCHAR uses StringColumnSetter and java.sql.Types.NVARCHAR uses NStringColumnSetter. https://github.com/embulk/embulk-output-jdbc/blob/master/embulk-output-jdbc/src/main/java/org/embulk/output/jdbc/setter/ColumnSetterFactory.java#L135 https://github.com/embulk/embulk-output-jdbc/blob/master/embulk-output-jdbc/src/main/java/org/embulk/output/jdbc/setter/ColumnSetterFactory.java#L142

However, NStringColumnSetter does not call toJson, but StringColumnSetter does. https://github.com/embulk/embulk-output-jdbc/blob/master/embulk-output-jdbc/src/main/java/org/embulk/output/jdbc/setter/StringColumnSetter.java#L64 https://github.com/embulk/embulk-output-jdbc/blob/master/embulk-output-jdbc/src/main/java/org/embulk/output/jdbc/setter/NStringColumnSetter.java#L64

Therefore, #320 cause an incompatibility where inputting as JSON results in null.

hiroyuki-sato commented 5 months ago

Hello, @t3t5u

Thank you for your comment. LGTM👍

Just to double-check, please allow me to confirm. Is my understanding correct?

This PR fixes if input data contains JSON data and uses column type nvarchar in a column_opions, write JSON string instead of null string. This problem was introduced #320. (embulk-output-sqlserver: v0.10.3 )

example config

out:
  type: sqlserver
  column_options:
    my_col_1: {type: 'NVARCHAR'}

@dmikurube, @hito4t Please take a look when you get a chance.

dmikurube commented 5 months ago

@t3t5u Setting a JSON value into a stringified NString sounds basically more reasonable than setting a JSON value always into null NString, but what we are always concerned are about :

  1. compatibility "with earlier versions"
  2. (in case of JDBC plugins) unexpected impacts on other types of JDBC plugins (e.g. you showed an example with SQL Server, but how about MySQL, PostgreSQL, or any other possible JDBC plugins like Oracle, DB2, other third-party.)

Just making a small change can easily cause an incompatibility like you experienced in #320 as shown here. #320 looked compatible enough, but it was actually not.

Please confirm more with (try iterating over) possible combinations of input Embulk data types x output RDB data types x JDBC driver types.

t3t5u commented 5 months ago

Thank you for your response.

@hiroyuki-sato

This PR fixes if input data contains JSON data and uses column type nvarchar in a column_opions, write JSON string instead of null string. This problem was introduced #320. (embulk-output-sqlserver: v0.10.3 )

This understanding is correct.

@dmikurube

Please confirm more with (try iterating over) possible combinations of input Embulk data types x output RDB data types x JDBC driver types.

Okay. But please give me some time.

t3t5u commented 5 months ago

@dmikurube @hiroyuki-sato

The following is the results of the confirmation.

First, NCHAR or NVARCHAR has no meaning in MySQL, PostgreSQL, Redshift, Oracle, or Db2, only in SQL Server.

Even if syntactically available, NCHAR or NVARCHAR is essentially Unicode-restricted CHAR or VARCHAR, and as far as I can find out, there is no type that maps to java.sql.Types.NCHAR, java.sql.Types.NVARCHAR, java.sql.Types.LONGNVARCHAR, or java.sql.Types.NCLOB on JDBC, like in SQL Server.

References:

And I have confirmed MySQL, PostgreSQL, Redshift, and SQL Server using the following scripts.

Please confirm the results.

hiroyuki-sato commented 5 months ago

@t3t5u Thank you for your work. Can you show us the test results just in case?

I read this file change. but It doesn't contain test results. https://github.com/trocco-io/embulk-output-jdbc/pull/9/files

t3t5u commented 5 months ago

@hiroyuki-sato

Added raw result logs (credentials are masked). https://github.com/trocco-io/embulk-output-jdbc/pull/9/commits/3d57644e9e6c533cbfcb5918b506797c95671a17

Thank you.

hiroyuki-sato commented 5 months ago

Hello, @t3t5u Thank you for your work.

Questions

I confirmed that all of test results are complete the same this description. https://github.com/trocco-io/embulk-output-jdbc/pull/9#issue-2204797736

Input

input data1

column value
test_boolean true
test_long 123
test_double 1.23
test_string あいうえお
test_timestamp 1999-12-31 23:59:59.000000 +0000
test_json { "キー": "値" }
test_json_text { "キー": "値" }
test_json_string { "キー": "値" }
test_json_nstring { "キー": "値" }

input data2

column value
test_boolean" false
test_long" 456
test_double" 4.56
test_string" かきくけこ
test_timestamp" 2000-01-01 00:00:00.000000 +0000
test_json [{ "キー1": "値1"},{"キー2": "値2"}]
test_json_text [{ "キー1": "値1"},{"キー2": "値2"}]
test_json_string [{ "キー1": "値1"},{"キー2": "値2"}]
test_json_nstring [{ "キー1": "値1"},{"キー2": "値2"}]
column embulk type column_options(MySQL,PostgreSQL) Redshift SQL Server
test_boolean" boolean
test_long" long
test_double" double
test_string" string
test_timestamp" timestamp (TIMESTAMPZ(PG),DATETIME(MySQL)) {type: TIMESTAMPZ } {type: DATETIME }
test_json json
test_json_text json {type: TEXT} {type: TEXT} {type: TEXT }
test_json_string json {type: TEXT, value_type: string } {type: 'VARCHAR(65535)', value_type: string} {type: 'VARCHAR(max)', value_type: string }
test_json_nstring json {type: TEXT, value_type: nstring } {type: 'VARCHAR(65535)', value_type: nstring} {type: 'VARCHAR(max)', value_type: nstring }

Test MySQL(default,8.3.0), PostgreSQL(default,42.7.3) and Redshift

input data1

column Before Before Comment
test_boolean true true
test_long 123 123
test_double 1.23 1.23
test_string あいうえお あいうえお
test_timestamp 1999-12-31 23:59:59.000000 +0000 1999-12-31 23:59:59.000000 +0000
test_json { "キー": "値" } { "キー": "値" }
test_json_text { "キー": "値" } { "キー": "値" }
test_json_string { "キー": "値" } { "キー": "値" }
test_json_nstring null { "キー": "値" } FIXED THIS

input data2

column value value comment
test_boolean" false false
test_long" 456 456
test_double" 4.56 4.56
test_string" かきくけこ かきくけこ
test_timestamp" 2000-01-01 00:00:00.000000 +0000 2000-01-01 00:00:00.000000 +0000
test_json [{ "キー1": "値1"},{"キー2": "値2"}] [{ "キー1": "値1"},{"キー2": "値2"}]
test_json_text [{ "キー1": "値1"},{"キー2": "値2"}] [{ "キー1": "値1"},{"キー2": "値2"}]
test_json_string [{ "キー1": "値1"},{"キー2": "値2"}] [{ "キー1": "値1"},{"キー2": "値2"}]
test_json_nstring null [{ "キー1": "値1"},{"キー2": "値2"}] FIXED THIS

SQL Server

input data1

column Before Before COMMENT
test_boolean true true
test_long 123 123
test_double 1.23 1.23
test_string あいうえお あいうえお
test_timestamp 1999-12-31 23:59:59.000000 +0000 1999-12-31 23:59:59.000000 +0000
test_json null { "キー": "値" } FIXED THIS
test_json_text { "??": "?" } { "??": "?" } Multibyte strings are garbled
test_json_string { "??": "?" } { "??": "?" } Multibyte strings are garbled
test_json_nstring null { "キー": "値" } FIXED THIS

input data2

column value value COMMENT
test_boolean" false false
test_long" 456 456
test_double" 4.56 4.56
test_string" かきくけこ かきくけこ
test_timestamp" 2000-01-01 00:00:00.000000 +0000 2000-01-01 00:00:00.000000 +0000
test_json null [{ "キー1": "値1"},{"キー2": "値2"}] FIXED THIS
test_json_text [{ "??1": "?1"},{"??2": "?2"}] [{ "??1": "?1"},{"??2": "?2"}] Multibyte strings are garbled
test_json_string [{ "??1": "?1"},{"??2": "?2"}] [{ "??1": "?1"},{"??2": "?2"}] Multibyte strings are garbled
test_json_nstring null [{ "キー1": "値1"},{"キー2": "値2"}] FIXED THIS

result.tgz

t3t5u commented 5 months ago

@hiroyuki-sato

What version set null value when column_options set to value_type: nstring?

Since early versions, if value_type is explicitly set to nstring, json columns have been null in all JDBC plugins.

Before #320, if value_type was not set, there was no type that mapped to nstring by default. After #320, even if value_type is not set, json (and string) columns are now mapped to nstring by default in sqlserver.

Do we need to check other json values like below?

Updated test scripts & logs.

In summary, json with nstring will always be null before the fix, and will only be null if the input is null after the fix.

What do you think to add this test code to this project?

I don't think this scrips are suitable for use in CI.

I think it should only be used as a reference.

hiroyuki-sato commented 5 months ago

Hello, @t3t5u

I confirmed this fix looking good for me. 👍 I have a confirm about versioning. (What next version number)

@dmikurube , @hito4t Please take a look when you get a chance.

Inptut data

input data1

column value
test_boolean null
test_long null
test_double null
test_string null
test_timestamp null
test_json null
test_json_text null
test_json_string null
test_json_nstring null

input data2

column value
test_boolean null
test_long null
test_double null
test_string null
test_timestamp null
test_json 12345
test_json_text 12345
test_json_string 12345
test_json_nstring 12345

input data3

column value
test_boolean null
test_long null
test_double null
test_string null
test_timestamp null
test_json 123.45
test_json_text 123.45
test_json_string 123.45
test_json_nstring 123.45

input data4

column value
test_boolean null
test_long null
test_double null
test_string null
test_timestamp null
test_json hoge
test_json_text hoge
test_json_string hoge
test_json_nstring hoge

MySQL, PostgreSQL, Redshift

column input before after
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json null null null
test_json_text null null null
test_json_string null null null
test_json_nstring null null null

input data2

column input before after
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json 12345 12345 12345
test_json_text 12345 12345 12345
test_json_string 12345 12345 12345
test_json_nstring 12345 null 12345 FIX THIS

input data3

column input before after
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json 123.45 123.45 123.45
test_json_text 123.45 123.45 123.45
test_json_string 123.45 123.45 123.45
test_json_nstring 123.45 null 123.45 FIX THIS

input data4

column value value value
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json hoge hoge hoge
test_json_text hoge hoge hoge
test_json_string hoge hoge hoge
test_json_nstring hoge null hoge FIX THIS

SQL Server

column input before after
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json null null null
test_json_text null null null
test_json_string null null null
test_json_nstring null null null

input data2

column input before after
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json 12345 null 12345 FIX THIS
test_json_text 12345 12345 12345
test_json_string 12345 12345 12345
test_json_nstring 12345 null 12345 FIX THIS

input data3

column input before after
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json 123.45 null 123.45 FIX THIS
test_json_text 123.45 123.45 123.45
test_json_string 123.45 123.45 123.45
test_json_nstring 123.45 null 123.45 FIX THIS

input data4

column value value value
test_boolean null null null
test_long null null null
test_double null null null
test_string null null null
test_timestamp null null null
test_json hoge null hoge FIX THIS
test_json_text hoge hoge hoge
test_json_string hoge hoge hoge
test_json_nstring hoge null hoge FIX THIS
dmikurube commented 4 months ago

Merging...

hito4t commented 3 months ago

I also think the new specification is desirable. 👍