datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading data into and unloading data from Apache Cassandra(R), DataStax Astra, and DataStax Enterprise (DSE).

Cannot import multiple values in a map<T,T> column using CSV files #479

Open · danieljaderpellattiero opened this issue 1 year ago

danieljaderpellattiero commented 1 year ago

Disclaimer

I have managed to overcome the problem described below using a workaround, so this issue is intended to highlight an unexpected behaviour of the loader.

The "problem"

I have defined a table in Apache Cassandra with a column of type map<smallint, blob>. If I try to insert a record with more than one entry in that column, the loader fails with a parsing error.

Development environment

I have configured a 3-node Apache Cassandra cluster using Docker Compose (Docker Desktop 4.19.0) with the DataStax image datastax/dse-server:6.8.34-ubi7. I am a Windows user (Windows 11 Home 22H2), so I am using the WSL 2 backend, configured with 8 GB of RAM and 4 virtual processors via the .wslconfig file. The dsbulk-1.10.0 folder has been copied into the /opt/dse/ directory of one of the nodes.
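Roughly, the Compose file looks like the following simplified sketch (the service names and the SEEDS variable are illustrative, not copied from my actual setup; DS_LICENSE=accept is required by the datastax/dse-server image):

services:
  node1:
    image: datastax/dse-server:6.8.34-ubi7
    environment:
      - DS_LICENSE=accept   # required to start the DSE container
  node2:
    image: datastax/dse-server:6.8.34-ubi7
    environment:
      - DS_LICENSE=accept
      - SEEDS=node1         # assumed seed configuration
  node3:
    image: datastax/dse-server:6.8.34-ubi7
    environment:
      - DS_LICENSE=accept
      - SEEDS=node1         # assumed seed configuration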

Database schema

I have defined my database schema using the following CQL commands:

Keyspace

CREATE KEYSPACE IF NOT EXISTS ks WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': '3' };  
CONSISTENCY QUORUM; 

Table

CREATE TABLE IF NOT EXISTS test (  
col_1 blob,  
col_2 smallint,  
col_3 map<smallint, blob>,  
col_4 map<smallint, tinyint>,  
PRIMARY KEY ((col_1), col_2)  
) WITH CLUSTERING ORDER BY (col_2 ASC);
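For reference, inserting this kind of record directly in CQL would use map literals; a hand-written example (not taken from my actual data) would be:

INSERT INTO ks.test (col_1, col_2, col_3, col_4)
VALUES (
  0x0000000000000000000000000000000000000000,                                                -- col_1: blob
  1,                                                                                         -- col_2: smallint
  {1: 0x0000000000000000000000000000000000000000, 2: 0x0000000000000000000000000000000000000000},  -- col_3: map<smallint, blob>
  {1: 0, 2: 0}                                                                               -- col_4: map<smallint, tinyint>
);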

The unexpected behaviour

Inside the /opt/dse/dsbulk-1.10.0/bin directory I have copied the .csv file, containing the following record:

col_1,col_2,col_3,col_4
0x0000000000000000000000000000000000000000,1,{"1":"0x0000000000000000000000000000000000000000","2":"0x0000000000000000000000000000000000000000"},{"1":"0","2":"0"}

I have tried to load it using the following command:

root@node1:~/dsbulk-1.10.0/bin# ./dsbulk load -url ./file.csv -k ks -t test -header true

but it fails, and inside the connector-errors.log file I found the following error:

java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 6.

To my eyes it seems like the parser is unable to contextualize the commas inside the .csv file. Am I missing something?
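In case it helps, this is what the record would look like with standard RFC 4180 quoting, where each field containing commas is wrapped in double quotes and embedded double quotes are doubled (I have not tested whether the loader accepts this form):

col_1,col_2,col_3,col_4
0x0000000000000000000000000000000000000000,1,"{""1"":""0x0000000000000000000000000000000000000000"",""2"":""0x0000000000000000000000000000000000000000""}","{""1"":""0"",""2"":""0""}"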

The workaround

I simply replaced all the commas (except those inside the map columns) with a custom character (e.g. "|") and then passed the parameter -delim '|' to the parser, as shown below.
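Put together, the pipe-delimited record and the load command look like this:

col_1|col_2|col_3|col_4
0x0000000000000000000000000000000000000000|1|{"1":"0x0000000000000000000000000000000000000000","2":"0x0000000000000000000000000000000000000000"}|{"1":"0","2":"0"}

root@node1:~/dsbulk-1.10.0/bin# ./dsbulk load -url ./file.csv -k ks -t test -header true -delim '|'

With this setup the load completes without errors.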

My question

Did I produce a non-compliant .csv file in the first place, or is this a parser error?

I thank in advance anyone who can enlighten me about it.
