DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra®, DataStax Astra and DataStax Enterprise (DSE).
Cannot import multiple values in a map<T,T> column using CSV files #479
Disclaimer
I have managed to overcome the problem described below with a workaround, so this issue is intended to highlight an unexpected behaviour of the loader.
The "problem"
I have defined a table in Apache Cassandra which has a column of type map<smallint, blob>, and if I try to insert a record that has more than one entry in that column the loader fails with a parsing error.
Development environment
I have configured an Apache Cassandra cluster (3 nodes) using Docker Compose (Docker Desktop 4.19.0) with the DataStax image "datastax/dse-server:6.8.34-ubi7".
I am a Windows user (Windows 11 Home 22H2), so I am using the WSL 2 backend, configured via the .wslconfig file with 8 GB of RAM and 4 virtual processors.
The folder "dsbulk-1.10.0" has been copied into the "/opt/dse/" directory of one of these nodes.
Database schema
I have defined my database schema using the following CQL commands:
Keyspace
CREATE KEYSPACE IF NOT EXISTS ks WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': '3' };
CONSISTENCY QUORUM;
Table
CREATE TABLE IF NOT EXISTS test (
col_1 blob,
col_2 smallint,
col_3 map<smallint, blob>,
col_4 map<smallint, tinyint>,
PRIMARY KEY ((col_1), col_2)
) WITH CLUSTERING ORDER BY (col_2 ASC);
The unexpected behaviour
Inside the "/opt/dse/dsbulk-1.10.0/bin" directory I have copied the .csv file, containing the following record:
I have tried to load it using the following command:
root@node1:~/dsbulk-1.10.0/bin# ./dsbulk load -url ./file.csv -k ks -t test -header true
but it fails, and inside the "connector-errors.log" file I found the following error:
java.lang.IllegalArgumentException: Expecting record to contain 4 fields but found 6.
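For illustration only (the actual file is not reproduced above, and the values below are hypothetical), a record in which both map columns hold two entries would be split into exactly 6 fields by a naive comma split, which matches the error:

col_1,col_2,col_3,col_4
0x0a,1,{1:0x01,2:0x02},{1:1,2:2}

Splitting the second line blindly on commas yields 0x0a, 1, {1:0x01, 2:0x02}, {1:1 and 2:2}, i.e. six fields instead of four. (The exact map literal syntax may differ depending on codec settings; what matters here is the unescaped commas between map entries.)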
To my eyes it seems the parser is unable to tell the commas that separate fields from the commas inside the map values. Am I missing something?
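Reading RFC 4180, it seems that a field containing the delimiter must be enclosed in double quotes, so perhaps the record should have been written like this instead (same hypothetical values as above):

0x0a,1,"{1:0x01,2:0x02}","{1:1,2:2}"

With the map fields quoted, the line should parse as 4 fields even with the default comma delimiter.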
The workaround
I simply replaced all the field-separating commas (leaving untouched those inside the map columns) with a custom character (e.g. "|") and then passed the parameter -delim '|' to the loader, as sketched below.
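A minimal sketch of the workaround, again with hypothetical values. The file, rewritten with "|" as the field separator:

col_1|col_2|col_3|col_4
0x0a|1|{1:0x01,2:0x02}|{1:1,2:2}

is loaded with:

./dsbulk load -url ./file.csv -k ks -t test -header true -delim '|'

Since the map values no longer contain the field delimiter, each line parses as 4 fields.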
My question
Did I produce a syntactically non-compliant .csv file in the first place, or is this a parser error?
Thanks in advance to anyone who can enlighten me.