DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra®, DataStax Astra, and DataStax Enterprise (DSE).
DSBulk DELETE cannot accept ranges on the clustering column when used within -query #490
Executing a SELECT query with conditions on these columns works in cqlsh, but attempting a similar operation with DSBulk fails because a range cannot be specified on the clustering key sequence_no. For instance, in cqlsh, the following query succeeds:

```
cqlsh> Select * FROM biswa_ks.live_mom_transaction WHERE str_num='006' and day_no=7 and transaction_uid='14d6f78c-c468-4b71-84ce-0a8009bf3dc1' and trans_typ= 'SALE' and sequence_no > '0';

 str_num | day_no | transaction_uid                      | trans_typ | sequence_no | created_date                    | raw_data                       | updated_date
---------+--------+--------------------------------------+-----------+-------------+---------------------------------+--------------------------------+---------------------------------
     006 |      7 | 14d6f78c-c468-4b71-84ce-0a8009bf3dc1 |      SALE |     seq-188 | 2024-06-05 00:55:41.789000+0000 | {"item":"sample","price":9.99} | 2024-06-05 01:55:41.789000+0000
```
However, replicating this as a deletion with DSBulk, using the command below, results in an error because DSBulk does not accept a range condition on the clustering column sequence_no, even though cqlsh can handle it:

```
cat /home/automaton/data/output-000001.csv|./dsbulk load -query "DELETE FROM biswa_ks.live_mom_transaction WHERE str_num=:str_num and day_no=:day_no and transaction_uid=:transaction_uid and trans_typ=:trans_typ and sequence_no > '0' "
Operation directory: /home/automaton/dsbulk-1.9.0/bin/logs/LOAD_20240312-155253-850242
Operation LOAD_20240312-155253-850242 failed: Missing required primary key column sequence_no from schema.mapping or schema.query.
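```

For reference, range deletes on clustering columns are valid CQL, so the equivalent statement (using the values from the SELECT above) runs fine in cqlsh; it is only DSBulk's query validation that rejects it:

```
-- Works in cqlsh: a single range delete covering every sequence_no > '0'
DELETE FROM biswa_ks.live_mom_transaction
WHERE str_num = '006'
  AND day_no = 7
  AND transaction_uid = '14d6f78c-c468-4b71-84ce-0a8009bf3dc1'
  AND trans_typ = 'SALE'
  AND sequence_no > '0';
```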
This also means that DSBulk-generated deletes (which use the full primary key rather than a clustering range) can create a large number of tombstones. For example, with an average of 10 sequence_no values per (str_num, day_no, transaction_uid, trans_typ) combination, DSBulk writes 10x the tombstones that a single range tombstone would cover.
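A sketch of the workaround that does pass DSBulk's validation, assuming the input CSV contains a sequence_no column to bind, is to delete by full primary key (accepting the tombstone amplification just described):

```
# Same command as above, but binding sequence_no instead of using a range,
# so every primary key column appears in the query as DSBulk requires
cat /home/automaton/data/output-000001.csv|./dsbulk load -query "DELETE FROM biswa_ks.live_mom_transaction WHERE str_num=:str_num and day_no=:day_no and transaction_uid=:transaction_uid and trans_typ=:trans_typ and sequence_no=:sequence_no"
```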
The same SELECT works in DSBulk (unload):

```
[dse_test_kmip11:0] automaton@ip-10-166-76-196:~/dsbulk-1.9.0/bin$ ./dsbulk unload -url /home/automaton/data -query "Select * FROM biswa_ks.live_mom_transaction WHERE str_num='006' and day_no=7 and transaction_uid='068cfdf3-ae5b-48f1-a8de-27b70e2c3daf' and trans_typ= 'SALE' and sequence_no > '0' "
Operation directory: /home/automaton/dsbulk-1.9.0/bin/logs/UNLOAD_20240312-202339-691820
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      3 | 20.51 | 20.58 |  20.58
Operation UNLOAD_20240312-202339-691820 completed successfully in less than one second.
```
In short: DSBulk DELETE cannot accept any range on the clustering column, whereas DSBulk SELECT (unload) can, and cqlsh handles both.
For context, this is the table from which the customer needs to delete data.
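The CREATE TABLE statement was not included in the issue; the following is a hypothetical reconstruction inferred from the queries and the error message above. The column types and the partition/clustering split are assumptions, except that sequence_no must be the last clustering column:

```
-- Hypothetical schema inferred from the queries above, not the customer's actual DDL
CREATE TABLE biswa_ks.live_mom_transaction (
    str_num         text,
    day_no          int,
    transaction_uid text,
    trans_typ       text,
    sequence_no     text,      -- assumed text, since values like 'seq-188' are quoted
    created_date    timestamp,
    raw_data        text,
    updated_date    timestamp,
    PRIMARY KEY ((str_num, day_no), transaction_uid, trans_typ, sequence_no)
);
```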
A related Slack conversation is at https://datastax.slack.com/archives/C6B5L9GQN/p1710264412970939.