datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra®, DataStax Astra, and DataStax Enterprise (DSE).

dsbulk unload stuck when -maxConcurrentFiles (write concurrency) is greater than 1 #463

Open thoongnv opened 1 year ago

thoongnv commented 1 year ago

dsbulk version: 1.10.0

I'm unloading 10,000,000 rows from a C* table using a LIMIT query:

dsbulk unload -query "SELECT col1, col2 FROM keyspace.table LIMIT 10000000" -maxRecords 1000000 -header false -verbosity high --connector.csv.compression gzip -url table.csv.gz

The command runs with a read concurrency of 1 and a write concurrency of 4. Checking the logs, I didn't find the usual "Operation UNLOAD_20230216-042948-286777 closed." line, and the dsbulk process is still visible when checking with ps aux:

     total | failed | rows/s | p50ms |  p99ms | p999ms
10,000,000 |      0 | 97,745 | 50.37 | 167.77 | 289.41
Operation UNLOAD_20230216-042948-286777 completed successfully in 1 minute and 45 seconds.
Operation UNLOAD_20230216-042948-286777 closing.
Done writing file:/app/table.csv.gz/output-000011.csv.gz
Done writing file:/app/table.csv.gz/output-000009.csv.gz
Done writing file:/app/table.csv.gz/output-000010.csv.gz
Done writing file:/app/table.csv.gz/output-000012.csv.gz

This bug does not occur on dsbulk version 1.9.1, or when setting -maxConcurrentFiles 1.
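For anyone hitting the same hang, a possible workaround based on the observation above is to rerun the same unload but pin the write concurrency to a single file. This is just the reporter's original command with -maxConcurrentFiles 1 added (all other flags unchanged); it trades parallel file writing for a clean shutdown:

```
# Workaround sketch: same unload as above, but force a single output file
# writer, since the hang was only observed with write concurrency > 1.
dsbulk unload \
  -query "SELECT col1, col2 FROM keyspace.table LIMIT 10000000" \
  -maxRecords 1000000 \
  -maxConcurrentFiles 1 \
  -header false \
  -verbosity high \
  --connector.csv.compression gzip \
  -url table.csv.gz
```

Note that with a single writer, throughput may be lower than the ~97k rows/s reported for the 4-writer run.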


maksonlee commented 7 months ago

Same issue here.