datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra(R), DataStax Astra and DataStax Enterprise (DSE)
Apache License 2.0
85 stars 30 forks source link

Update vector support to work done in JAVA-3061, add JSON codecs #480

Closed absurdfarce closed 1 year ago

absurdfarce commented 1 year ago

Update dsbulk to work with latest changes in JAVA-3061

absurdfarce commented 1 year ago

Local testing looks good:

mvn pacakge -Prelease
bin/dsbulk load -url "./../vector_test_data_string.csv" -k test -t foo
bin/dsbulk unload -k test -t foo
bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
truncate test.foo;
bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
bin/dsbulk unload -k test -t foo -c json

Jenkins build currently failing since the relevant Java driver hasn't been published yet.

absurdfarce commented 1 year ago

Following the example of what was outlined in the earlier PR:

Given the following table:

cqlsh> describe test.foo;

CREATE TABLE test.foo (
    i int PRIMARY KEY,
    j vector<float, 3>
);

CREATE CUSTOM INDEX ann_index ON test.foo (j) USING 'StorageAttachedIndex';
cqlsh> select * from test.foo;

 i | j
---+---

(0 rows)

With the changes in this PR we can now load and unload JSON data into this table:

$ cat ../vector_test_data_json/one.json 
{
    "i":1,
    "j":[8, 2.3, 58]
}
$ cat ../vector_test_data_json/two.json 
{
    "i":2,
    "j":[1.2, 3.4, 5.6]
}
$ cat ../vector_test_data_json/five.json 
{
    "i":5,
    "j":[23, 18, 3.9]
}
$ bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
...
$ bin/dsbulk unload -k test -t foo -c json
...
{"i":5,"j":[23.0,18.0,3.9]}
{"i":1,"j":[8.0,2.3,58.0]}
{"i":2,"j":[1.2,3.4,5.6]}
total | failed | rows/s | p50ms | p99ms | p999ms
    3 |      0 |     14 |  2.58 |  2.87 |   2.87
...

Data on the server side matches up to what we'd expect:

cqlsh> select * from test.foo;

 i | j
---+-----------------
 5 |   [23, 18, 3.9]
 1 |    [8, 2.3, 58]
 2 | [1.2, 3.4, 5.6]

(3 rows)
cqlsh> select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1;

 j
-----------------
 [1.2, 3.4, 5.6]

(1 rows)
msmygit commented 1 year ago

bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"

Shouldn't this be as below?

bin/dsbulk unload -query "select j from test.foo where j ORDER BY ann of [3.4, 7.8, 9.1] limit 1"
absurdfarce commented 1 year ago

Good find @msmygit ! I originally put this code together using what is now a quite old build of datastax/cassandra. I think the "order by" syntax was added after I did my original work so you are quite right, what I had above is now out-of-date. Based on some experiments with cqlsh on a version built from source as of about an hour ago it looks like the correct syntax now is actually the following:

select j from test.foo ORDER BY j ann of [3.4, 7.8, 9.1] limit 1;

The good news is that the CQL parser contained by dsbulk appears to be fine with that syntax as well:

$ bin/dsbulk unload -query "select j from test.foo order by j ann of [3.4, 7.8, 9.1] limit 1"
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/UNLOAD_20230714-061839-089434
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      5 |  9.40 |  9.44 |   9.44

Does all that hang together for you?