Update vector support to work done in JAVA-3061, add JSON codecs

absurdfarce commented 1 year ago

Update dsbulk to work with latest changes in JAVA-3061

absurdfarce commented 1 year ago

Local testing looks good:

mvn pacakge -Prelease
bin/dsbulk load -url "./../vector_test_data_string.csv" -k test -t foo
bin/dsbulk unload -k test -t foo
bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
truncate test.foo;
bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
bin/dsbulk unload -k test -t foo -c json

Jenkins build currently failing since the relevant Java driver hasn't been published yet.

absurdfarce commented 1 year ago

Following the example of what was outlined in the earlier PR:

Given the following table:

cqlsh> describe test.foo;

CREATE TABLE test.foo (
    i int PRIMARY KEY,
    j vector<float, 3>
);

CREATE CUSTOM INDEX ann_index ON test.foo (j) USING 'StorageAttachedIndex';
cqlsh> select * from test.foo;

 i | j
---+---

(0 rows)

With the changes in this PR we can now load and unload JSON data into this table:

$ cat ../vector_test_data_json/one.json 
{
    "i":1,
    "j":[8, 2.3, 58]
}
$ cat ../vector_test_data_json/two.json 
{
    "i":2,
    "j":[1.2, 3.4, 5.6]
}
$ cat ../vector_test_data_json/five.json 
{
    "i":5,
    "j":[23, 18, 3.9]
}
$ bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      0 |     16 | 37.18 | 39.58 |  39.58 |    1.00
...
$ bin/dsbulk unload -k test -t foo -c json
...
{"i":5,"j":[23.0,18.0,3.9]}
{"i":1,"j":[8.0,2.3,58.0]}
{"i":2,"j":[1.2,3.4,5.6]}
total | failed | rows/s | p50ms | p99ms | p999ms
    3 |      0 |     14 |  2.58 |  2.87 |   2.87
...

Data on the server side matches up to what we'd expect:

cqlsh> select * from test.foo;

 i | j
---+-----------------
 5 |   [23, 18, 3.9]
 1 |    [8, 2.3, 58]
 2 | [1.2, 3.4, 5.6]

(3 rows)
cqlsh> select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1;

 j
-----------------
 [1.2, 3.4, 5.6]

(1 rows)

msmygit commented 1 year ago

bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"

Shouldn't this be as below?

bin/dsbulk unload -query "select j from test.foo where j ORDER BY ann of [3.4, 7.8, 9.1] limit 1"

absurdfarce commented 1 year ago

Good find @msmygit ! I originally put this code together using what is now a quite old build of datastax/cassandra. I think the "order by" syntax was added after I did my original work so you are quite right, what I had above is now out-of-date. Based on some experiments with cqlsh on a version built from source as of about an hour ago it looks like the correct syntax now is actually the following:

select j from test.foo ORDER BY j ann of [3.4, 7.8, 9.1] limit 1;

The good news is that the CQL parser contained by dsbulk appears to be fine with that syntax as well:

$ bin/dsbulk unload -query "select j from test.foo order by j ann of [3.4, 7.8, 9.1] limit 1"
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/UNLOAD_20230714-061839-089434
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      5 |  9.40 |  9.44 |   9.44

Does all that hang together for you?

datastax / dsbulk

Update vector support to work done in JAVA-3061, add JSON codecs #480