Closed absurdfarce closed 1 year ago
Local testing looks good:
mvn pacakge -Prelease
bin/dsbulk load -url "./../vector_test_data_string.csv" -k test -t foo
bin/dsbulk unload -k test -t foo
bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
truncate test.foo;
bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
bin/dsbulk unload -k test -t foo -c json
Jenkins build currently failing since the relevant Java driver hasn't been published yet.
Following the example of what was outlined in the earlier PR:
Given the following table:
cqlsh> describe test.foo;
CREATE TABLE test.foo (
i int PRIMARY KEY,
j vector<float, 3>
);
CREATE CUSTOM INDEX ann_index ON test.foo (j) USING 'StorageAttachedIndex';
cqlsh> select * from test.foo;
i | j
---+---
(0 rows)
With the changes in this PR we can now load and unload JSON data into this table:
$ cat ../vector_test_data_json/one.json
{
"i":1,
"j":[8, 2.3, 58]
}
$ cat ../vector_test_data_json/two.json
{
"i":2,
"j":[1.2, 3.4, 5.6]
}
$ cat ../vector_test_data_json/five.json
{
"i":5,
"j":[23, 18, 3.9]
}
$ bin/dsbulk load -url "./../vector_test_data_json" -k test -t foo -c json
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
3 | 0 | 16 | 37.18 | 39.58 | 39.58 | 1.00
...
$ bin/dsbulk unload -k test -t foo -c json
...
{"i":5,"j":[23.0,18.0,3.9]}
{"i":1,"j":[8.0,2.3,58.0]}
{"i":2,"j":[1.2,3.4,5.6]}
total | failed | rows/s | p50ms | p99ms | p999ms
3 | 0 | 14 | 2.58 | 2.87 | 2.87
...
Data on the server side matches up to what we'd expect:
cqlsh> select * from test.foo;
i | j
---+-----------------
5 | [23, 18, 3.9]
1 | [8, 2.3, 58]
2 | [1.2, 3.4, 5.6]
(3 rows)
cqlsh> select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1;
j
-----------------
[1.2, 3.4, 5.6]
(1 rows)
bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
Shouldn't this be as below?
bin/dsbulk unload -query "select j from test.foo where j ORDER BY ann of [3.4, 7.8, 9.1] limit 1"
Good find @msmygit ! I originally put this code together using what is now a quite old build of datastax/cassandra. I think the "order by" syntax was added after I did my original work so you are quite right, what I had above is now out-of-date. Based on some experiments with cqlsh on a version built from source as of about an hour ago it looks like the correct syntax now is actually the following:
select j from test.foo ORDER BY j ann of [3.4, 7.8, 9.1] limit 1;
The good news is that the CQL parser contained by dsbulk appears to be fine with that syntax as well:
$ bin/dsbulk unload -query "select j from test.foo order by j ann of [3.4, 7.8, 9.1] limit 1"
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/UNLOAD_20230714-061839-089434
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
1 | 0 | 5 | 9.40 | 9.44 | 9.44
Does all that hang together for you?
Update dsbulk to work with latest changes in JAVA-3061