duckdb / duckdb_vss

MIT License

VSS bug: "Could not find node in column segment tree!" when adding about 300,000 rows. #19

Closed YJGit closed 4 weeks ago

YJGit commented 4 months ago

What happens?

I have built duckdb 0.10.3 from source on arm centos8 and tried the VSS extension of duckdb. When I add 100,000 rows, it works fine. However, when I add about 300,000 rows and run a select, it throws the error: "Could not find node in column segment tree! Attempting to find row number '31578880' in 107 nodes". Is there any advice?

Note: I also tried on x86 linux and got the same error.

To Reproduce

Here is the code to reproduce:

create table vsstst(id Bigint, emb float[128]);
insert into vsstst SELECT * FROM read_parquet('output.parquet') where id < 300000;
CREATE INDEX vss_hnsw_index ON vsstst USING HNSW (emb);
SELECT emb FROM vsstst ORDER BY array_distance(emb, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::FLOAT[128]) LIMIT 10;

Note: the data is randomly generated.
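As a sanity check that the query above is actually served by the HNSW index rather than a full table scan, one can inspect the query plan (a sketch; the exact operator name in the plan output may differ between duckdb_vss versions):

```sql
-- Prefixing the reproducer query with EXPLAIN shows the physical plan;
-- when the index is used, an HNSW index scan operator should appear
-- instead of a plain sequential scan followed by a top-N sort.
EXPLAIN SELECT emb FROM vsstst
ORDER BY array_distance(emb, [0.0, 16.0, 35.0, 5.0]::FLOAT[4]) LIMIT 10;
```

(The four-element vector here is just a placeholder for the full 128-element literal from the reproducer.)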

OS:

aarch64

DuckDB Version:

0.10.3

DuckDB Client:

the ./duckdb CLI built from source

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

szarnyasg commented 4 months ago

Hi @YJGit, thanks for opening this issue. Could you please provide a script to create the output.parquet file or share the file with us via other means (e.g. wetransfer)?

YJGit commented 4 months ago

Thanks for your reply, here is the complete code. For data generation:

import random

dim = 128
rows = 500000
vec_ids = list(range(rows))
vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]
print(vecs[0])  # note: we use this vector for the search below

vec_dic = {
  "id":vec_ids,
  "emb":vecs
}

from pandas import DataFrame
df_vec = DataFrame(vec_dic)
df_vec.to_parquet("output.parquet")
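Before loading the file into DuckDB, the generated data can be sanity-checked with only the standard library (a reduced-size sketch; `rows` is shrunk from 500000 so the check runs quickly):

```python
import random

dim, rows = 128, 1000  # reduced from 500000 for a fast check

# regenerate vectors exactly as in the script above
vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

# every row should have the declared dimensionality and values inside [-1, 1]
assert all(len(v) == dim for v in vecs)
assert all(-1.0 <= x <= 1.0 for v in vecs for x in v)
print(f"ok: {rows} vectors of dim {dim}")
```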

For vss search:

create table vsstst(id Bigint, emb float[128]);
insert into vsstst SELECT * FROM read_parquet('output.parquet') where id < 300000;
CREATE INDEX vss_hnsw_index ON vsstst USING HNSW (emb);
SELECT emb FROM vsstst ORDER BY array_distance(emb, vecs[0]::FLOAT[128]) LIMIT 10;

Note: we use the first generated vector (the one printed by the script) for the search; `vecs[0]` in the SQL above is a placeholder for its literal values.

JAicewizard commented 4 months ago

Probably a dup of #16

YJGit commented 3 months ago

Yes, I found it. Is there any suggested workaround in the meantime?

Maxxen commented 3 months ago

Hello! Thanks for the great reproducer script! I've spent some time looking into this and I think I have a fix in progress. I'll keep you updated once I know more.

Maxxen commented 4 weeks ago

This should now be fixed when using the latest nightly build of DuckDB, and will be fixed in the upcoming DuckDB v1.1 (scheduled for release next week).
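For anyone picking up the fix on an existing install, a sketch of the standard extension-upgrade workflow (assumed steps; after upgrading the DuckDB binary itself to the nightly or v1.1):

```sql
-- reinstall the extension so it matches the new DuckDB version
FORCE INSTALL vss;
LOAD vss;

-- rebuild the index and re-run the reproducer query
DROP INDEX IF EXISTS vss_hnsw_index;
CREATE INDEX vss_hnsw_index ON vsstst USING HNSW (emb);
```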