duckdb / duckdb_vss

MIT License

VSS bug: "Could not find node in column segment tree!" when adding about 300,000 rows. #19

Closed YJGit closed 4 weeks ago

YJGit commented 4 months ago

What happens?

I have built duckdb 0.10.3 from source on arm centos8 and tried the VSS extension of duckdb. When I add 100,000 rows, it works fine. However, when I add about 300,000 rows and run a select, it throws the error: "Could not find node in column segment tree! Attempting to find row number '31578880' in 107 nodes". Is there any advice?

Note: I also tried on x86 linux and got the same error.

To Reproduce

Here is the code to reproduce:

create table vsstst(id Bigint, emb float[128]);
insert into vsstst SELECT * FROM read_parquet('output.parquet') where id < 300000;
CREATE INDEX vss_hnsw_index ON vsstst USING HNSW (emb);
SELECT emb FROM vsstst ORDER BY array_distance(emb, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::FLOAT[128]) LIMIT 10;

Note: the data is randomly generated.
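As a sanity check that the query above is actually served by the HNSW index rather than a full table scan, one can inspect the query plan (a sketch; the exact operator name in the plan output may differ between duckdb_vss versions):

```sql
-- Prefixing the reproducer query with EXPLAIN shows the physical plan;
-- when the index is used, an HNSW index scan operator should appear
-- instead of a plain sequential scan followed by a top-N sort.
EXPLAIN SELECT emb FROM vsstst
ORDER BY array_distance(emb, [0.0, 16.0, 35.0, 5.0]::FLOAT[4]) LIMIT 10;
```

(The four-element vector here is just a placeholder for the full 128-element literal from the reproducer.)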

OS:

aarch64

DuckDB Version:

0.10.3

DuckDB Client:

the ./duckdb CLI built from source

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

szarnyasg commented 4 months ago

Hi @YJGit, thanks for opening this issue. Could you please provide a script to create the output.parquet file or share the file with us via other means (e.g. wetransfer)?

YJGit commented 4 months ago

Thanks for your reply, here is the complete code. For data generation:

import random

dim = 128
rows = 500000
vec_ids = list(range(rows))
vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]
print(vecs[0])  # note: we use this vector for the search below

vec_dic = {
  "id":vec_ids,
  "emb":vecs
}

from pandas import DataFrame
df_vec = DataFrame(vec_dic)
df_vec.to_parquet("output.parquet")
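Before loading the file into DuckDB, the generated data can be sanity-checked with only the standard library (a reduced-size sketch; `rows` is shrunk from 500000 so the check runs quickly):

```python
import random

dim, rows = 128, 1000  # reduced from 500000 for a fast check

# regenerate vectors exactly as in the script above
vecs = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

# every row should have the declared dimensionality and values inside [-1, 1]
assert all(len(v) == dim for v in vecs)
assert all(-1.0 <= x <= 1.0 for v in vecs for x in v)
print(f"ok: {rows} vectors of dim {dim}")
```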

For vss search:

create table vsstst(id Bigint, emb float[128]);
insert into vsstst SELECT * FROM read_parquet('output.parquet') where id < 300000;
CREATE INDEX vss_hnsw_index ON vsstst USING HNSW (emb);
SELECT emb FROM vsstst ORDER BY array_distance(emb, vecs[0]::FLOAT[128]) LIMIT 10;

Note: we use the first generated vector (the one printed by the script) for the search; `vecs[0]` in the SQL above is a placeholder for its literal values.

JAicewizard commented 4 months ago

Probably a dup of #16

YJGit commented 3 months ago

Yes, I found it. Is there any suggested workaround in the meantime?

Maxxen commented 3 months ago

Hello! Thanks for the great reproducer script! I've spent some time looking into this and I think I have a fix in progress. I'll keep you updated once I know more.

Maxxen commented 4 weeks ago

This should now be fixed when using the latest nightly build of DuckDB, and will be fixed in the upcoming DuckDB v1.1 (scheduled for release next week).
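For anyone picking up the fix on an existing install, a sketch of the standard extension-upgrade workflow (assumed steps; after upgrading the DuckDB binary itself to the nightly or v1.1):

```sql
-- reinstall the extension so it matches the new DuckDB version
FORCE INSTALL vss;
LOAD vss;

-- rebuild the index and re-run the reproducer query
DROP INDEX IF EXISTS vss_hnsw_index;
CREATE INDEX vss_hnsw_index ON vsstst USING HNSW (emb);
```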