erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

How are the sparse datasets stored? #475

Open zlwu92 opened 8 months ago

zlwu92 commented 8 months ago

Hi Professor @maumueller,

As you mentioned, Kosarak and MovieLens-10M are sparse datasets, and they are packed in a format similar to scipy's CSR.

When I use the function at https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/distance.py#L104 to get the train and test objects, each of them is basically a list of np.ndarray rows, right?

I then found that the rows do not all have the same length, so the data is still stored in a compact form, right?

My guess is that each element in a row is the index of a non-zero element in the original sparse vector. Is that correct?

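import h5py
from ann_benchmarks.distance import dataset_transform

# h5_file: path to the downloaded dataset file
f = h5py.File(h5_file, 'r')
train, test = dataset_transform(f)
print(type(train))
print(len(train))
# Print the entries of the first training row
for i in train[0]:
    print(str(i), end=' ')
print()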
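
To make the guess above concrete, here is a minimal sketch on toy data of what such a compact layout could look like. It assumes the packing is one flat array of non-zero indices plus one array of per-row lengths, which is an assumption on my side and not something confirmed in this thread. The helpers split_rows and to_dense are hypothetical names used only for illustration; they are not part of ann-benchmarks.

```python
import numpy as np


def split_rows(flat_indices, row_lengths):
    """Split a flat index array into one array per row (hypothetical helper)."""
    ends = np.cumsum(row_lengths)      # exclusive end offset of each row
    starts = ends - row_lengths        # start offset of each row
    return [np.asarray(flat_indices[s:e]) for s, e in zip(starts, ends)]


def to_dense(row, dimension):
    """Expand a row of non-zero indices into a dense 0/1 vector (hypothetical helper)."""
    dense = np.zeros(dimension, dtype=np.uint8)
    dense[row] = 1                     # mark every listed position as non-zero
    return dense


# Toy data only; these values are NOT taken from Kosarak or MovieLens-10M.
flat_indices = np.array([0, 3, 4, 1, 2, 5, 6, 7])
row_lengths = np.array([3, 1, 4])

rows = split_rows(flat_indices, row_lengths)
print([len(r) for r in rows])  # [3, 1, 4]: rows have different lengths
print(to_dense(rows[0], 8))    # [1 0 0 1 1 0 0 0]
```

If the guess is right, each row of train would behave like rows[0] here: a short array of positions, which is why the rows differ in length.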
