ekzhu / SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop
Apache License 2.0

ordering of x,y coordinates when converting to identity matrix #1

Closed nhoffman closed 5 years ago

nhoffman commented 5 years ago

Hi - I noticed an unexpected ordering of x,y coordinates when converting the output of all_pairs to an identity matrix. Here's the behavior that I see with identical sets:

import random

import numpy as np
from SetSimilaritySearch import all_pairs

nsets = 10
population = list(range(100))

# Ten identical sets, so every pair has similarity 1.
sets = [set(population) for _ in range(nsets)]
coords = all_pairs(sets, similarity_threshold=0)

# Fill an nsets x nsets matrix from the (x, y, similarity) triples.
arr = np.full((nsets, nsets), np.nan)
x, y, z = zip(*coords)
arr[x, y] = z
print(np.round(arr, 2))

The output is a nice lower-triangular matrix.

[[nan nan nan nan nan nan nan nan nan nan]
 [ 1. nan nan nan nan nan nan nan nan nan]
 [ 1.  1. nan nan nan nan nan nan nan nan]
 [ 1.  1.  1. nan nan nan nan nan nan nan]
 [ 1.  1.  1.  1. nan nan nan nan nan nan]
 [ 1.  1.  1.  1.  1. nan nan nan nan nan]
 [ 1.  1.  1.  1.  1.  1. nan nan nan nan]
 [ 1.  1.  1.  1.  1.  1.  1. nan nan nan]
 [ 1.  1.  1.  1.  1.  1.  1.  1. nan nan]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1. nan]]

But when the sets are not identical, the x and y indices seem to be ordered arbitrarily:

# Each set drops up to 10 randomly chosen elements, so the sets now differ.
sets = [set(population) - set(random.choices(population, k=10)) for _ in range(nsets)]
coords = all_pairs(sets, similarity_threshold=0)

arr = np.full((nsets, nsets), np.nan)
x, y, z = zip(*coords)
arr[x, y] = z
print(np.round(arr, 2))

[[ nan  nan  nan  nan  nan  nan  nan  nan  nan  nan]
 [0.83  nan 0.83 0.81 0.81  nan 0.81  nan 0.87  nan]
 [0.84  nan  nan  nan  nan  nan  nan  nan  nan  nan]
 [0.8   nan 0.8   nan  nan  nan  nan  nan  nan  nan]
 [0.84  nan 0.8  0.82  nan  nan  nan  nan  nan  nan]
 [0.84 0.83 0.82 0.82 0.84  nan 0.84 0.83 0.82  nan]
 [0.84  nan 0.84 0.82 0.8   nan  nan  nan  nan  nan]
 [0.81 0.84 0.83 0.87 0.81  nan 0.87  nan 0.83  nan]
 [0.82  nan 0.82 0.82 0.82  nan 0.82  nan  nan  nan]
 [0.82 0.87 0.82 0.86 0.84 0.84 0.84 0.85 0.82  nan]]

I can restore the lower-triangular matrix by adding the following line before assigning to the array:

x, y = zip(*[sorted(pair, reverse=True) for pair in zip(x, y)])
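
The same normalization can be written as a small helper that puts the larger index first in every triple, which matches the ordering of the lower-triangular output shown earlier. This is only a sketch: the name `to_lower_triangular` is made up for illustration and is not part of SetSimilaritySearch, and the triples below are hand-written stand-ins for the output of all_pairs.

```python
def to_lower_triangular(triples):
    """Reorder each (x, y, similarity) triple so that x >= y.

    Hypothetical helper, not part of SetSimilaritySearch.
    """
    return [(max(x, y), min(x, y), sim) for x, y, sim in triples]


# Hand-written triples standing in for the output of all_pairs.
coords = [(0, 2, 0.84), (3, 1, 0.81), (1, 0, 0.83)]
normalized = to_lower_triangular(coords)
print(normalized)
```

The normalized triples can then be unpacked with `x, y, z = zip(*normalized)` and assigned with `arr[x, y] = z` as in the snippets above.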

So I can still accomplish what I want with minimal difficulty, but I thought I'd let you know because it seems like generating an identity matrix might be a common use case and the behavior is a bit surprising.
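
If a full matrix is wanted rather than one triangle, one possible approach is to write each similarity to both (x, y) and (y, x), which makes the order of the indices irrelevant. The sketch below assumes the triples have the (x, y, similarity) shape unpacked above. The helper name `to_symmetric_matrix` and its `diagonal` parameter are hypothetical, and the default of 1.0 on the diagonal assumes a measure where a non-empty set has similarity 1 with itself.

```python
import numpy as np


def to_symmetric_matrix(triples, n, diagonal=1.0):
    """Build an n x n symmetric matrix from (x, y, similarity) triples.

    Hypothetical helper, not part of SetSimilaritySearch.
    Pairs that do not appear in `triples` stay NaN.
    """
    arr = np.full((n, n), np.nan)
    for x, y, sim in triples:
        # Write both orientations so index order does not matter.
        arr[x, y] = sim
        arr[y, x] = sim
    # Assumed self-similarity; adjust for other measures.
    np.fill_diagonal(arr, diagonal)
    return arr


# Hand-written triples standing in for the output of all_pairs.
coords = [(1, 0, 0.83), (0, 2, 0.84)]
matrix = to_symmetric_matrix(coords, n=3)
print(np.round(matrix, 2))
```

Pairs below the similarity threshold are never reported, so they remain NaN in the result.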

Thanks a lot for sharing this project!

ekzhu commented 5 years ago

Thanks! I just fixed it. This is an interesting use case.