Hi - I noticed an unexpected ordering of the x, y coordinates when converting the output of all_pairs to a pairwise similarity matrix. Here's the behavior that I see with identical sets:
import numpy as np
from SetSimilaritySearch import all_pairs
import random
nsets = 10
population = list(range(100))
sets = [set(population) for i in range(nsets)]
coords = all_pairs(sets, similarity_threshold=0)
arr = np.nan * np.empty((nsets, nsets))
x, y, z = zip(*coords)
arr[x, y] = z
print(np.round(arr, 2))
The output is a nice lower-triangular matrix:
[[nan nan nan nan nan nan nan nan nan nan]
[ 1. nan nan nan nan nan nan nan nan nan]
[ 1. 1. nan nan nan nan nan nan nan nan]
[ 1. 1. 1. nan nan nan nan nan nan nan]
[ 1. 1. 1. 1. nan nan nan nan nan nan]
[ 1. 1. 1. 1. 1. nan nan nan nan nan]
[ 1. 1. 1. 1. 1. 1. nan nan nan nan]
[ 1. 1. 1. 1. 1. 1. 1. nan nan nan]
[ 1. 1. 1. 1. 1. 1. 1. 1. nan nan]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. nan]]
But when the sets are not identical, the x and y indices seem to be ordered arbitrarily:
sets = [set(population) - set(random.choices(population, k=10)) for i in range(nsets)]
coords = all_pairs(sets, similarity_threshold=0)
arr = np.nan * np.empty((nsets, nsets))
x, y, z = zip(*coords)
arr[x, y] = z
print(np.round(arr, 2))
[[ nan nan nan nan nan nan nan nan nan nan]
[0.83 nan 0.83 0.81 0.81 nan 0.81 nan 0.87 nan]
[0.84 nan nan nan nan nan nan nan nan nan]
[0.8 nan 0.8 nan nan nan nan nan nan nan]
[0.84 nan 0.8 0.82 nan nan nan nan nan nan]
[0.84 0.83 0.82 0.82 0.84 nan 0.84 0.83 0.82 nan]
[0.84 nan 0.84 0.82 0.8 nan nan nan nan nan]
[0.81 0.84 0.83 0.87 0.81 nan 0.87 nan 0.83 nan]
[0.82 nan 0.82 0.82 0.82 nan 0.82 nan nan nan]
[0.82 0.87 0.82 0.86 0.84 0.84 0.84 0.85 0.82 nan]]
I can restore the lower-triangular matrix by adding the following line before assigning to the array:
x, y = zip(*[sorted(pair, reverse=True) for pair in zip(x, y)])
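For anyone who hits the same thing, the reordering can also be written as a small helper. This is only a sketch: `lower_triangular_pairs` is a name I made up (it is not part of SetSimilaritySearch), and the triples in the example are placeholder values rather than real all_pairs output. It assumes each result is an (index, index, similarity) tuple, as in the snippets above, and puts the larger index first so that every entry lands below the diagonal.

```python
def lower_triangular_pairs(pairs):
    """Yield (row, col, similarity) with row >= col for each input triple."""
    for i, j, sim in pairs:
        # Larger index becomes the row, smaller becomes the column,
        # so arr[row, col] always falls in the lower triangle.
        yield (max(i, j), min(i, j), sim)


# Placeholder triples standing in for all_pairs output (not real results).
example = [(1, 0, 0.83), (1, 2, 0.83), (5, 7, 0.83)]
normalized = list(lower_triangular_pairs(example))
print(normalized)
```

The normalized triples can then be unpacked with zip(*...) and assigned to the array exactly as before.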
So I can still accomplish what I want with minimal difficulty, but I thought I'd let you know because generating a pairwise similarity matrix seems like a common use case and the behavior is a bit surprising.
Thanks a lot for sharing this project!