PolicyEngine / synthimpute

Python package for data synthesis and imputation using parametric and nonparametric methods, and evaluation of these methods.
MIT License

Add distance option to get 2nd and 3rd (or kth) nearest record #23

Open MaxGhenis opened 5 years ago

MaxGhenis commented 5 years ago

Return the distances to the 2nd and 3rd (or kth) nearest records, rather than only the current minimum.

From https://www.irs.gov/pub/irs-soi/07rppsweber.pdf:

With the distance-based algorithm, protection against reidentification is measured in terms of the number of PUF records that lie at least as close to a record from the population as the true match. The minimum protection that is sought is having at least two records that are at least as close to a record from the population as the true match, if the true match is in the PUF.
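The protection measure quoted above could be sketched roughly as follows. This is only an illustration, not synthimpute code: the function name `count_at_least_as_close`, the use of Euclidean distance, and the choice to exclude the true match from the count are all assumptions.

```python
import numpy as np

def count_at_least_as_close(population_record, puf, true_match_idx):
    """Count PUF records at least as close to population_record as the true match.

    Hypothetical helper: assumes Euclidean distance and excludes the
    true match itself from the count.
    """
    dists = np.linalg.norm(puf - population_record, axis=1)
    others = np.delete(dists, true_match_idx)
    return int(np.sum(others <= dists[true_match_idx]))

# Toy data: four PUF records with two features each.
puf = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
population_record = np.array([0.5, 0.0])

# Suppose record 2 is the true match; records 0 and 1 are both closer.
n = count_at_least_as_close(population_record, puf, true_match_idx=2)
print(n)
```

Under the quoted criterion, a count of at least two would meet the minimum protection sought.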

MaxGhenis commented 5 years ago

np.argpartition can do this: https://stackoverflow.com/a/34226816/1840471
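A minimal sketch of that approach, using a hypothetical helper `k_smallest_indices` on a toy distance vector (neither is part of synthimpute):

```python
import numpy as np

def k_smallest_indices(distances, k):
    """Return indices of the k smallest distances, in no guaranteed order."""
    # Partitioning on k - 1 puts the k smallest values in the first k slots.
    return np.argpartition(distances, k - 1)[:k]

distances = np.array([4.0, 0.5, 3.0, 1.5, 2.5, 0.1])
idx = k_smallest_indices(distances, 3)

# Compare as a set, since the order within the first k slots is not defined.
print(set(idx.tolist()))
```

This avoids a full sort of every distance vector, which is the appeal when only a few nearest records are needed.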

MaxGhenis commented 5 years ago

Largely added but the distances aren't coming out in the right order:

print(nearest[nearest.dist1 > nearest.dist3].shape[0])  # 129
print(nearest[nearest.dist1 < nearest.dist3].shape[0])  # 756

One of these should be zero: if dist1 through dist3 were in ascending order, dist1 would never exceed dist3.

MaxGhenis commented 5 years ago

From numpy.argpartition documentation:

Element index to partition by. The k-th element will be in its final sorted position and all smaller elements will be moved before it and all larger elements behind it. The order of all elements in the partitions is undefined. If provided with a sequence of k-th it will partition all of them into their sorted position at once.

So the k selected distances need to be re-sorted, either within nearest_record_single or once at the end (probably faster).
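One way the "re-sort at the end" option might look, assuming a precomputed distance matrix with one row per query record. The function `k_nearest_sorted` and the random data are hypothetical, not the package's actual implementation:

```python
import numpy as np

def k_nearest_sorted(dist_matrix, k):
    """Return (indices, distances) of the k nearest columns per row, ascending."""
    # Step 1: cheap partition; the first k columns hold the k smallest, unordered.
    part = np.argpartition(dist_matrix, k - 1, axis=1)[:, :k]
    part_d = np.take_along_axis(dist_matrix, part, axis=1)
    # Step 2: sort only those k values per row.
    order = np.argsort(part_d, axis=1)
    idx = np.take_along_axis(part, order, axis=1)
    d = np.take_along_axis(part_d, order, axis=1)
    return idx, d

rng = np.random.default_rng(0)
dist_matrix = rng.random((5, 20))  # 5 query records, 20 candidates
idx, d = k_nearest_sorted(dist_matrix, 3)

# After re-sorting, the first distance never exceeds the third.
print(int((d[:, 0] > d[:, 2]).sum()))
```

Sorting only k values per row keeps the extra cost small compared with a full sort of each row.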