Open abesolberg opened 2 years ago
Thank you for reporting this issue. I'm happy to announce that it should soon work.
I have a working update in https://github.com/computationalprivacy/CorrectMatch.jl/tree/upgrading-deps that now passes all the tests on Julia 1.7 and macOS (https://github.com/computationalprivacy/CorrectMatch.jl/runs/7675628441)
Hi @abesolberg, would you be able to test with the new release? This should be fixed now.
Thanks for looking into this, @cynddl. I am no longer getting the error message, but the individual_uniqueness function is only outputting 0.0. I'm on Windows 10.
Could this be the normal output to expect for your datasets? The tests pass as usual for individual_uniqueness, so I'm hoping the code works well.
@cynddl I don't believe so. It's outputting zero for the readme example, and for a number of other datasets I've tested it with.
We're getting there, I just pushed a new version on the master branch. :) Seems like it now works on some architectures and not others: https://github.com/computationalprivacy/CorrectMatch.jl/actions/runs/2805373612
That worked! Thank you so much for sorting this out. Really appreciate it. Thanks again.
Hi @cynddl, sorry to reopen this, but it looks like the new functions are still a little funky. I have been exploring further using the example notebook at examples/demonstration-notebook.ipynb. Version 1.1 of CorrectMatch appears to be overestimating the individual_uniqueness of the non-unique individual.
indiv = data[12, :]
6-element Vector{Int64}: 30 1 7 0 2 1
shifted_indiv = indiv - minimum(data, dims = 1)[:] .+ 1
6-element Vector{Int64}: 14 2 8 1 3 2
individual_uniqueness(G, shifted_indiv, N)
0.9999982449923798
I'll add that I'm not 100% sure I'm doing the shift correctly. Julia syntax appears to have changed a bit since the example was written, so I had to add dims = 1 and .+ to the shift. This might be the cause, but I don't think it is. I've been exploring with some other datasets, and it does seem like the individual uniqueness is being significantly overestimated for relatively non-distinct observations.
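For anyone checking the shift itself, here is a minimal sketch in plain Julia. It assumes a toy matrix (hypothetical values, not the adults dataset) whose column minima match the ones implied by the output above. On Julia 1.x, minimum needs the dims keyword to reduce column-wise and returns a 1×M matrix, hence the [:] to flatten it; adding the scalar 1 to a vector needs the broadcasting .+ operator.

```julia
# Toy data (hypothetical values, chosen so the column minima are 17, 0, 0, 0, 0, 0).
data = [30 1 7 0 2 1;
        17 0 0 1 0 0;
        45 1 3 5 4 4]

# Column-wise minimum: `dims = 1` reduces over rows and yields a 1×6 matrix.
col_min = minimum(data, dims = 1)

# `[:]` flattens the 1×6 matrix into a 6-element vector so it lines up with a row.
indiv = data[1, :]
shifted_indiv = indiv - col_min[:] .+ 1   # parsed as (indiv - col_min[:]) .+ 1

println(shifted_indiv)   # [14, 2, 8, 1, 3, 2]
```

The shift maps each column's minimum to 1, so the changed syntax by itself should not alter the result.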
Thanks for the report @abesolberg! I'll try to have a look soon.
Hello @cynddl and @abesolberg. I'm trying to get this to run myself and I'm encountering the same behavior mentioned above: individual_uniqueness is overestimated for individuals that should not be unique. Did either of you find a solution to the problem, by chance?
Thanks in advance :)
Hi @NikolaiKorti, thanks for your message. Could you please share a small reproducible code example so I can check it and understand what you expected from individual_uniqueness(·)?
Hey, thanks for the quick reply @cynddl. I've been trying multiple things, first of all recreating the notebook, specifically the section "Unlikely unique individual".
Just like @abesolberg, I had to add dims = 1 and .+ to the line with the shift. I get a number in the 0.99 region, where your example gave 0.0002859441553556916.
Also, running this small example:
using CorrectMatch
A = [1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1; 2 3 4]
println(uniqueness(A)) #0.14285714285714285
G = fit_mle(GaussianCopula, A)
println(individual_uniqueness(G, [1, 1, 1], 7)) #0.9948205521670765
println(individual_uniqueness(G, [5, 6, 7], 7)) #1.0
println(individual_uniqueness(G, [2, 3, 4], 7)) #0.999999879258077
The individual_uniqueness it outputs for [1, 1, 1] is 0.99, which seems incorrect: this should be a very non-unique individual.
Using julia version 1.6.7
Another little note: the tests are running fine. When the data contains only [1 1 1] entries, the value of individual_uniqueness is correctly 0.0:
using CorrectMatch
A = [1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1]
println(uniqueness(A)) #0.0
G = fit_mle(GaussianCopula, A)
println(individual_uniqueness(G, [1, 1, 1], 6)) #0.0
I tried the same using julia version 1.7.0 resulting in the same behavior.
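As a cross-check on the uniqueness values in the two examples above, here is a minimal sketch that counts, in plain Julia, the fraction of rows occurring exactly once. The function name empirical_uniqueness is made up for illustration, and this is not CorrectMatch's implementation; it only reproduces the 1/7 and 0.0 figures reported above.

```julia
# Fraction of rows that appear exactly once in the matrix.
# `empirical_uniqueness` is a hypothetical helper, not part of CorrectMatch.
function empirical_uniqueness(A::AbstractMatrix)
    counts = Dict{Vector{eltype(A)}, Int}()
    for row in eachrow(A)
        key = collect(row)                      # copy the row view into a plain vector
        counts[key] = get(counts, key, 0) + 1
    end
    n_unique = count(row -> counts[collect(row)] == 1, eachrow(A))
    return n_unique / size(A, 1)
end

A7 = [1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1; 2 3 4]
A6 = [1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1]

println(empirical_uniqueness(A7))   # 0.14285714285714285 (only [2, 3, 4] is unique)
println(empirical_uniqueness(A6))   # 0.0
```

By this count, [1, 1, 1] is shared by six of the seven rows, which is why a score near 1 for it looks wrong.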
You will get the result you expect using:
G = fit_mle(GaussianCopula, A; exact_marginal=true)
Without setting exact_marginal, the code tries to fit whatever distribution works best for each marginal, which is bound to fail on such a small dataset.
Thanks for the quick reply again, @cynddl. For the example, this did indeed work. However, I am still having problems recreating the example in the demonstration notebook.
I needed to make some adjustments to the code to be able to run it at all. This is what I execute, translated to a file:
using CorrectMatch
using StatsBase
using CSV
using DataFrames
using Distributions
df = CSV.read(open("adults.csv"), DataFrame)
df_sub = df[:,[:age, :sex, :workclass, :relationship, Symbol("marital-status"), :race]];
data = Array{Int}(df_sub)
N, M = size(data)
function extract_marginal_ordered(row::AbstractVector)
    cm = collect(values(countmap(row; alg=:dict)))
    Categorical(cm / sum(cm))
end
marginals = [extract_marginal_ordered(data[:, i]) for i=1:M];
G = fit_mle(GaussianCopula, marginals, data);
indiv = data[1, :] # 39 years old male with non Asian/Black/White race
print("Likely unique individual: ")
println(indiv)
shifted_indiv = indiv - minimum(data, dims=1)[:] .+ 1
print("Likely unique individual uniqueness-score: ")
println(individual_uniqueness(G, shifted_indiv, N))
indiv = data[12, :] # row 12: the individual the notebook treats as unlikely to be unique
print("Unlikely unique individual: ")
println(indiv)
shifted_indiv = indiv - minimum(data, dims=1)[:] .+ 1
print("Unlikely unique individual uniqueness-score: ")
println(individual_uniqueness(G, shifted_indiv, N))
And this is the output I get:
Likely unique individual: [39, 1, 7, 1, 4, 4]
Likely unique individual uniqueness-score: 0.9999922379260358
Unlikely unique individual: [30, 1, 7, 0, 2, 1]
Unlikely unique individual uniqueness-score: 0.9991543217638105
In your notebook, the unlikely individual at the bottom receives a uniqueness score of 0.0002859441553556916.
I also tried setting exact_marginal=false, fitting on data instead of marginals, and using the unshifted indiv. I always come up with a high score for the unlikely unique individual.
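One thing that may be worth checking in extract_marginal_ordered above (an assumption, not confirmed anywhere in this thread): values(...) on a Dict iterates in hash order, which is not guaranteed to follow the sorted category values, so the probabilities passed to Categorical may not line up with the shifted integer codes. A minimal sketch of building the frequencies in sorted key order, using only Base and a hypothetical helper name ordered_frequencies:

```julia
# Relative frequencies of each distinct value, listed in ascending value order.
# `ordered_frequencies` is a hypothetical helper, not part of CorrectMatch or StatsBase.
function ordered_frequencies(column::AbstractVector{<:Integer})
    counts = Dict{Int, Int}()
    for v in column
        counts[v] = get(counts, v, 0) + 1
    end
    ks = sort(collect(keys(counts)))            # explicit sort: Dict order is unspecified
    freqs = [counts[k] / length(column) for k in ks]
    return ks, freqs
end

ks, freqs = ordered_frequencies([3, 1, 3, 2, 3, 1])
println(ks)      # [1, 2, 3]
println(freqs)   # frequencies of 1, 2 and 3, in that order
```

With StatsBase, the same idea would be to sort the keys of the countmap before collecting the counts. Whether this explains the 0.99 scores is untested.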
Just leaving this here in case anyone else runs into the same problem: downgrading to Julia 1.3.1 solved the above-mentioned problem for me. I was able to run the example notebook and get the expected individual uniqueness scores.
I am unable to run the individual_uniqueness function, and am getting the error Could not load library "C:\Users\myuser\.julia\packages\CorrectMatch\3WzMC\src\..\deps\builds\mvndst" the specified module could not be found.
Yet
isfile("C:\Users\myuser\.julia\packages\CorrectMatch\3WzMC\src\..\deps\builds\mvndst")
is true, so the module should be there. I am using Julia version 1.7.3 and gfortran (GCC) version 11.3.0. Thanks in advance.