Closed: dat-lequoc closed this issue 2 years ago
In the condition functions which categorize whether a DateTime point is at Home, Work, Activity or not, the code uses pandas.Series.dt.dayofweek to know if it's Monday, Tuesday, ... or Sunday (Monday=0, Tuesday=1, ..., Sunday=6).
You're right! Thanks for pointing that out.
The code joins the anonymized table and the original one on the "ID" column, here. This contradicts the fact that we perturb the ID column ourselves.
I don't think this is a bug; it is more of a feature. If the ID changes, their POIs are lost to some degree.
Thank you for your response. I understand that, but it isn't the policy of the competition. I mean, we have to change the "ID" column because it is an identifier (or a quasi-identifier, if you prefer) according to the presentation.
If we perturb the IDs in this case, the POI utility drops to 0.0 (or nearly 0), which isn't the purpose of this utility: anonymizing the data while keeping the points of interest, not the IDs!
By the way, I have another question about this.
For example: on Thursday morning, we have 2 data points of a person who stays at home (columns X, Y, DateTime):
1 1 05:30:00
1 1 06:01:00
Could you tell me how much time this person spent at the point (1, 1)? And could you explain a bit how it is computed? Thanks in advance.
@danymat Thanks for the reply. To me, we cannot determine the POIs of a particular individual (i.e. a particular ID) in the anonymized DB by definition (otherwise the DB wouldn't be anonymized). Thus I was under the impression that this is not what the POI metric should measure. I was expecting it to measure, within a week, either:
"Given an individual in the ground truth, is there a pseudonym in the anonymized DB with the same POI (or with close POI)?"
"Given a pseudonym in the anonymized DB, how close am I POI-wise to the individual corresponding to the pseudonym?"
Or maybe even "Given a map associating each ID to a pseudoID/week maximizing POI closeness between the associated IDs, sum/average/compute std over the distance of each couple"
Whatever we are measuring, I feel like a pseudonymized DB (one that hasn't been modified except for pseudonymization) with a bijection from the space of couples (ID, week) in the ground truth to the space of pseudoIDs in the DB should have a utility of 1 POI-wise, since the POIs are very much preserved.
Since this is not an implementation issue but an issue with the metric, I guess we should ask @beng-git
Cheers
This metric was developed by Antoine Boutet; I will ask him to answer here. However, I think this is an implementation problem, since the join must indeed be done on the line number and not on the columns. I will check the code.
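To make the difference between the two join strategies concrete, here is a minimal sketch in plain Python. It is not the repository's code: the helper names `pair_by_id` and `pair_by_line` are hypothetical, and it assumes that the anonymized file keeps the same number of lines in the same order as the original, with deleted lines marked by `DEL` in the ID column.

```python
# Rows are (id, datetime, x, y); the values mirror the small test files used in this thread.
original = [
    ("1", "2015-01-01 23:27:31", 1, 1),
    ("1", "2015-01-01 23:27:33", 1, 1),
    ("1", "2015-01-01 23:27:35", 1, 1),
]
anonymized = [
    ("2", "2015-01-01 23:27:31", 1, 1),
    ("2", "2015-01-01 23:27:33", 1, 1),
    ("2", "2015-01-01 23:27:35", 1, 1),
]


def pair_by_id(orig_rows, anon_rows):
    """Pair rows that share the same ID (the behaviour reported in this issue)."""
    return [(o, a) for o in orig_rows for a in anon_rows if o[0] == a[0]]


def pair_by_line(orig_rows, anon_rows):
    """Pair rows by line number, skipping lines marked DEL (assumed convention)."""
    return [(o, a) for o, a in zip(orig_rows, anon_rows) if a[0] != "DEL"]


# Joining on ID finds nothing once the IDs are changed,
# while joining on the line number still aligns every row.
print(len(pair_by_id(original, anonymized)))    # 0
print(len(pair_by_line(original, anonymized)))  # 3
```

With an ID join, any change to the ID column empties the comparison set, which matches the zero score reported above. A line-number join does not depend on the ID values at all.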
The same problem seems to be present in utility_meet
We have the same issue for utility_tuile too. I think it isn't necessary to use ID for calculating this metric. https://github.com/danymat/INSAnonym-utils/blob/f92f36518bd74f48b3b3ef08fd17a6786c9e53d6/metrics/utility_tuile.py#L19-L23
Yes, I also saw the problem with this metric. As posted in https://github.com/danymat/INSAnonym-utils/discussions/88, dateUtil, hourUtil and utility_distance should be OK (with the small caveat that utility_distance gives a value of 1 if the subset of non-DEL lines is correct).
Can you test this script ?
https://benjamin-nguyen.fr/DARC/utility_POI_nodf.py
You will also need this library : https://benjamin-nguyen.fr/DARC/Utils.py
I have tried with the following files:
Original file
1 2015-01-01 23:27:31 1 1
1 2015-01-01 23:27:33 1 1
1 2015-01-01 23:27:35 1 1
Anon file
2 2015-01-01 23:27:31 1 1
2 2015-01-01 23:27:33 1 1
2 2015-01-01 23:27:35 1 1
Score = 1
Anon file
2 2015-01-01 23:27:31 1 1
2 2015-01-01 23:27:33 1 1
DEL 2015-01-01 23:27:35 1 1
Score = 0.5
This is because the score takes into account the duration spent in a POI : if both durations (original and anon) are the same, the score is 1. If the duration is smaller or longer in one or the other case, then the score is proportional to the ratio of the durations (here the duration of original is 4 sec and the duration of anon is 2 sec, thus the score is 0.5). If a POI is detected with duration 0 whereas in fact the duration is not 0 then the score is 0.
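The rule described above can be sketched as follows. This is only an illustration based on the description in this comment, not the code of utility_POI_nodf.py: the function names `dwell_seconds` and `duration_score` are hypothetical, the duration is assumed to be the time between the first and last point, and returning `None` when both durations are zero is an assumption about how such cases are skipped.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def dwell_seconds(timestamps):
    """Seconds between the first and last timestamp (0 if fewer than two points)."""
    if len(timestamps) < 2:
        return 0.0
    times = sorted(datetime.strptime(t, FMT) for t in timestamps)
    return (times[-1] - times[0]).total_seconds()


def duration_score(orig_seconds, anon_seconds):
    """Ratio of the shorter duration to the longer one, in [0, 1]."""
    if orig_seconds == 0 and anon_seconds == 0:
        return None  # assumption: nothing to compare, the pair is skipped
    return min(orig_seconds, anon_seconds) / max(orig_seconds, anon_seconds)


original = ["2015-01-01 23:27:31", "2015-01-01 23:27:33", "2015-01-01 23:27:35"]
anon_with_del = original[:2]  # the third line was replaced by DEL

print(dwell_seconds(original))       # 4.0
print(dwell_seconds(anon_with_del))  # 2.0
print(duration_score(dwell_seconds(original), dwell_seconds(anon_with_del)))  # 0.5
```

Taking the ratio of the smaller to the larger duration keeps the score symmetric: an anonymized duration that is too long is penalized the same way as one that is too short.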
See https://github.com/danymat/INSAnonym-utils/discussions/88#discussioncomment-1621687 for the links to the 3 metrics.
Temporary fix for the issue : the production server has been rolled back to version 2.3.1 with a different implementation of the metrics without dataframes. Correcting the metrics will be done at a later stage (possibly by one group during the second part of the project at INSA CVL)
Hi, I tried to understand how POI works but I just can't. I hope you can give me some explanations @ceichler @beng-git. Thanks! Here are some examples :
What scripts are you using? According to the previous comment, it seems to me that 1 & 2 should return 1. edit: ^ this wasn't true, I read the DB too quickly and didn't realize the IDs were all different (see answer below)
I used the latest code, which Mr. Nguyen mentioned above.
I expected it to return 1 too, but that wasn't the case.
I have two more questions.
1. Will the "DEL" rows be modified by the server before it publishes the anonymized data for the attack phase? Or does the server just shuffle the rows without changing these lines?
2. Could you upload the script which checks the exception below, please? ![image](https://user-images.githubusercontent.com/71183203/142756910-f099b316-159a-4f49-a665-782861e23293.png)
Please open dedicated discussions, this is not relevant to the current issue
The POI utility relies on the time spent in a tile (tuile), and in your first example an individual is never in the same tile twice.
Practically, I believe that this means that diff_time(key, date_time, last_date_original_tab) is always 0 and your tables contain only datetime.timedelta(0). While computing the score, this means that we are always in the case:
    if time_second_original==0 and time_second_original==0:
        continue
And the score stays at 0.
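A small sketch of why every dwell time collapses to zero in such a table. This does not reproduce the actual `diff_time` logic: `dwell_per_tile` is a hypothetical helper, the rows are made-up illustrative values, and the sketch assumes rows are in chronological order and that the time spent in a tile is the span between the first and last point seen in it.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Illustrative rows (id, datetime, tile): the individual never visits a tile twice.
rows = [
    ("1", "2015-01-01 08:00:00", (1, 1)),
    ("1", "2015-01-01 09:00:00", (2, 2)),
    ("1", "2015-01-01 10:00:00", (3, 3)),
]


def dwell_per_tile(table):
    """Time between the first and last point of each (id, tile) pair."""
    first, last = {}, {}
    for ident, stamp, tile in table:
        key = (ident, tile)
        moment = datetime.strptime(stamp, FMT)
        first.setdefault(key, moment)  # keep the earliest point only
        last[key] = moment             # overwrite with the latest point
    return {key: last[key] - first[key] for key in first}


durations = dwell_per_tile(rows)
# Each tile holds a single point, so every duration is timedelta(0)
# and no pair of durations can ever contribute to the score.
print(all(d == timedelta(0) for d in durations.values()))  # True
```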
Theoretically, I reckon this means that it is impossible to compute POIs in either your original or your anonymized DB. I thus do not think it is an implementation issue but rather a violation of the hypothesis on the original DB (it should allow the computation of POIs for the utility related to the preservation of POIs to make sense).
I suspect the other issues are similar. Let me know.
Best
I'm sorry, I didn't think about it. Please find it at discussion #95 .
That makes sense. Thanks for your quick response.
Hi, I found some "bugs" in the calculation of POIs and reported them to my instructor @ceichler yesterday. In this issue, I explain them in more detail.
1
In the condition functions which categorize whether a DateTime point is at Home, Work, Activity or not, the code uses pandas.Series.dt.dayofweek to know if it's Monday, Tuesday, ... or Sunday (Monday=0, Tuesday=1, ..., Sunday=6).
Normally, a weekend is composed of Saturday and Sunday. So, I think it should be :
Solution :
Proof : (note that 2015-01-01 is a Thursday!)
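As a quick check of the day numbering, here is a minimal sketch using only the standard library. `datetime.date.weekday()` follows the same Monday=0 ... Sunday=6 convention as `pandas.Series.dt.dayofweek`. The helper name `is_weekend` is hypothetical and not taken from the repository.

```python
from datetime import date

WEEKEND_DAYS = {5, 6}  # Saturday and Sunday under the Monday=0 convention


def is_weekend(day: date) -> bool:
    """Return True when `day` falls on a Saturday or a Sunday."""
    return day.weekday() in WEEKEND_DAYS


print(date(2015, 1, 1).weekday())    # 3 (Thursday)
print(is_weekend(date(2015, 1, 1)))  # False
print(is_weekend(date(2015, 1, 3)))  # True (Saturday)
print(is_weekend(date(2015, 1, 4)))  # True (Sunday)
```

With pandas, the equivalent test on a datetime column would be `series.dt.dayofweek >= 5`.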
2
The code joins the anonymized table and the original one on the "ID" column, here. This contradicts the fact that we perturb the ID column ourselves.
Proof :
The 2 tables are identical; I only changed the IDs, and the result is zero.
Resolution
I look forward to your feedback @ceichler @beng-git @danymat . Thanks!