danymat / INSAnonym-utils

Execute anonymization scripts on a table.
https://danymat.github.io/INSAnonym-utils
GNU General Public License v3.0

[Error]: POI Algorithm #86

Closed dat-lequoc closed 2 years ago

dat-lequoc commented 3 years ago

Hi, I found some "bugs" in the calculation of POIs and reported them to my instructor @ceichler yesterday. In this issue, I explain them in more detail.

1

In the condition functions that categorize whether a DateTime point is at Home, at Work, or at an Activity, the code uses pandas.Series.dt.dayofweek to determine the day of the week (Monday=0, Tuesday=1, ..., Sunday=6).

return [
            (df['date'].dt.dayofweek < 4) & (df.index.isin(df.between_time(night_start, night_end, include_start=False, include_end=False).index)),
            (df['date'].dt.dayofweek < 4) & (df.index.isin(df.between_time(work_start, work_end, include_start=False, include_end=False).index )),
            (df['date'].dt.dayofweek >= 4) & (df.index.isin(df.between_time(weekend_start, weekend_end, include_start=False, include_end=False).index ))
        ]

Normally, a weekend consists of Saturday and Sunday (dayofweek 5 and 6), so I think the threshold should be 5 instead of 4.

Solution:

return [
            (df['date'].dt.dayofweek < 5) & (df.index.isin(df.between_time(night_start, night_end, include_start=False, include_end=False).index)),
            (df['date'].dt.dayofweek < 5) & (df.index.isin(df.between_time(work_start, work_end, include_start=False, include_end=False).index )),
            (df['date'].dt.dayofweek >= 5) & (df.index.isin(df.between_time(weekend_start, weekend_end, include_start=False, include_end=False).index ))
        ]
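The day numbering can be checked with a minimal sketch like the one below. The four sample dates are chosen only for illustration; they run from Thursday 2015-01-01 to Sunday 2015-01-04.

```python
import pandas as pd

# Thursday 2015-01-01 through Sunday 2015-01-04 (illustrative sample only).
dates = pd.Series(pd.to_datetime(
    ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04"]
))
day_numbers = dates.dt.dayofweek.tolist()

# Current threshold: Friday (4) is classified as weekend.
weekend_with_4 = (dates.dt.dayofweek >= 4).tolist()
# Proposed threshold: only Saturday (5) and Sunday (6) are weekend.
weekend_with_5 = (dates.dt.dayofweek >= 5).tolist()

print(day_numbers, weekend_with_4, weekend_with_5)
```

With the threshold at 4, Friday falls into the weekend branch and is excluded from the Home and Work branches.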

Proof (note that 2015-01-01 is a Thursday!):

image

image

2

The code joins the anonymized table and the original one on the "ID" column, here. This contradicts the fact that we are supposed to alter the ID column ourselves.

Proof :

The two tables are identical except that I changed the IDs, and the utility drops to zero.

image

image

Resolution

  1. We could temporarily replace the ID column of the anonymized table with the real one, because the rows aren't shuffled.
  2. The server could alter the IDs of each anonymized submission itself.
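To illustrate the first option, here is a minimal sketch comparing a join on the ID column with a row-by-row alignment. The column names and the three sample rows are assumptions made for the example, not the repository's actual schema.

```python
import pandas as pd

# Hypothetical column names and sample rows, for illustration only.
original = pd.DataFrame({
    "id": [1, 1, 1],
    "date": pd.to_datetime([
        "2015-01-01 23:27:31",
        "2015-01-01 23:27:33",
        "2015-01-01 23:27:35",
    ]),
    "x": [1, 1, 1],
    "y": [1, 1, 1],
})

# Same rows in the same order; only the ID has been changed.
anonymized = original.assign(id=2)

# Joining on the ID finds no common key, so nothing is left to compare.
joined_on_id = original.merge(anonymized, on="id", suffixes=("_orig", "_anon"))

# Aligning by row position (the index) pairs each original row with its
# anonymized counterpart, whatever the ID is.
joined_by_row = original.join(anonymized.add_suffix("_anon"))

print(len(joined_on_id), len(joined_by_row))
```

This only works as long as the anonymized file keeps the rows in their original order.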

I look forward to your feedback @ceichler @beng-git @danymat . Thanks!

danymat commented 3 years ago

> In the condition functions that categorize whether a DateTime point is at Home, at Work, or at an Activity, the code uses pandas.Series.dt.dayofweek to determine the day of the week (Monday=0, Tuesday=1, ..., Sunday=6).

You're right! Thanks for pointing that out.

> The code joins the anonymized table and the original one on the "ID" column, here. This contradicts the fact that we are supposed to alter the ID column ourselves.

I don't think this is a bug; it's more of a feature. If the IDs change, the POIs of those individuals are lost to some degree.

dat-lequoc commented 3 years ago

> I don't think this is a bug; it's more of a feature. If the IDs change, the POIs of those individuals are lost to some degree.

Thank you for your response. I understand the reasoning, but it doesn't match the policy of the competition. I mean, we have to change the "ID" column because it is an identifier (or a quasi-identifier, if you prefer) according to the presentation.

If we alter the IDs in this case, the POI utility is 0.0 (or close to 0), which isn't the purpose of this utility. It should reward anonymizing the data while keeping the points of interest, not the IDs!

By the way, I have another question about this.

For example: on Thursday morning, we have 2 data points of a person who stays at home:

X   Y   DateTime
1   1   05:30:00
1   1   06:01:00

Could you tell me the time that this person spent at the point (1, 1)? And could you explain a bit about it? Thanks in advance.

ceichler commented 3 years ago

@danymat Thanks for the reply. To me, we cannot determine the POIs of a particular individual (i.e. a particular ID) in the anonymized DB by definition (otherwise the DB wouldn't be anonymized). Thus I was under the impression that this is not what the POI metric should measure. I was expecting it to measure, within a week, either:

"Given an individual in the ground truth, is there a pseudonym in the anonymized DB with the same POI (or with close POI)?"

"Given a pseudonym in the anonymized DB, how close am I, POI-wise, to the individual corresponding to the pseudonym?"

Or maybe even "Given a map associating each ID to a pseudoID/week maximizing POI closeness between the associated IDs, sum/average/compute std over the distances of the pairs".

Whatever we are measuring, I feel that a pseudonymized DB (one that hasn't been modified except for the pseudonymization), with a bijection from the space of (ID, week) pairs in the ground truth to the space of pseudoIDs in the DB, should have a POI utility of 1, since the POIs are very much preserved.
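As a minimal sketch of what such a pseudonymization could look like (the column names, sample rows, and the "p0", "p1", ... pseudonym format are assumptions made for the example):

```python
import pandas as pd

# Hypothetical ground-truth rows: two individuals, two weeks each.
original = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "date": pd.to_datetime([
        "2015-01-01 23:27:31",
        "2015-01-08 23:27:31",
        "2015-01-01 12:00:00",
        "2015-01-08 12:00:00",
    ]),
    "x": [1, 1, 5, 5],
    "y": [1, 1, 7, 7],
})

# ISO week number of each row.
keyed = original.assign(week=original["date"].dt.isocalendar().week)

# One integer per distinct (id, week) pair, turned into a pseudonym.
pair_number = keyed.groupby(["id", "week"]).ngroup()
pseudo_id = "p" + pair_number.astype(str)

# Only the ID column changes; dates and coordinates are untouched.
pseudonymized = original.assign(id=pseudo_id)

# The mapping is a bijection: as many pseudonyms as (id, week) pairs.
n_pairs = len(keyed[["id", "week"]].drop_duplicates())
n_pseudonyms = pseudonymized["id"].nunique()
print(n_pairs, n_pseudonyms)
```

Since every point and timestamp is kept, any POI computed per pseudonym matches the POI of the corresponding (ID, week) pair in the ground truth.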

Since this is not an implementation issue but an issue with the metric, I guess we should ask @beng-git.

Cheers

beng-git commented 3 years ago

This metric was developed by Antoine Boutet; I will ask him to answer here. However, I think this is an implementation problem, since the join should indeed be done on the line number and not on a column. I will check the code.

beng-git commented 3 years ago

The same problem seems to be present in utility_meet

dat-lequoc commented 3 years ago

> The same problem seems to be present in utility_meet

We have the same issue in utility_tuile too. I don't think the ID is needed to calculate this metric. https://github.com/danymat/INSAnonym-utils/blob/f92f36518bd74f48b3b3ef08fd17a6786c9e53d6/metrics/utility_tuile.py#L19-L23

beng-git commented 3 years ago

Yes, I also saw the problem with this metric. As posted in https://github.com/danymat/INSAnonym-utils/discussions/88, dateUtil, hourUtil and utility_distance should be OK (with one small remark: utility_distance gives a value of 1 if the subset of non-DEL lines is correct).

beng-git commented 3 years ago

Can you test this script?

https://benjamin-nguyen.fr/DARC/utility_POI_nodf.py

You will also need this library: https://benjamin-nguyen.fr/DARC/Utils.py

I have tried it with the following files.

Original file

1   2015-01-01 23:27:31 1   1
1   2015-01-01 23:27:33 1   1
1   2015-01-01 23:27:35 1   1

Anon file

2   2015-01-01 23:27:31 1   1
2   2015-01-01 23:27:33 1   1
2   2015-01-01 23:27:35 1   1

Score = 1

Anon file

2   2015-01-01 23:27:31 1   1
2   2015-01-01 23:27:33 1   1
DEL 2015-01-01 23:27:35 1   1

Score = 0.5

This is because the score takes into account the duration spent in a POI : if both durations (original and anon) are the same, the score is 1. If the duration is smaller or longer in one or the other case, then the score is proportional to the ratio of the durations (here the duration of original is 4 sec and the duration of anon is 2 sec, thus the score is 0.5). If a POI is detected with duration 0 whereas in fact the duration is not 0 then the score is 0.
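The rule described above can be sketched as follows. This is not the code of utility_POI_nodf.py: the helper names `stay_seconds` and `duration_score` are hypothetical, and returning `None` when both durations are zero is an assumption, since that case is not described here.

```python
from datetime import datetime

def stay_seconds(timestamps):
    """Seconds between the first and last observation (hypothetical helper)."""
    if not timestamps:
        return 0.0
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    return (max(parsed) - min(parsed)).total_seconds()

def duration_score(original_seconds, anon_seconds):
    """Ratio of the shorter duration to the longer one (hypothetical helper)."""
    shorter, longer = sorted((original_seconds, anon_seconds))
    if longer == 0:
        # Assumption: with no duration on either side there is nothing to compare.
        return None
    return shorter / longer

original = ["2015-01-01 23:27:31", "2015-01-01 23:27:33", "2015-01-01 23:27:35"]
anon_full = list(original)    # only the ID was changed
anon_cut = original[:2]       # last row replaced by DEL

print(duration_score(stay_seconds(original), stay_seconds(anon_full)))
print(duration_score(stay_seconds(original), stay_seconds(anon_cut)))
```

For the two test files above, this gives 4 s against 4 s in the first case and 4 s against 2 s in the second.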

beng-git commented 3 years ago

See https://github.com/danymat/INSAnonym-utils/discussions/88#discussioncomment-1621687 for the links to the 3 metrics.

beng-git commented 3 years ago

Temporary fix for the issue: the production server has been rolled back to version 2.3.1, which uses a different implementation of the metrics without dataframes. The metrics will be corrected at a later stage (possibly by one group during the second part of the project at INSA CVL).

dat-lequoc commented 2 years ago

Hi, I tried to understand how POI works but I just can't. I hope you can give me some explanations @ceichler @beng-git. Thanks! Here are some examples :

  1. image

  2. image

  3. image

ceichler commented 2 years ago

What scripts are you using? According to the previous comment, it seems to me that 1 & 2 should return 1.

Edit: this wasn't true. I read the DB too quickly and didn't realize the IDs were all different (see answer below).

dat-lequoc commented 2 years ago

I used the latest code, which Mr. Nguyen mentioned above.

dat-lequoc commented 2 years ago

I expected it to return 1 too, but that wasn't the case.

dat-lequoc commented 2 years ago

I have two more questions.

  1. Will the "DEL" rows be modified by the server before it publishes the anonymized data for the attack phase? Or does the server just shuffle the rows without changing these lines?

  2. Could you upload the script which checks the exception below, please? image

ceichler commented 2 years ago

> I have two more questions.
>
> 1. Will the "DEL" rows be modified by the server before it publishes the anonymized data for the attack phase? Or does the server just shuffle the rows without changing these lines?
>
> 2. Could you upload the script which checks the exception below, please?
>    ![image](https://user-images.githubusercontent.com/71183203/142756910-f099b316-159a-4f49-a665-782861e23293.png)

Please open dedicated discussions; this is not relevant to the current issue.

ceichler commented 2 years ago

The POI utility relies on the time spent in a tile ("tuile"), and in your first example an individual is never in the same tile twice.

Practically, I believe this means that diff_time(key, date_time, last_date_original_tab) is always 0 and your tables contain only datetime.timedelta(0). While computing the score, this means that we are always in the case:

 if time_second_original==0 and time_second_original==0:
                                    continue

And the score stays at 0.
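A simplified way to see this, as a sketch only: the "tile" column and the first-to-last duration rule below are assumptions made for the example, not the logic of diff_time in the actual script.

```python
import pandas as pd

def time_per_tile(df):
    """Time between first and last observation for each (id, tile) pair."""
    grouped = df.groupby(["id", "tile"])["date"]
    return grouped.max() - grouped.min()

dates = pd.to_datetime([
    "2015-01-01 08:00:00",
    "2015-01-01 09:00:00",
    "2015-01-01 10:00:00",
])

# Each tile is visited exactly once: every duration is zero.
never_twice = pd.DataFrame({"id": [1, 1, 1], "date": dates, "tile": ["a", "b", "c"]})
durations_never_twice = time_per_tile(never_twice)

# The individual is seen in tile "a" at 08:00 and again at 10:00.
revisited = pd.DataFrame({"id": [1, 1, 1], "date": dates, "tile": ["a", "b", "a"]})
durations_revisited = time_per_tile(revisited)

print(durations_never_twice)
print(durations_revisited)
```

In the first table no tile accumulates any time, so there is no POI to compare in either the original or the anonymized DB.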

Theoretically, I reckon this means that it is impossible to compute POIs in either your original or your anonymized DB. I thus do not think it is an implementation issue but rather a violation of the hypothesis on the original DB (it should allow the computation of POIs for the utility related to the preservation of POIs to make sense).

I suspect the other issues are similar. Let me know.

Best

dat-lequoc commented 2 years ago

> Please open dedicated discussions; this is not relevant to the current issue.

I'm sorry, I didn't think about it. Please find it at discussion #95.

> Theoretically, I reckon this means that it is impossible to compute POIs in either your original or your anonymized DB. I thus do not think it is an implementation issue but rather a violation of the hypothesis on the original DB (it should allow the computation of POIs for the utility related to the preservation of POIs to make sense).

That makes sense. Thanks for your quick response.