NCATComp410 / comp410_spring_2024

COMP410 spring 2024 semester
MIT License

Anonymization Quality #66

Closed Josiah-Small closed 7 months ago

Josiah-Small commented 7 months ago

While looking at the case and results files, I noticed a problem in the case text file. The statement in the case file is: “The customer from Singapore was unable to check in to the hotel due to an incorrect FIN on file. Corrected to G1122144L”. This statement is inconsistent with the results file and the format of the FIN number. Describing the person as a “customer from Singapore” implies that this individual is a citizen of Singapore. If this individual were from Singapore, the FIN number would start with an “S” or “T”. Only foreigners issued long-term passes from January 1, 2000, to December 31, 2021, are assigned the letter “G”. All of this can be found on Wikipedia.

We could fix how the presidio anonymizer handles partial intersection with the text. One way is to use regular expressions to validate anonymized FIN numbers, catching numbers that go against the expected format for their category, depending on whether the individual is a citizen, resident, or foreigner. Another way is to incorporate specific rules and patterns into how the FIN number gets anonymized. Adding prefix conditions for the different categories can help improve how the anonymization process functions.
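
As a rough illustration of the prefix conditions, here is a minimal sketch in plain Python. It is not part of presidio: the names `NRIC_FIN_PATTERN`, `PREFIX_CATEGORY`, and `classify_nric_fin` are hypothetical, and the mapping assumes the publicly described prefix scheme (S and T for citizens and permanent residents, F, G, and M for foreigners). It checks only the shape and the prefix, not the check letter.

```python
import re

# Assumed shape: one prefix letter, seven digits, one trailing check letter.
NRIC_FIN_PATTERN = re.compile(r"^([STFGM])(\d{7})([A-Z])$")

# Hypothetical mapping from prefix letter to holder category.
PREFIX_CATEGORY = {
    "S": "citizen_or_pr",
    "T": "citizen_or_pr",
    "F": "foreigner",
    "G": "foreigner",
    "M": "foreigner",
}


def classify_nric_fin(value):
    """Return the category implied by the prefix, or None if the shape is wrong."""
    match = NRIC_FIN_PATTERN.match(value.strip().upper())
    if match is None:
        return None
    return PREFIX_CATEGORY[match.group(1)]


if __name__ == "__main__":
    # The value from the case file has the prefix used for foreigners.
    print(classify_nric_fin("G1122144L"))
```

A check like this could flag text where the stated category and the prefix disagree, but it cannot tell whether the number itself was actually issued.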

hkhamvan263 commented 7 months ago

I agree that the second method is the better solution because it improves how the anonymization process functions and does NOT go against the expected format for the different categories, which depend on the individual's citizenship/residency status.

hkhamvan263 commented 7 months ago

I ran the scanner and redirected the output to results.txt. Afterwards, I found no anonymization issues or any other issues with PERSON.

Jamtyful commented 7 months ago

As far as the SG_NRIC_FIN error goes, a foreigner representing a Singaporean company could also be referred to as a "customer from Singapore". For example, they could be a foreign employee of a Singaporean company that is negotiating a contract with whoever wrote the message.

We can be reasonably sure that the FIN number listed is valid so it being anonymized is desired. Alternatively, the 'G' could have also been a typo for 'T' and the real FIN number could be deduced from that. In that case, the FIN being anonymized is also beneficial.

Jamtyful commented 7 months ago

When I ran the results, I only found that a mis-entered driver's license was anonymized.

"During the call it was found the ES NIF was set incorrectly. Updated from 5555555K to 12345678Z"

became:

"During the call it was found the ES NIF was set incorrectly. Updated from <US_DRIVER_LICENSE> to <ES_NIF>"

This could be avoided by requiring additional context words to be present before a match is accepted. However, this could have unintended results in other cases. For example, a customer might fill out a form incorrectly and enter their driver's license number into a different field (e.g. street address). A bad actor could use this information to deduce that the incorrect input is the client's license number. Anonymizing the value even without context catches clerical errors like these.
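
To make the trade-off concrete, here is a minimal sketch of context-word gating in plain Python. It is not presidio's implementation: `CANDIDATE`, `CONTEXT_WORDS`, `WINDOW`, and `find_license_numbers` are hypothetical names, and the pattern is a deliberately weak stand-in for a driver's license format.

```python
import re

# Hypothetical weak pattern: seven digits followed by one uppercase letter.
CANDIDATE = re.compile(r"\b\d{7}[A-Z]\b")

# Hypothetical context words that must appear shortly before the match.
CONTEXT_WORDS = ("driver", "license", "licence")

# How many characters before the match are searched for context words.
WINDOW = 40


def find_license_numbers(text):
    """Return candidate matches that have a context word in the preceding window."""
    hits = []
    lowered = text.lower()
    for match in CANDIDATE.finditer(text):
        before = lowered[max(0, match.start() - WINDOW):match.start()]
        if any(word in before for word in CONTEXT_WORDS):
            hits.append(match.group())
    return hits


if __name__ == "__main__":
    case = ("During the call it was found the ES NIF was set incorrectly. "
            "Updated from 5555555K to 12345678Z")
    print(find_license_numbers(case))
    print(find_license_numbers("Driver license on file: 5555555K"))
```

With gating like this, the value in the ES NIF sentence would no longer be picked up as a license number, but a license number typed into an unrelated field with no nearby context would also be left in the clear.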

kmwatts1 commented 7 months ago

I agree that using regular expressions is a good method for validating anonymized FIN numbers. Regular expressions provide a flexible and efficient way to ensure that the anonymized numbers adhere to the expected format for each category of individuals.

hkhamvan263 commented 7 months ago

@kmwatts1 I am assuming that EMAIL_ADDRESS does not have any issues after redirecting the output to results.txt.

Josiah-Small commented 7 months ago

@hkhamvan263 When checking for issues, EMAIL_ADDRESS appears to be working correctly. The format and structure of each email match the case.txt output.

kmwatts1 commented 7 months ago

No issues were found with EMAIL_ADDRESS; it was correctly anonymized.