NCATComp410 / comp410_summer_2023

Repository for COMP-410 summer 2023
GNU General Public License v3.0

Missing Name Detections #35

Open ncholliday opened 1 year ago

ncholliday commented 1 year ago

The names on line 7 and line 12 are not detected when pii_scan is run.

ncholliday commented 1 year ago

I think it is not detecting the names because they are in all caps. You have to create a test for a name in caps.

jrbonds commented 1 year ago

Okay, I'll get it done

natalyapolite1 commented 1 year ago

@ncholliday I agree 100% with the resolution. This is an excellent point.

adomakk commented 1 year ago

Good work @ncholliday, I concur with your resolution.

adomakk commented 1 year ago

@claesmk could you assist us? PERSON is the entity used, but it's not anonymizing the names at the beginning of the document that are in all caps.

jrbonds commented 1 year ago

I've created a branch for the edit and I'm trying to commit, but I believe I'm having connection issues because it's taking abnormally long to commit the changes and create a pull request.

If worst comes to worst, I will try again in the morning at work. @claesmk

claesmk commented 1 year ago

Anonymizing names is always going to be tricky. Success depends on whether the model, in this case en_core_web_lg, has been trained on the kinds of names you plan to recognize.

I used ChatGPT to generate the fake legal brief for this assignment, and it actually did the correct thing - the names in the beginning of a brief like this typically do appear in all caps. I was actually surprised to see that the names did not get detected correctly. In software engineering it's important to look at what went wrong. In this case there was never a requirement to detect names in all caps. It also would have been hard to come up with a test case without that requirement.

In this case the issue is that en_core_web_lg does not reliably recognize names written in all caps. If you ask a more sophisticated model such as ChatGPT a few questions, the responses are interesting:

Do you think "MARY PETERSEN" is someone's name?

Yes, "MARY PETERSEN" could be someone's name. Names can vary greatly, and "MARY PETERSEN" follows the format of a typical first name followed by a surname. However, without further context, it is difficult to determine if "MARY PETERSEN" refers to a specific individual.

What about "MARY J. BLIGE"?

Yes, "Mary J. Blige" is a well-known name and refers to the American singer, songwriter, and actress Mary Jane Blige. She is a highly successful and influential figure in the music industry, particularly in the genres of R&B, soul, and hip-hop. Mary J. Blige has released numerous hit songs and albums throughout her career and has received multiple awards and accolades for her contributions to music.

What's interesting is that ChatGPT changed the capitalization in its response for the name it knew, but not for the one it did not.

Retraining en_core_web_lg to detect upper case names is beyond the scope of this assignment. A reasonable alternative is to locate these specific names in the text and convert them to a capitalization that en_core_web_lg handles, such as title case. If you were working with a real law firm, it would be reasonable to request a list of clients to make sure they all get detected correctly so you don't leak any PII. Something like this:

from presidio_anonymizer import AnonymizerEngine

# analyze_text() is assumed to be the analyzer helper already defined in this module


def anonymize_text(text: str):
    """Anonymize text using Presidio"""
    # The model does not handle all-caps names well, so convert known
    # client names from upper case to title case before analysis
    client_list = ['Mary', 'Petersen', 'John', 'Doe']
    for n in client_list:
        # str.replace also matches inside longer words (e.g. 'DOE' in 'DOES')
        text = text.replace(n.upper(), n)

    engine = AnonymizerEngine()

    # First analyze the text to be anonymized
    results = analyze_text(text)

    # Now anonymize the text
    anon = engine.anonymize(
        text=text,
        analyzer_results=results
    )
    return anon
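
One caveat with plain string replacement is that it also rewrites a name that happens to sit inside a longer word, so 'DOE' inside 'DOES' would become 'DoeS'. Below is a minimal sketch of a stricter variant that only matches whole words. The helper name normalize_known_names is hypothetical and not part of the repository, and the name list is the same example list used above.

```python
import re


def normalize_known_names(text: str, known_names: list[str]) -> str:
    """Rewrite all-caps occurrences of known names using the given casing."""
    for name in known_names:
        # \b word boundaries keep 'DOE' from matching inside 'DOES'
        pattern = r"\b" + re.escape(name.upper()) + r"\b"
        # A function replacement avoids any backslash processing of the name
        text = re.sub(pattern, lambda _match, repl=name: repl, text)
    return text


sample = "MARY PETERSEN DOES NOT KNOW JOHN DOE."
print(normalize_known_names(sample, ["Mary", "Petersen", "John", "Doe"]))
# prints: Mary Petersen DOES NOT KNOW John Doe.
```

The result of this helper could be passed to the analyzer in place of the loop inside anonymize_text. It also gives a natural place to hang the all-caps test case requested earlier in this thread.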

claesmk commented 1 year ago

@jrbonds are you planning to implement this?