SamHollings / output_checker

Output Checker project
MIT License

Entity recognition on string #20

Open AdamHollings opened 1 year ago

AdamHollings commented 1 year ago

Make a function which performs entity recognition on a string.

spaCy would be a good place to start.

AdamHollings commented 1 year ago

[image attachment]

AdamHollings commented 1 year ago

To check for personally identifiable information (PII) in a string using spaCy, you can use the Matcher class and define a pattern to match PII entities such as names, emails, phone numbers, etc.

Here is an example of how you can do this:

import spacy
from spacy.matcher import Matcher

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Define a token pattern that matches email-like tokens
pattern = [{"LIKE_EMAIL": True}]

# Add the pattern to the matcher (the spaCy v3 API takes a list of patterns)
matcher.add("EMAIL_PATTERN", [pattern])

# Define a function to extract the matched emails
def extract_emails(doc):
    emails = []
    # Iterate over the matches
    for match_id, start, end in matcher(doc):
        # Get the matched span
        span = doc[start:end]
        # Append the matched email to the list
        emails.append(span.text)
    return emails

# Define the input string
string = "My email is example@gmail.com and my friend's email is friend@yahoo.com"

# Create a spaCy document from the string
doc = nlp(string)

# Extract the emails from the document
emails = extract_emails(doc)

# Print the emails
print(emails)
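The pieces above could be wrapped into a single boolean PII check. A minimal sketch, assuming a blank pipeline is sufficient (the function name `contains_pii` and the email-only pattern are illustrative, not part of the project):

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: LIKE_EMAIL is a lexical
# attribute, so no trained model needs to be downloaded.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

def contains_pii(text: str) -> bool:
    """Return True if the matcher finds any PII-like token in the text."""
    doc = nlp(text)
    return len(matcher(doc)) > 0

print(contains_pii("My email is example@gmail.com"))  # expect True
print(contains_pii("No contact details here"))        # expect False
```

Using `spacy.blank("en")` keeps the check fast; a full model like `en_core_web_sm` would only be needed once statistical entities (names, places) are added to the check.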
AdamHollings commented 1 year ago

Yes, the spaCy Matcher can search for multiple patterns at the same time. You can define multiple patterns and add them to the matcher using the add method.

Here is an example of how you can use the Matcher to search for multiple patterns in a string:

import spacy
from spacy.matcher import Matcher

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Define token patterns: one for emails, one for US-style phone
# numbers (the tokenizer splits "123-456-7890" on the hyphens)
email_pattern = [{"LIKE_EMAIL": True}]
phone_pattern = [
    {"SHAPE": "ddd"},
    {"ORTH": "-"},
    {"SHAPE": "ddd"},
    {"ORTH": "-"},
    {"SHAPE": "dddd"},
]

# Add both patterns to the matcher under one rule ID (spaCy v3 API)
matcher.add("EMAIL_PHONE_PATTERN", [email_pattern, phone_pattern])

# Define a function to extract the matched entities
def extract_entities(doc):
    entities = []
    # Iterate over the matches
    for match_id, start, end in matcher(doc):
        # Get the matched span
        span = doc[start:end]
        # Append the matched entity to the list
        entities.append(span.text)
    return entities

# Define the input string
string = "My email is example@gmail.com and my phone number is 123-456-7890."

# Create a spaCy document from the string
doc = nlp(string)

# Extract the entities from the document
entities = extract_entities(doc)

# Print the entities
print(entities)
AdamHollings commented 1 year ago

Yes, you can use multiple language models in spaCy at the same time by loading them separately and using them to process the text.

Here is an example of how you can do this:

import spacy

# Load the English and German models
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")

# Define a function to process text in English
def process_en(text):
    # Create a spaCy document from the text
    doc = nlp_en(text)
    # Print the language and the entities in the text
    print(f"Language: {doc.lang_}")
    print("Entities:")
    for ent in doc.ents:
        print(ent.text, ent.label_)

# Define a function to process text in German
def process_de(text):
    # Create a spaCy document from the text
    doc = nlp_de(text)
    # Print the language and the entities in the text
    print(f"Language: {doc.lang_}")
    print("Entities:")
    for ent in doc.ents:
        print(ent.text, ent.label_)

# Process some text in English
process_en("My name is John and I live in New York.")

# Process some text in German
process_de("Mein Name ist John und ich wohne in Berlin.")
SamHollings commented 1 year ago


This seems like the way to go initially. Keep it simple.

Taking the key part of this, we can get a dictionary of the counts of entities like so:

import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Count how many times each entity label occurs
ent_counts_dict = pd.Series([ent.label_ for ent in doc.ents]).value_counts().to_dict()
ent_counts_dict
AdamHollings commented 1 year ago

The matcher seems the best way to go. You can define what it should look for, and it can then return a list of just those matches, rather than reporting every entity that occurs in the string.
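That configurable approach might look something like the sketch below; the pattern set, rule names, and function name are illustrative assumptions, not the function committed on the branch:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline suffices: LIKE_EMAIL and LIKE_NUM are lexical
# attributes that don't need a trained model.
nlp = spacy.blank("en")

# Named rules, each a list of token patterns (illustrative examples)
PATTERNS = {
    "EMAIL": [[{"LIKE_EMAIL": True}]],
    "NUMBER": [[{"LIKE_NUM": True}]],
}

matcher = Matcher(nlp.vocab)
for name, pats in PATTERNS.items():
    matcher.add(name, pats)

def find_entities(text: str) -> dict:
    """Map each rule name to the list of spans it matched in the text."""
    doc = nlp(text)
    found = {}
    for match_id, start, end in matcher(doc):
        # Recover the rule name from the match hash
        rule = nlp.vocab.strings[match_id]
        found.setdefault(rule, []).append(doc[start:end].text)
    return found

print(find_entities("Contact me@example.com, I have 3 accounts"))
```

Adding a new thing to look for then only means adding an entry to `PATTERNS`, rather than touching the matching logic.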

AdamHollings commented 1 year ago

Have added the function to src in the entity recognition branch. It follows basically the same format as your suggestion above, although I think listing all entities maybe isn't as directly useful as finding only the relevant ones.