AdamHollings opened 1 year ago
To check for personally identifiable information (PII) in a string using spaCy, you can use the Matcher class and define a pattern to match PII entities such as names, emails, phone numbers, etc.
Here is an example of how you can do this:
import spacy
from spacy.matcher import Matcher

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Define the pattern: a list of token descriptions.
# LIKE_EMAIL matches tokens that look like an email address.
pattern = [{"LIKE_EMAIL": True}]

# Add the pattern to the matcher (spaCy v3 API: patterns are passed as a list)
matcher.add("EMAIL_PATTERN", [pattern])

# Define a function to extract the matched emails
def extract_emails(doc):
    emails = []
    # Iterate over the matches
    for match_id, start, end in matcher(doc):
        # Get the matched span
        span = doc[start:end]
        # Append the matched email to the list
        emails.append(span.text)
    return emails

# Define the input string
string = "My email is example@gmail.com and my friend's email is friend@yahoo.com"

# Create a spaCy document from the string
doc = nlp(string)

# Extract the emails from the document
emails = extract_emails(doc)

# Print the emails
print(emails)
Yes, the spaCy Matcher can search for multiple patterns at the same time. You can define multiple patterns and add them to the matcher using the add method.
Here is an example of how you can use the Matcher to search for multiple patterns in a string:
import spacy
from spacy.matcher import Matcher

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Define the patterns: each pattern is a list of token descriptions
email_pattern = [{"LIKE_EMAIL": True}]
# The tokenizer splits "123-456-7890" on the hyphens, so match the
# pieces by shape (this only covers the ddd-ddd-dddd form)
phone_pattern = [{"SHAPE": "ddd"}, {"ORTH": "-"},
                 {"SHAPE": "ddd"}, {"ORTH": "-"},
                 {"SHAPE": "dddd"}]

# Add both patterns under one rule (spaCy v3 API)
matcher.add("EMAIL_PHONE_PATTERN", [email_pattern, phone_pattern])

# Define a function to extract the matched entities
def extract_entities(doc):
    entities = []
    # Iterate over the matches
    for match_id, start, end in matcher(doc):
        # Get the matched span
        span = doc[start:end]
        # Append the matched entity to the list
        entities.append(span.text)
    return entities

# Define the input string
string = "My email is example@gmail.com and my phone number is 123-456-7890."

# Create a spaCy document from the string
doc = nlp(string)

# Extract the entities from the document
entities = extract_entities(doc)

# Print the entities
print(entities)
Yes, you can use multiple language models in spaCy at the same time by loading them separately and using them to process the text.
Here is an example of how you can do this:
import spacy

# Load the English and German models
nlp_en = spacy.load("en_core_web_sm")
nlp_de = spacy.load("de_core_news_sm")

# Define a function to process text in English
def process_en(text):
    # Create a spaCy document from the text
    doc = nlp_en(text)
    # Print the language and the entities in the text
    # (doc.lang_ is the language code; doc.lang is its hash)
    print(f"Language: {doc.lang_}")
    print("Entities:")
    for ent in doc.ents:
        print(ent.text, ent.label_)

# Define a function to process text in German
def process_de(text):
    # Create a spaCy document from the text
    doc = nlp_de(text)
    # Print the language and the entities in the text
    print(f"Language: {doc.lang_}")
    print("Entities:")
    for ent in doc.ents:
        print(ent.text, ent.label_)

# Process some text in English
process_en("My name is John and I live in New York.")
# Process some text in German
process_de("Mein Name ist John und ich wohne in Berlin.")
This seems like the way to go initially. Keep it simple.
Taking the key part of this, we can get a dictionary of the counts of entities like so:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

ent_counts_dict = pd.Series([ent.label_ for ent in doc.ents]).value_counts().to_dict()
ent_counts_dict
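If the pandas dependency is unwanted, the same counts can be built with the standard library. A minimal sketch (`ent_counts` is a hypothetical name, not something already in the codebase):

```python
from collections import Counter

def ent_counts(doc):
    # Count entity labels in a spaCy Doc, equivalent to the
    # pd.Series(...).value_counts().to_dict() chain above
    return dict(Counter(ent.label_ for ent in doc.ents))
```

This behaves the same as the pandas version but avoids importing pandas just for one dictionary.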
The matcher seems the best way to go. You can define what it should look for and it will return a list of just those matches, rather than every entity that occurs in the string.
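To report which rule fired (not just the matched text), the `match_id` can be looked up in the vocab's string store. A minimal sketch, assuming a blank pipeline is enough since rule-based matching needs no trained model; `find_pii` is a hypothetical name:

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline suffices for rule-based matching
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Separate rules so each match reports which kind of PII it is
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])
matcher.add("URL", [[{"LIKE_URL": True}]])

def find_pii(text):
    doc = nlp(text)
    # Map each match_id back to its rule name via the string store
    return [(nlp.vocab.strings[match_id], doc[start:end].text)
            for match_id, start, end in matcher(doc)]

print(find_pii("Contact me at example@gmail.com or via https://example.com"))
```

Keeping one rule per PII type means the output pairs each hit with a label, which is closer to what the function in the entity recognition branch needs than a flat list of strings.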
Have added the function to SRC in the entity recognition branch. Followed basically the same format as your suggestion above, although I think listing all entities maybe isn't as directly useful as just finding relevant entities.
make a function which:
Spacy would be a good place to start