VellankiSaiKiran / PII-Detection-Using-BERT-and-ELECTRA


Request for Info #1

Open nikhilkomakula opened 1 month ago

nikhilkomakula commented 1 month ago

Hello Sai Kiran,

I came across your Medium blog on "Advanced PII Detection in Educational Data Using BERT and ELECTRA". Thanks for the great article; going through it was a good learning experience. I am currently trying to execute the code myself to get familiar with it.

As per the Kaggle dataset, there are only two JSON files (train/test), but in your Jupyter notebooks on GitHub I see references to a few .csv files which I don't think are available on GitHub or on Kaggle.

Also, a couple of questions regarding "PII Detection using BERT.ipynb":

  1. The original Kaggle dataset (train.json) has only ~7K data points, but in your notebook I found over ~11K. May I know the source of the additional data?
  2. Would you be able to provide reference code showing how to run inference with the saved model to detect PII in a plain English statement or paragraph?

Thanks in advance.

VellankiSaiKiran commented 1 month ago

Hello Nikhil,

Thank you for your kind words and for going through the blog! I’m glad you found it useful. Let me address your questions:

Regarding the dataset: The additional data you mentioned is actually from a second dataset available on Kaggle. I combined this dataset with the original one to create a larger, more comprehensive dataset. Given the limitations of the small dataset, I believed joining the two would make the model more effective. Due to file size limitations, I wasn’t able to upload the combined dataset to GitHub. However, I can share it with you via a Google Drive link if that would help.

Drive link
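
If it helps to reproduce the merge yourself, the combination was just a straightforward concatenation of the two files. Here's a rough sketch (the second file's name is a placeholder, and it assumes both files follow the same JSON schema as the competition's train.json):

```python
import json
import pandas as pd

# Original competition file plus the external Kaggle file (second filename is a placeholder)
with open("train.json") as f:
    original = json.load(f)
with open("external_pii_dataset.json") as f:  # assumed name for the second dataset
    external = json.load(f)

# Both files are assumed to share the same schema (full_text, tokens, labels, ...)
combined = pd.concat([pd.DataFrame(original), pd.DataFrame(external)], ignore_index=True)
print(len(original), len(external), len(combined))

# Save the merged data for the notebook to load
combined.to_json("combined_train.json", orient="records")
```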

Inference using the saved model: I’d be happy to provide reference code for running inference using the saved BERT model to detect PII from plain text inputs. Here’s a basic example to get you started:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the saved model and tokenizer
model = BertForSequenceClassification.from_pretrained('path_to_saved_model')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to perform inference
def predict_pii(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()
    return predicted_class

# Example usage
text = "Your example sentence or paragraph goes here."
prediction = predict_pii(text)
print(f"PII detected: {bool(prediction)}")
```
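
The example above treats the task as a sentence-level yes/no classification. If your saved checkpoint was instead trained for token-level labels (the way the competition data is annotated), the inference would use a token-classification head and map each predicted token back to a span of the input. Here is a rough sketch of that variant; the checkpoint path is a placeholder, and it assumes the model was saved as a BertForTokenClassification with an id2label mapping in its config where "O" marks non-PII tokens:

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Assumed: a token-classification checkpoint saved earlier (path is a placeholder)
model = BertForTokenClassification.from_pretrained('path_to_saved_token_model')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model.eval()

def predict_pii_tokens(text):
    # Tokenize and keep character offsets so predictions can be mapped back to the input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       return_offsets_mapping=True, max_length=128)
    offsets = inputs.pop("offset_mapping")[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=-1)[0]

    # Collect the character spans predicted as PII (label "O" means non-PII)
    pii_spans = []
    for (start, end), pred in zip(offsets.tolist(), predictions.tolist()):
        label = model.config.id2label[pred]
        if start != end and label != "O":
            pii_spans.append((text[start:end], label))
    return pii_spans

# Example usage
print(predict_pii_tokens("My name is John Doe and my email is john@example.com."))
```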

Thanks!

nikhilkomakula commented 1 month ago

Thanks for your prompt response @VellankiSaiKiran.

A couple of follow-up questions:

  1. Could you provide the direct Kaggle link to the second dataset that you used?
  2. You mentioned in your article that you were able to achieve 99% with ELECTRA. I'm wondering why you then used BERT as well. Is it just for experimentation?

Thanks again.

VellankiSaiKiran commented 1 month ago

Thanks for your follow-up questions! @nikhilkomakula

  1. Unfortunately, I found the second dataset through one of the notebooks shared on Kaggle under the same competition, but I didn’t save the original link. I recommend checking the notebooks section in the competition for relevant datasets.

  2. As for using BERT, we wanted to experiment with different models and see how their performance compares. ELECTRA achieved excellent results, but trying BERT was part of exploring the robustness of different architectures.

Thanks