capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0

Issues with Transfer Learning for Default Labeler #1155

Open DylanVig opened 2 months ago

DylanVig commented 2 months ago

I tried to create a custom labeler by following the transfer learning example in the documentation. I attempted to add three labels to the labeler: Name, Datetime (which is already a category in the default labeler), and Nationality. Here is my code for training the labeler, as well as the CSV I used to train it: large_fake_data.csv

[Screenshot of the training code: Screenshot 2024-06-25 at 3 06 44 PM]

I then ran this custom labeler on a CSV with six columns: Name, Datetime, Phone Number, SSN, Email, and Nationality. It correctly identified the names, datetimes, and nationalities. However, it mislabeled the phone numbers, SSNs, and emails (usually identifying email and SSN and nationality and phone number as datetime). When I run the same file with the default labeler, it seems to pick up on those three fields just fine. Is there a problem with how I am programming my labeler, how I am training it, or something else? Here is my code for testing my labeler, as well as the CSV I used to get these results: three_cat_labeler_test_data.csv

[Screenshot of the testing code: Screenshot 2024-06-25 at 3 11 02 PM]

Thank you!
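
For anyone trying to reproduce this, the comparison described above can be written down in plain Python, without DataProfiler. This is only a rough sketch. The helper names (`out_of_vocabulary`, `mismatches`) are hypothetical, the label strings are assumed spellings, and the `predicted` dictionary is a placeholder, not the actual output from the run in this issue. It shows which expected labels fall outside the three labels the custom labeler was trained on, and which columns come back with a label other than the expected one.

```python
# Sketch only: helper names, label spellings, and the `predicted` values are
# illustrative assumptions, not DataProfiler API calls or real results.

def out_of_vocabulary(expected_by_column, trained_labels):
    """Return {column: expected_label} for labels the labeler was not trained on."""
    trained = set(trained_labels)
    return {
        column: label
        for column, label in expected_by_column.items()
        if label not in trained
    }


def mismatches(expected_by_column, predicted_by_column):
    """Return {column: (expected, predicted)} for every column labeled incorrectly."""
    return {
        column: (expected, predicted_by_column.get(column))
        for column, expected in expected_by_column.items()
        if predicted_by_column.get(column) != expected
    }


# The three labels the custom labeler was trained on (spellings assumed).
trained_labels = ["NAME", "DATETIME", "NATIONALITY"]

# The six columns of the test CSV and the label each one should receive.
expected = {
    "Name": "NAME",
    "Datetime": "DATETIME",
    "Phone Number": "PHONE_NUMBER",
    "SSN": "SSN",
    "Email": "EMAIL_ADDRESS",
    "Nationality": "NATIONALITY",
}

# Placeholder predictions for illustration; replace with the labeler's output.
predicted = {
    "Name": "NAME",
    "Datetime": "DATETIME",
    "Phone Number": "DATETIME",
    "SSN": "DATETIME",
    "Email": "DATETIME",
    "Nationality": "NATIONALITY",
}

print("Expected labels outside the trained set:", out_of_vocabulary(expected, trained_labels))
print("Mislabeled columns:", mismatches(expected, predicted))
```

Swapping the placeholder `predicted` dictionary for the per-column labels from the custom and default labelers would give a side-by-side view of where the two diverge.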