ParticularMiner / red_string_grouper

Record Equivalence Discoverer based on String Grouper
MIT License

Question / suggestion to use multiple n-grams to get more features. #4

Open iibarant opened 2 years ago

iibarant commented 2 years ago

Hi @ParticularMiner,

Hope you are doing well.

I am working on the same project again and have a question / suggestion: would it be possible to use multiple n-gram sizes to get more features? Currently we have the following parameter: ngram_size: The amount of characters in each n-gram. Default is 3.

What if we passed the n-gram sizes as a list like [2, 3, 4] and got more vector components: the 2-grams plus the 3-grams plus the 4-grams?

What do you think?

By the way, the string_grouper approach is really good in terms of speed and efficiency. Great work!

Thank you, iibarant

ParticularMiner commented 2 years ago

Hi @iibarant

I'm well thanks! I hope you are too. You're still very busy, it seems.

Regarding your question — I don't fully understand. Perhaps you could give me an example with data and a demonstration of how you would expect multiple n-grams to work.

iibarant commented 2 years ago

Yes, still busy ...

Regarding an example: let's say the text is 'JOHN SMITH'.

The 3-grams (default) = ['JOH', 'OHN', 'HN ', 'N S', ' SM', 'SMI', 'MIT', 'ITH']

What if we have 2-grams + 3-grams + 4-grams:

['JO', 'OH', 'HN', 'N ', ' S', 'SM', 'MI', 'IT', 'TH'] + ['JOH', 'OHN', 'HN ', 'N S', ' SM', 'SMI', 'MIT', 'ITH'] + ['JOHN', 'OHN ', 'HN S', 'N SM', ' SMI', 'SMIT', 'MITH']

We would have more components (features): ['JO', 'OH', 'HN', 'N ', ' S', 'SM', 'MI', 'IT', 'TH', 'JOH', 'OHN', 'HN ', 'N S', ' SM', 'SMI', 'MIT', 'ITH', 'JOHN', 'OHN ', 'HN S', 'N SM', ' SMI', 'SMIT', 'MITH'] to get the TF-IDF for.

I believe this would be helpful for short text comparison / matching and hope this question makes sense.
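To make the idea concrete, here is a minimal sketch in plain Python of how the combined feature list above could be produced. The helper names char_ngrams and multi_ngrams are hypothetical and used only for illustration; this is not how string_grouper or red_string_grouper builds its features internally.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order."""
    # A string of length L has L - n + 1 n-grams (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def multi_ngrams(text, sizes):
    """Concatenate the n-gram lists for every size in `sizes` (hypothetical helper)."""
    features = []
    for n in sizes:
        features.extend(char_ngrams(text, n))
    return features


features = multi_ngrams('JOHN SMITH', [2, 3, 4])
print(features)
print(len(features))  # 9 two-grams + 8 three-grams + 7 four-grams = 24
```

Each n-gram in the combined list would then be treated as one term when computing TF-IDF, so a short string like 'JOHN SMITH' contributes 24 terms instead of 8.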

Thank you, iibarant


ParticularMiner commented 2 years ago

Hi @iibarant

See if the latest version (red-string-grouper-0.1.1) does what you want.

To install, execute the following command:

pip install git+https://github.com/ParticularMiner/red_string_grouper.git@master

red-string-grouper should now accept multiple ngram-sizes, such as:

ngram_size=[3, 5]

By the way, any chance I'll get a fitting example using multiple DataFrames to add to red_string_grouper's README.md?

ParticularMiner commented 2 years ago

@iibarant

I forgot to mention: can you also let me know whether the warnings you used to experience are still appearing?

iibarant commented 2 years ago

Hi, I will get back to you on that by the end of Wednesday, November 24, New York time.

On Tuesday, November 23, 2021, 03:36:55 p.m. EST, ParticularMiner @.***> wrote:

Hi @iibarant

See if the latest version (red-string-grouper-0.1.1) does what you want.

To install, execute the following command:

pip install @.***

red-string-grouper should now accept multiple ngram-sizes, such as:

ngram_size=[3, 5]

By the way, any chance I'll get a fitting example using multiple DataFrames to add to red_string_grouper's README.md?


ParticularMiner commented 2 years ago

No problem @iibarant .

You've likely already noticed that your email server cleverly masks out any email addresses found in your messages. So, for instance, the "pip install" command I gave you is not displayed as I originally wrote it (because of the presence of the @ symbol in it). I think the same thing happened to your own email address in an earlier message.