Open iibarant opened 2 years ago
Hi @iibarant
I'm well thanks! I hope you are too. You're still very busy, it seems.
Regarding your question — I don't fully understand. Perhaps you could give me an example with data and a demonstration of how you would expect multiple n-grams to work.
Yes, still busy ...
Regarding an example: let's say the text is 'JOHN SMITH'.
The 3-grams (default) = ['JOH', 'OHN', 'HN ', 'N S', ' SM', 'SMI', 'MIT', 'ITH']
What if we have 2-grams + 3-grams + 4-grams:
['JO', 'OH', 'HN', 'N ', ' S', 'SM', 'MI', 'IT', 'TH'] +
['JOH', 'OHN', 'HN ', 'N S', ' SM', 'SMI', 'MIT', 'ITH'] +
['JOHN', 'OHN ', 'HN S', 'N SM', ' SMI', 'SMIT', 'MITH']
We would have more components (features): ['JO', 'OH', 'HN', 'N ', ' S', 'SM', 'MI', 'IT', 'TH', 'JOH', 'OHN', 'HN ', 'N S', ' SM', 'SMI', 'MIT', 'ITH', 'JOHN', 'OHN ', 'HN S', 'N SM', ' SMI', 'SMIT', 'MITH'] to get the TF-IDF for.
I believe this would be helpful for short text comparison / matching and hope this question makes sense.
Thank you, iibarant
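The feature list in the example above can be reproduced with a few lines of plain Python. This is only a minimal sketch of the idea; the helper names `char_ngrams` and `multi_ngrams` are hypothetical and are not part of string_grouper or red_string_grouper.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def multi_ngrams(text, sizes):
    """Concatenate the n-gram lists for every size in `sizes`, in order."""
    features = []
    for n in sizes:
        features.extend(char_ngrams(text, n))
    return features


features = multi_ngrams("JOHN SMITH", [2, 3, 4])
print(features)
print(len(features))  # 9 two-grams + 8 three-grams + 7 four-grams
```

A string of length L yields L - n + 1 n-grams of size n, so adding sizes grows the feature list roughly linearly in the number of sizes.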
On Tuesday, November 23, 2021, 01:42:37 p.m. EST, ParticularMiner ***@***.***> wrote:
Hi @iibarant
I'm well thanks! I hope you are too. You're still very busy, it seems.
Regarding your question — I don't fully understand. Perhaps you could give me an example with data and a demonstration of how you would expect multiple n-grams to work.
Hi @iibarant
See if the latest version (red-string-grouper-0.1.1) does what you want.
To install, execute the following command:
pip install git+https://github.com/ParticularMiner/red_string_grouper.git@master
red-string-grouper should now accept multiple ngram-sizes, such as:
ngram_size=[3, 5]
By the way, any chance I'll get a fitting example using multiple DataFrames to add to red_string_grouper's README.md?
@iibarant
I forgot to mention: could you also let me know whether the warnings you used to experience are still appearing?
Hi, I will get back to you on that by the end of Wednesday, November 24, New York time.
On Tuesday, November 23, 2021, 03:36:55 p.m. EST, ParticularMiner @.***> wrote:
Hi @iibarant
See if the latest version (red-string-grouper-0.1.1) does what you want.
To install, execute the following command:
pip install @.***
red-string-grouper should now accept multiple ngram-sizes, such as:
ngram_size=[3, 5]
By the way, any chance I'll get a fitting example using multiple DataFrames to add to red_string_grouper's README.md?
No problem @iibarant .
You've likely already noticed that your email server cleverly masks out any email addresses found in your messages. So, for instance, the "pip install" command I gave you is not displayed as I originally wrote it (because of the @ symbol in it). I think the same thing happened to your own email address in an earlier message.
Hi @ParticularMiner,
Hope you are doing well.
I'm working on the same project again and have a question / suggestion: would it be possible to use multiple n-gram sizes to get more features? Currently we have the following parameter: ngram_size: The amount of characters in each n-gram. Default is 3.
What if ngram_size could be a list like [2, 3, 4], so that the vector components are the 2-grams plus the 3-grams plus the 4-grams?
What do you think?
By the way, the string_grouper approach is really good in terms of speed and efficiency. Great work!
Thank you, iibarant
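To illustrate what the extra vector components would mean for matching, here is a rough, self-contained sketch in plain Python. It is not how string_grouper computes similarities: the function names, the sample strings, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def multi_ngrams(text, sizes):
    """All character n-grams of `text` for every n in `sizes` (hypothetical helper)."""
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, sizes):
    """Build L2-normalised TF-IDF vectors (as dicts) over the combined n-grams."""
    counts = [Counter(multi_ngrams(s, sizes)) for s in strings]
    n_docs = len(strings)
    # Document frequency: in how many strings each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    # Smoothed IDF; an assumed formula, picked only to keep weights positive.
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    vectors = []
    for c in counts:
        vec = {g: tf * idf[g] for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ["JOHN SMITH", "JON SMITH", "JANE SMYTHE"]  # made-up sample data
for sizes in ([3], [2, 3, 4]):
    vecs = tfidf_vectors(names, sizes)
    print(sizes, round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

Running this with a single size and with several sizes side by side shows how the similarity scores shift once the shorter and longer n-grams are added as features.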