MrCsabaToth / SOEMPI

Secure Enterprise Master Patient Index

Asking permission for using `name_to_nick.csv` #29

Closed DavidHarar closed 1 year ago

DavidHarar commented 1 year ago

Hello,

I wrote an article for Medium in which I trained multiple Siamese neural network models to assess whether a nickname is plausible given a name. One of the datasets I used is name_to_nick.csv from this great repository. I saw that the repository isn't released under a public license, so I would like to ask for permission to publish the results of the model that was trained using name_to_nick.csv, among other data sources :)

Thanks, David

MrCsabaToth commented 1 year ago

Thanks for bringing this to my attention. I was not aware that I didn't have a definitive LICENSE file. I'll add one. I'm trying to find out the source of those CSVs (name to nick and nick to name). There is a source, but I cannot find it. There was some list, but I heavily processed it and curated the two-way CSVs.

MrCsabaToth commented 1 year ago

I'm a little embarrassed that I don't have the original source of those CSVs. I think I found your Medium publication, https://towardsdatascience.com/. Which one is the article? I think I've read some articles there before; I'm interested in ML/AI.

DavidHarar commented 1 year ago

Thanks so much for looking into that and for being responsive! I did submit to towardsdatascience, but my blog post is in review now. My profile is here and you're more than welcome to follow :) In any case, what I did in the blog post at hand was to build multiple Siamese network architectures, using name-nickname pairs as inputs, so that these networks can hopefully tell which pairs are more plausible. That is, which nicknames are more plausible given a name. I explored entering the pairs both as text and as spectrograms (see here) generated with Google's text to speech.
Once the article is published I will gladly send you a link :)

DavidHarar commented 1 year ago

Hi, the blog post was published, it is here :)

MrCsabaToth commented 1 year ago

Thanks for the link, I'll read the article later. However, I need to write something down.

Nicknames are interesting. As you know, a standard similarity metric is the classic edit distance, but it doesn't take into account how far a difference is from the beginning of the word. Statistically, however, typos tend to happen later in a word rather than towards the beginning (I guess for psychological reasons: we tend to keep the beginning of a word together better, and a mistake at the beginning stands out more than one later in the word, so the brain picks it up sooner and corrects it). Good similarity metrics therefore account for this and penalize differences at the beginning much more than ones occurring later in the string.
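
To illustrate the point about edit distance, here is a minimal sketch of the classic Levenshtein distance. It is not code from SOEMPI, and the function name `edit_distance` is just an illustrative choice. A substitution at the first character and one at the last character cost exactly the same:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insert, delete, substitute each cost 1)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


# The position of the difference does not matter to the edit distance.
print(edit_distance("martha", "xartha"))  # difference at the start -> 1
print(edit_distance("martha", "marthx"))  # difference at the end   -> 1
```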

Coincidentally, this plays along well with nicknames. Nicknames are often shortened versions of the original, such as Christopher - Chris or Phillip - Phil. The similarity metric I preferred in SOEMPI was Jaro-Winkler, which has the property I mentioned (it uses a prefix scale p that gives more favourable ratings to strings that match from the beginning, up to a set prefix length l), and on top of that it's also faster than the edit distance.
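
For reference, a minimal pure-Python sketch of Jaro-Winkler follows. It is not the SOEMPI implementation; it assumes the commonly used defaults of p = 0.1 and a maximum prefix length of 4, and the function names are illustrative. The Jaro score j is boosted to j + l * p * (1 - j), where l is the length of the shared prefix:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


# Shortened nicknames share a prefix, so they get the Winkler boost.
print(jaro_winkler("christopher", "chris"))
# A difference at the start scores lower than the same difference at the end.
print(jaro_winkler("martha", "xartha"), jaro_winkler("martha", "marthx"))
```

With these defaults, plain Jaro rates "xartha" and "marthx" equally against "martha"; only the prefix term separates them.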

However, nicknames can be tricky too when they don't resemble a shortened version of the original name at all, such as William - Bill or Robert - Bob, and that's the reason I resorted to introducing some name - nickname logic routines into my imperative logic.
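
A rough sketch of what such a routine could look like is below. It is not the actual SOEMPI code: the tiny `NAME_TO_NICKS` table, the `given_names_match` function, and the 0.85 threshold are all hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for Jaro-Winkler to keep the example short. The idea is to check a curated name - nickname table first and fall back to string similarity only when the table has no answer:

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written sample; a real table would be loaded from
# curated data such as a name-to-nickname CSV.
NAME_TO_NICKS = {
    "william": {"bill", "billy", "will"},
    "robert": {"bob", "bobby", "rob"},
    "christopher": {"chris"},
}


def given_names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True if two given names plausibly refer to the same name."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True
    # Table lookup in both directions catches pairs with no string resemblance.
    if b in NAME_TO_NICKS.get(a, ()) or a in NAME_TO_NICKS.get(b, ()):
        return True
    # Fallback: generic string similarity (stand-in for Jaro-Winkler).
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(given_names_match("William", "Bill"))    # found via the table
print(given_names_match("Phillip", "Philip"))  # found via string similarity
print(given_names_match("William", "Robert"))  # neither rule applies
```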

When there are multiple datasets from various sources, human names are involved, and matches are possible, handling nicknames is crucial, because nickname swaps will happen.