get_similarity issue with overlap

GregSilverman commented 5 years ago

I was curious why the overlap similarity is so slow. It does nothing beyond computing the intersection, and all the other metrics are just variants of it, so I am not sure why this would be the case.

I have seen this slowness on two different corpora (i2b2 and mipacq), and I am about to test it on a third, internal corpus.

Any insight would be most welcome.

import numpy  # used by the 'cosine' branch


def get_similarity(x, y, n, similarity_name):
    if len(x) == 0 or len(y) == 0:
        # we define similarity between two strings
        # to be 0 if any of the two is empty.
        return 0.

    # make_ngrams is a helper that is not shown in this snippet
    X, Y = set(make_ngrams(x, n)), set(make_ngrams(y, n))
    intersec = len(X.intersection(Y))

    if similarity_name == 'dice':
        return 2 * intersec / (len(X) + len(Y))
    elif similarity_name == 'jaccard':
        return intersec / (len(X) + len(Y) - intersec)
    elif similarity_name == 'cosine':
        return intersec / numpy.sqrt(len(X) * len(Y))
    elif similarity_name == 'overlap':
        return intersec
    else:
        msg = 'Similarity {} not recognized'.format(similarity_name)
        raise TypeError(msg)
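
For anyone who wants to check the function in isolation, here is a rough sketch that times each metric on a single pair of strings. It makes two assumptions: `make_ngrams` below is a hypothetical stand-in (plain character n-grams) and not the actual QuickUMLS helper, and `time_metrics` is a made-up name used only for illustration. `math.sqrt` replaces `numpy.sqrt` so that the sketch needs only the standard library.

```python
import math
import timeit


def make_ngrams(s, n):
    # Hypothetical stand-in for the real helper: plain character n-grams.
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def get_similarity(x, y, n, similarity_name):
    # Same logic as the function quoted above, with math.sqrt
    # in place of numpy.sqrt.
    if len(x) == 0 or len(y) == 0:
        return 0.

    X, Y = set(make_ngrams(x, n)), set(make_ngrams(y, n))
    intersec = len(X.intersection(Y))

    if similarity_name == 'dice':
        return 2 * intersec / (len(X) + len(Y))
    elif similarity_name == 'jaccard':
        return intersec / (len(X) + len(Y) - intersec)
    elif similarity_name == 'cosine':
        return intersec / math.sqrt(len(X) * len(Y))
    elif similarity_name == 'overlap':
        return intersec
    raise TypeError('Similarity {} not recognized'.format(similarity_name))


def time_metrics(x, y, n=3, repeats=10000):
    # Wall-clock seconds spent on `repeats` calls, per metric.
    names = ('dice', 'jaccard', 'cosine', 'overlap')
    return {
        name: timeit.timeit(lambda: get_similarity(x, y, n, name),
                            number=repeats)
        for name in names
    }


if __name__ == '__main__':
    for name, seconds in time_metrics('heart attack', 'heart attacks').items():
        print('{:8s} {:.4f}s'.format(name, seconds))
```

If the four timings come out roughly equal, the slowdown is probably outside this function. One thing the sketch does make visible is that 'overlap' returns a raw count of shared n-grams, while the other three metrics return a value between 0 and 1.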

soldni commented 5 years ago

Hi Greg!

I looked at this briefly and I cannot find any reason why overlap should be slower. It might have something to do with how the similarity is implemented in simstring, the underlying matching library, but I couldn't find any indication of that when I looked at its code.

Since you have the data ready, could you profile the code and check where the difference in performance comes from?

-Luca
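
One possible way to do the profiling suggested above is the standard-library `cProfile` module. The sketch below wraps any callable and returns a report sorted by cumulative time. `profile_call` is a hypothetical helper name, and the usage shown in the trailing comments (`matcher`, `text`) assumes an already constructed matcher object, one per similarity setting.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative').print_stats(top)
    return result, stream.getvalue()


# Hypothetical usage, assuming `matcher_overlap` and `matcher_jaccard`
# are matchers built with different similarity settings and `text` is
# a document from the corpus:
#
#     _, report_overlap = profile_call(matcher_overlap.match, text)
#     _, report_jaccard = profile_call(matcher_jaccard.match, text)
#     print(report_overlap)
#     print(report_jaccard)
```

Comparing the two reports side by side should show whether the extra time is spent in `get_similarity`, in the underlying matching library, or somewhere else.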

GregSilverman commented 5 years ago

Howdy Luca! Will do... it might be a bit before I get to it, but it will happen.

Grazie!



Georgetown-IR-Lab / QuickUMLS

get_similarity issue with overlap #45