Recognize synonym phrases with a plural term

OntoGene / OGER

GNU Affero General Public License v3.0


Recognize synonym phrases with a plural term #5

Closed: hrshdhgd closed this issue 3 years ago

hrshdhgd commented 3 years ago

Hello,

I was running a document through OGER for NER using the Human Phenotype Ontology (HP). The document contained the phrase "restrictive deficit on pulmonary function test", which went untagged even though the dictionary lists the phrase "restrictive deficit on pulmonary function tests" as a synonym with a CURIE (HP:0002091). Notice that the only difference is the 's' at the end of the word 'test'. While debugging this, I added an 's' (test => tests), which fixed the issue: OGER then recognized the phrase.

I was under the impression that OGER accounted for this through lemmatization of words in a phrase too. Please correct me if I'm wrong.

lfurrer commented 3 years ago

There is stemming, but it isn't turned on by default, I think. There are a number of knobs to control the fuzzy matching; see the wiki.
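To illustrate why stemming would help with this phrase, here is a minimal sketch. It is not OGER's code: `crude_stem` and `normalize` are hypothetical names, and `crude_stem` is only a stand-in that strips a trailing plural 's', whereas a real Porter or Lancaster stemmer applies many more rules.

```python
def crude_stem(token: str) -> str:
    """Hypothetical stand-in for a stemmer: strip a plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(phrase: str) -> tuple:
    """Lowercase, split on whitespace, and stem each token."""
    return tuple(crude_stem(tok) for tok in phrase.lower().split())


dictionary_term = "restrictive deficit on pulmonary function tests"
document_phrase = "restrictive deficit on pulmonary function test"

# The raw strings differ, but their normalized forms coincide.
exact_match = dictionary_term == document_phrase
stemmed_match = normalize(dictionary_term) == normalize(document_phrase)
print(exact_match, stemmed_match)
```

The point is that both the dictionary entry and the document text have to pass through the same normalization before they are compared.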

hrshdhgd commented 3 years ago

Thanks @lfurrer! I tried termlist_normalize = stem-Porter and termlist_normalize = stem-Lancaster. Most of the tagged tokens were common to both results, but some were present in one and not in the other (I'm guessing those depend on the stemmer used). Neither of them tagged the phrase in question, though. Thoughts?

lfurrer commented 3 years ago

Right. It definitely should match (there's no restriction on the number of tokens in a term).

One more thing to check: When OGER indexes the dictionary, it caches the index on disk (by default next to the original dictionary, I think). This is convenient for the time saving on repeated calls, but the problem is that there is no proper cache invalidation in place. In particular, when changing the normalization settings, changes may not take effect because of this. With the termlist-force-reload option, you can manually invalidate the cache.
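The stale-cache effect can be sketched generically as follows. This is not OGER's implementation: `build_index`, `load_index`, the JSON cache file, and the `force_reload` parameter are all hypothetical names chosen for illustration, and `strip_plural` is a crude stand-in for stemming.

```python
import json
import os
import tempfile


def build_index(terms, normalize):
    """Map each normalized term to its original form."""
    return {normalize(t): t for t in terms}


def load_index(terms, normalize, cache_path, force_reload=False):
    """Return the cached index if one exists, unless force_reload is set.

    The cache is keyed on the path alone, so a change of `normalize`
    goes unnoticed: this is the missing invalidation.
    """
    if os.path.exists(cache_path) and not force_reload:
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    index = build_index(terms, normalize)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index


def lowercase(term):
    return term.lower()


def strip_plural(term):
    """Crude stand-in for stemming: drop trailing 's' characters."""
    return term.lower().rstrip("s")


terms = ["pulmonary function tests"]
cache_path = os.path.join(tempfile.mkdtemp(), "terms.cache.json")

first = load_index(terms, lowercase, cache_path)      # builds and caches
stale = load_index(terms, strip_plural, cache_path)   # old index is reused
fresh = load_index(terms, strip_plural, cache_path, force_reload=True)

print("pulmonary function test" in stale)
print("pulmonary function test" in fresh)
```

In this sketch the second call silently returns the index built with the old normalization, and only the forced reload picks up the new setting.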

Another possibility is that there are invisible characters (like soft hyphens) in one of the terms (document or dictionary), rendering them unequal.
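One way to check for such characters is to list every code point in the Unicode "format" category (Cf), which includes the soft hyphen and zero-width characters. A minimal sketch using only the standard library; `invisible_chars` is a hypothetical helper, and the example strings are made up:

```python
import unicodedata


def invisible_chars(text: str):
    """List (position, code point, name) for format characters (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


clean = "pulmonary function test"
dirty = "pul\u00admonary function test"  # contains a soft hyphen

print(clean == dirty)
print(invisible_chars(dirty))
```

Running the helper over both the dictionary entry and the document phrase shows whether either one carries hidden characters.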

Otherwise it's an obscure bug.

hrshdhgd commented 3 years ago

That did the trick! The termlist-force-reload option fixed it, along with termlist_normalize = stem-Porter! Thank you so much for patiently helping me out, @lfurrer!