Closed hrshdhgd closed 3 years ago
There is stemming, but it isn't turned on by default, I think. There are a number of knobs to control the fuzzy matching, see the wiki.
Thanks @lfurrer! I used termlist_normalize = stem-Porter and termlist_normalize = stem-Lancaster. Both gave me results with the majority of tokens in common, but with some present in one and not in the other (I'm guessing those depend on the stemmer used). However, neither of them tagged the phrase in question. Thoughts?
Right. It definitely should match (there's no restriction on the number of tokens in a term).
One more thing to check: When OGER indexes the dictionary, it caches the index on disk (by default next to the original dictionary, I think). This is convenient for the time saving on repeated calls, but the problem is that there is no proper cache invalidation in place. In particular, when changing the normalization settings, changes may not take effect because of this. With the termlist-force-reload option, you can manually invalidate the cache.
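To make the stale-cache problem concrete, here is a minimal sketch of a cache key that changes whenever the normalization setting changes. This is not OGER's implementation. The function name cache_key and the file name hp_terms.tsv are hypothetical, and only the two setting values come from this thread.

```python
import hashlib
import json


def cache_key(dictionary_path, settings):
    """Derive a cache key from the dictionary path AND the settings.

    Hypothetical helper for illustration only; not part of OGER.
    """
    payload = json.dumps(
        {"path": dictionary_path, "settings": settings}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same dictionary file (hypothetical name), two different normalization settings.
key_porter = cache_key("hp_terms.tsv", {"termlist_normalize": "stem-Porter"})
key_lancaster = cache_key("hp_terms.tsv", {"termlist_normalize": "stem-Lancaster"})

# The keys differ, so an index built with one stemmer is never reused for the other.
print(key_porter != key_lancaster)
```

A cache keyed on the dictionary path alone would return the old index after a settings change, which is the situation a forced reload works around.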
Another possibility is that there are invisible characters (like soft hyphens) in one of the terms (document or dictionary), rendering them unequal.
Otherwise it's an obscure bug.
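To rule out the invisible-character possibility, a quick check with Python's standard unicodedata module can help. This is a minimal sketch: the helper names find_invisible and strip_invisible are hypothetical, the sample string with a soft hyphen is made up for illustration, and only Unicode format characters (category Cf, which includes the soft hyphen and zero-width space) are covered.

```python
import unicodedata


def find_invisible(text):
    """Return (index, code point, name) for each format (Cf) character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_invisible(text):
    """Remove all format (Cf) characters from text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Made-up example: a soft hyphen (U+00AD) hidden inside "function".
doc_term = "pulmonary func\u00adtion test"
dict_term = "pulmonary function test"

print(doc_term == dict_term)                   # compares the raw strings
print(find_invisible(doc_term))                # lists any hidden characters
print(strip_invisible(doc_term) == dict_term)  # compares after cleaning
```

Running find_invisible over both the document text and the dictionary entry shows whether either side carries hidden characters.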
That did the trick! The termlist-force-reload option fixed it, along with termlist_normalize = stem-Porter! Thank you so much for patiently helping me out, @lfurrer!
Hello,
I was running a document through OGER for NER using the Human Phenotype Ontology (HP). The document contained the phrase "restrictive deficit on pulmonary function test", which went untagged even though the dictionary lists the phrase "restrictive deficit on pulmonary function tests" as a synonym with a CURIE (HP:0002091). Notice that the only difference is the 's' at the end of the word 'test'. While debugging this, I added an 's' to the phrase in the document (test => tests), and OGER then recognized it.
I was under the impression that OGER accounted for this through lemmatization of words in a phrase too. Please correct me if I'm wrong.
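To illustrate why a stemming normalization lets the two phrases match, here is a minimal sketch. The normalize function below is a crude plural stripper written only for this example. It is neither the Porter nor the Lancaster stemmer, and it does not reflect how OGER builds its index.

```python
def normalize(phrase):
    """Lowercase, split on whitespace, and strip a trailing 's' from longer tokens.

    A crude stand-in for a real stemmer, for illustration only.
    """
    tokens = phrase.lower().split()
    return tuple(t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens)


# Dictionary synonym from the report above, indexed under its normalized form.
dictionary = {
    normalize("restrictive deficit on pulmonary function tests"): "HP:0002091",
}


def lookup(phrase):
    """Return the identifier for a phrase, comparing normalized token tuples."""
    return dictionary.get(normalize(phrase))


mention = "restrictive deficit on pulmonary function test"
exact_hit = mention == "restrictive deficit on pulmonary function tests"

print(exact_hit)        # exact string comparison
print(lookup(mention))  # lookup after normalization
```

Because both the dictionary entry and the document phrase are reduced to the same token tuple, the lookup succeeds even though the raw strings differ by one character.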