clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0
418 stars 100 forks source link

gazetteer matcher for Odin #66

Open myedibleenso opened 8 years ago

myedibleenso commented 8 years ago

We'd like to add an efficient matcher for terms in a gazetteer.

Notes

hickst commented 8 years ago

Isn't a gazetteer pretty much what the KB lookup tables and the lookup routines are?

myedibleenso commented 8 years ago

What we're aiming for is something more like any entry in file X would be assigned the label Y. So we could detect everything in our Uniprot KB and assign it the label Protein, but grounding those would have to be something separate.

From the gazetteer we'd build an efficient state machine for matching.

We haven't discussed the syntax this would use, but perhaps something like this:

name: "gazetteer-example"
gazetteer: "pastas.txt"
type: token
label: Pasta
pattern: |
  [gazetteer=/./]+ [word=pasta]?

where the contents of pastas.txt would look like:

linguini
angel hair
ravioli
...
gnocchi
hickst commented 8 years ago

Nice example. But, if you're really looking up Proteins, Gene, Families, etc. then it's worth pointing out that these are already in memory in an efficient data structure (a Map). So an initial quick-and-dirty gazetteer function could be an existance test for the lookup value in the key set of the appropriate KB.

If you will create a more general purpose gazetteer and all you need is an existance test of the value, then I suggest a Trie structure for each gazetteer; a compact representation with quick, efficient lookup and no need for state-machine machinery.

myedibleenso commented 8 years ago

This issue is for a general purpose enhancement to Odin (gazetteer-based matchers). It may never be used in REACH.

As far as the implementation, we were considering a trie but we are still undecided given some of our requirements. Thanks for the suggestion though.

MihaiSurdeanu commented 8 years ago

The RuleNER already uses a trie that I optimized for our purposes. See e.a.s.struct.HashTrie. And yes, it would be nice to have Odin support for this.

On Thu, Jun 16, 2016 at 6:20 AM, Gus Hahn-Powell notifications@github.com wrote:

This issue is for a general purpose enhancement to Odin (gazetteer-based matchers). It may never be used in REACH.

As far as the implementation, we were considering a trie but we are still undecided given some of our requirements. Thanks for the suggestion though.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/clulab/processors/issues/66#issuecomment-226378575, or mute the thread https://github.com/notifications/unsubscribe/ABH-zrwq5ABiUtkqN0_7limzknT4sYZhks5qMMDvgaJpZM4I25i1 .

marcovzla commented 8 years ago

Maybe we should explore algorithms like Aho–Corasick, Commentz-Walter, and Rabin–Karp that are designed for multiple string matching. It may be worthwhile since we will be doing this a lot.

There is also http://www.tgries.de/agrep/doc/agrep2ps.zip

hickst commented 8 years ago

From Gus's example I assumed that we were just doing a lookup of a single literal string. If that's the case, the Trie isn't doing multiple string matching and should be very fast. If I remember right, worst case on the order of O(n) where n is the string length and on average something like logK(avg string length) where K is the average branching factor (related to the size of the character set in use).

Anyway, it sounds like a thing that would be assigned to the programmer so that you guys can be freed up to do research. ;)

marcovzla commented 8 years ago

The idea is to search for all occurrence of all the patterns in the gazetteer inside the document. The same gazetteer would be applied over multiple documents. This has been studied a lot and I believe we should at least have a reason to reject those algorithms. I believe that Aho-Corasick uses a trie internally, but there is more to it than just that. And if we settle with just using a trie, wouldn't a dawg be more memory efficient?

marcovzla commented 8 years ago

dawg is more commonly referred to as dafsa: https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

hickst commented 8 years ago

Thanks for the clarification: I misunderstood the intention of the new gazetteer as a simple lookup. I see now why those other algorithms would be relevant and important to investigate.

hickst commented 8 years ago

Re DAWG: yes, that makes sense: "By allowing the same vertices to be reached by multiple paths, a DAFSA may use significantly fewer vertices than the strongly related trie data structure."

marcovzla commented 8 years ago

This is a nice algorithm for constructing the DAFSA incrementally. http://www.aclweb.org/anthology/J00-1002

myedibleenso commented 7 years ago

Based on discussions at the event in Portland today, I feel this is actually very important for getting started with new domains. We will implement an initial version over the next few weeks.

I am going to go ahead and impose a deadline for the first version as the end of June. Notice it has been nearly a year since we last had activity on this issue!

myedibleenso commented 7 years ago

Implementation said, what should usage of a gazetteer look like? Previously I introduced this example:

name: "gazetteer-example"
gazetteer: "pastas.txt"
type: token
label: Pasta
pattern: |
  [gazetteer=/./]+ [word=pasta]?

If the use case is simply to see if a token is in the specified gazetteer, I'm not sure [gazetteer=/./] is the way to go. Something like[gazetteer=/^ling/] can be handled with an additional token constraint (ex. & lemma=/^ling/). I think we really just need to assert membership, but [gazetteer=true] seems weird to me.

Here is another idea:


resources:
  - gazetteers:
     # gazetteer name -> path/to/resource
     pastas: path/to/pastas.txt
     cities: path/to/cities.txt
     states: path/to/states.txt

rules:
  - name: "gazetteer-example-1"
    label: Pasta
    type: token
    pattern: |
      [gazetteer(pastas) & lemma=/^lin/] 
      [word=pasta]?

  - name: "gazetteer-example-2"
    label: Location
    type: token
    pattern: |
      [gazetteer(cities) | gazetteer(states)]+

Thoughts?

myedibleenso commented 7 years ago

I assume the gazetteer should match against a token's word attribute.

Should gazetteer-based matching be case sensitive or not?

MihaiSurdeanu commented 7 years ago

@myedibleenso: We are now closer to this, due to the new LexiconNER class. We need to plug this in Odin though.

kwalcock commented 1 year ago

Did this ever happen?