myedibleenso opened this issue 8 years ago
Isn't a gazetteer pretty much what the KB lookup tables and the lookup routines are?
What we're aiming for is something more like any entry in file X would be assigned the label Y. So we could detect everything in our Uniprot KB and assign it the label Protein, but grounding those would have to be something separate.
From the gazetteer we'd build an efficient state machine for matching.
We haven't discussed the syntax this would use, but perhaps something like this:
```yaml
name: "gazetteer-example"
gazetteer: "pastas.txt"
type: token
label: Pasta
pattern: |
  [gazetteer=/./]+ [word=pasta]?
```
where the contents of `pastas.txt` would look like:

```
linguini
angel hair
ravioli
...
gnocchi
```
Nice example. But if you're really looking up Proteins, Genes, Families, etc., then it's worth pointing out that these are already in memory in an efficient data structure (a Map). So an initial quick-and-dirty gazetteer function could be an existence test for the lookup value in the key set of the appropriate KB.
If you create a more general-purpose gazetteer and all you need is an existence test of the value, then I suggest a trie structure for each gazetteer: a compact representation with quick, efficient lookup and no need for state-machine machinery.
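For illustration, here's a minimal Python sketch of that existence-test idea (this is just the concept, not the actual API of any class in processors):

```python
class Trie:
    """Minimal character trie supporting insertion and exact membership tests."""

    def __init__(self, entries=()):
        self.root = {}  # each node is a dict: char -> child node
        for entry in entries:
            self.add(entry)

    def add(self, entry):
        node = self.root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[""] = True  # empty-string key marks the end of a complete entry

    def __contains__(self, entry):
        node = self.root
        for ch in entry:
            if ch not in node:
                return False
            node = node[ch]
        return "" in node

# An existence test against a small gazetteer:
pastas = Trie(["linguini", "angel hair", "ravioli", "gnocchi"])
print("ravioli" in pastas)  # True
print("ravi" in pastas)     # False: prefixes of entries are not matches
```

Lookup cost is bounded by the length of the queried string, independent of how many entries the gazetteer holds.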
This issue is for a general purpose enhancement to Odin (gazetteer-based matchers). It may never be used in REACH.
As far as the implementation, we were considering a trie but we are still undecided given some of our requirements. Thanks for the suggestion though.
The RuleNER already uses a trie that I optimized for our purposes. See `e.a.s.struct.HashTrie`. And yes, it would be nice to have Odin support for this.
Maybe we should explore algorithms like Aho–Corasick, Commentz-Walter, and Rabin–Karp that are designed for multiple string matching. It may be worthwhile since we will be doing this a lot.
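To make the comparison concrete, here's a compact Aho-Corasick sketch in Python (a toy illustration, not a proposal for the actual implementation): build a trie over the patterns, add failure links with a BFS, then find every occurrence of every pattern in a single pass over the text.

```python
from collections import deque

def build_automaton(patterns):
    """Trie over patterns plus BFS-computed failure links and output sets."""
    goto, fail, out = [{}], [0], [set()]  # state 0 is the root
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending at the fallback state
    return goto, fail, out

def find_all(text, automaton):
    """Return (start, pattern) for every occurrence of every pattern in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits

ac = build_automaton(["angel hair", "hair", "ravioli"])
print(sorted(find_all("angel hair ravioli", ac)))
# [(0, 'angel hair'), (6, 'hair'), (11, 'ravioli')]
```

Note that overlapping and nested matches ("hair" inside "angel hair") come out in the same pass, which a plain per-token trie lookup would not give us.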
There is also http://www.tgries.de/agrep/doc/agrep2ps.zip
From Gus's example I assumed that we were just doing a lookup of a single literal string. If that's the case, the trie isn't doing multiple string matching and should be very fast. If I remember right, worst case is on the order of O(n), where n is the string length, and on average something like log_K(average string length), where K is the average branching factor (related to the size of the character set in use).
Anyway, it sounds like a thing that would be assigned to the programmer so that you guys can be freed up to do research. ;)
The idea is to search for all occurrences of all the patterns in the gazetteer inside the document. The same gazetteer would be applied over multiple documents. This has been studied a lot, and I believe we should at least have a reason to reject those algorithms. I believe that Aho-Corasick uses a trie internally, but there is more to it than just that. And if we settle on just using a trie, wouldn't a DAWG be more memory efficient?
A DAWG is more commonly referred to as a DAFSA: https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
Thanks for the clarification: I misunderstood the intention of the new gazetteer as a simple lookup. I see now why those other algorithms would be relevant and important to investigate.
Re DAWG: yes, that makes sense: "By allowing the same vertices to be reached by multiple paths, a DAFSA may use significantly fewer vertices than the strongly related trie data structure."
This is a nice algorithm for constructing the DAFSA incrementally. http://www.aclweb.org/anthology/J00-1002
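The cited paper builds the DAFSA incrementally; as a simpler illustration of the space saving itself, this sketch (hypothetical code, not from the paper) builds a plain trie and then merges equivalent suffix states bottom-up by hashing subtree signatures:

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # terminal marker
    return root

def count_trie_nodes(node):
    return 1 + sum(count_trie_nodes(child)
                   for ch, child in node.items() if ch != "$")

def minimize(node, registry):
    """Assign each distinct subtree a canonical id; equivalent suffixes merge."""
    signature = (node.get("$", False),
                 tuple(sorted((ch, minimize(child, registry))
                              for ch, child in node.items() if ch != "$")))
    return registry.setdefault(signature, len(registry))

words = ["gnocchi", "linguini", "ravioli"]
trie = build_trie(words)
registry = {}
minimize(trie, registry)
print(count_trie_nodes(trie))  # 23 trie nodes
print(len(registry))           # 19 DAFSA states: the shared "...i" suffixes merge
```

Even on three words sharing only a final "i", a few states merge; over a real gazetteer with heavy suffix overlap (e.g. protein family names) the reduction should be much larger.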
Based on discussions at the event in Portland today, I feel this is actually very important for getting started with new domains. We will implement an initial version over the next few weeks.
I am going to go ahead and impose a deadline for the first version as the end of June. Notice it has been nearly a year since we last had activity on this issue!
Implementation aside, what should usage of a gazetteer look like? Previously I introduced this example:
```yaml
name: "gazetteer-example"
gazetteer: "pastas.txt"
type: token
label: Pasta
pattern: |
  [gazetteer=/./]+ [word=pasta]?
```
If the use case is simply to see if a token is in the specified gazetteer, I'm not sure `[gazetteer=/./]` is the way to go. Something like `[gazetteer=/^ling/]` can be handled with an additional token constraint (ex. `& lemma=/^ling/`). I think we really just need to assert membership, but `[gazetteer=true]` seems weird to me.
Here is another idea:
```yaml
resources:
  - gazetteers:
      # gazetteer name -> path/to/resource
      pastas: path/to/pastas.txt
      cities: path/to/cities.txt
      states: path/to/states.txt

rules:
  - name: "gazetteer-example-1"
    label: Pasta
    type: token
    pattern: |
      [gazetteer(pastas) & lemma=/^lin/]
      [word=pasta]?

  - name: "gazetteer-example-2"
    label: Location
    type: token
    pattern: |
      [gazetteer(cities) | gazetteer(states)]+
```
Thoughts?
I assume the gazetteer should match against a token's `word` attribute.
Should gazetteer-based matching be case sensitive or not?
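One common option (just a sketch of the trade-off, with hypothetical helper names) is to decide case handling once at load time and apply the same normalization at lookup time:

```python
def load_gazetteer(lines, case_sensitive=False):
    """Load non-empty gazetteer entries, folding case unless asked not to."""
    norm = (lambda s: s) if case_sensitive else str.lower
    return {norm(line.strip()) for line in lines if line.strip()}

def in_gazetteer(word, gazetteer, case_sensitive=False):
    """Membership test for a token's word attribute, with matching normalization."""
    norm = (lambda s: s) if case_sensitive else str.lower
    return norm(word) in gazetteer

cities = load_gazetteer(["Tucson", "Portland"])
print(in_gazetteer("tucson", cities))  # True

strict = load_gazetteer(["Tucson"], case_sensitive=True)
print(in_gazetteer("tucson", strict, case_sensitive=True))  # False
```

Making the flag configurable per gazetteer seems safest: city lists benefit from case-insensitive matching, while acronym-heavy lists (gene symbols, etc.) may need exact case.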
@myedibleenso: We are now closer to this, due to the new LexiconNER class. We still need to plug this into Odin, though.
Did this ever happen?
We'd like to add an efficient matcher for terms in a gazetteer.
Notes