INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Having hierarchical views inside blacklab? #12

Closed danyaljj closed 8 years ago

danyaljj commented 8 years ago

Is it possible to have a hierarchical view for some of the views? Why is this important? Suppose you have the following three views: 1) A raw text view 2) A POS tag view 3) A taxonomy view: a tree-structured view of words and how they are hierarchically connected to each other. For example, one hierarchy is:

=> respiratory disease => disease => illness => ill health => pathological state => physical condition => condition => state => attribute => abstraction => entity

The question is: can I query for any NN (noun) that is in a subcategory of disease?

I suspect that Lucene's faceted search might be helpful for this, if it is available inside BlackLab. https://chimpler.wordpress.com/2013/01/30/faceted-search-with-lucene/, http://stackoverflow.com/questions/14852995/tree-search-with-lucene

@dirkgr : any idea? (this might be of interest to you as well) FYI @cttsai

dirkgr commented 8 years ago

Faceted search is different, because it applies at a document level. You want facets at a word level. I think this can be done with a single extra view that captures the position in the hierarchy, combined with a prefix search.

For example, you have the word "pneumonia", and so you put the string "/entity/abstraction/attribute/state/.../illness/disease/respiratory_disease" into the extra view. Then you can query for "/entity/abstraction/attribute/state/.../illness/*" to get all mentions of illnesses.
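A minimal sketch of that idea in plain Java, without any Lucene or BlackLab calls: each word gets a path string in an extra view, and a category query becomes a prefix test on that string. The class and method names (`HierarchyPathSketch`, `toPath`, `isUnder`, `wordsUnder`) are hypothetical, and the hierarchy is shortened toy data rather than the full chain from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HierarchyPathSketch {

    // Hypothetical helper: joins hypernyms (most general first) into one path string.
    static String toPath(List<String> generalToSpecific) {
        return "/" + String.join("/", generalToSpecific);
    }

    // A token falls under a category if its path equals the category path or
    // continues below it. Appending "/" keeps "/entity/ill" from matching
    // "/entity/illness".
    static boolean isUnder(String tokenPath, String categoryPath) {
        return tokenPath.equals(categoryPath) || tokenPath.startsWith(categoryPath + "/");
    }

    // Returns the words whose path lies under the given category, in insertion order.
    static List<String> wordsUnder(Map<String, String> pathByWord, String categoryPath) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : pathByWord.entrySet()) {
            if (isUnder(entry.getValue(), categoryPath)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy data (assumption): a shortened hierarchy, not the real taxonomy.
        Map<String, String> pathByWord = new LinkedHashMap<>();
        pathByWord.put("pneumonia",
                toPath(List.of("entity", "illness", "disease", "respiratory_disease")));
        pathByWord.put("measles", toPath(List.of("entity", "illness", "disease")));
        pathByWord.put("fox", toPath(List.of("entity", "animal")));

        System.out.println(wordsUnder(pathByWord, "/entity/illness")); // [pneumonia, measles]
    }
}
```

In a real index the prefix test would presumably be a prefix or wildcard query on the extra view, so only the path encoding carries over from this sketch.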

jan-niestadt commented 8 years ago

Yes, that sounds like a good way to do it. The other possibility is to transform a query like "illness" to a query that combines all the different illnesses using OR, but that's probably less practical.
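For comparison, the OR-based alternative could look roughly like the sketch below: collect every term below the category and join the terms into one alternation. The names (`OrExpansionSketch`, `descendants`, `toTokenPattern`) and the toy taxonomy are assumptions, the output is only a CQL-style token pattern string, and the terms are assumed to contain no regex metacharacters and the taxonomy no cycles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrExpansionSketch {

    // Breadth-first walk over the taxonomy: returns every term below the category,
    // not including the category itself.
    static List<String> descendants(Map<String, List<String>> children, String category) {
        List<String> found = new ArrayList<>();
        Deque<String> todo = new ArrayDeque<>(children.getOrDefault(category, List.of()));
        while (!todo.isEmpty()) {
            String term = todo.removeFirst();
            found.add(term);
            todo.addAll(children.getOrDefault(term, List.of()));
        }
        return found;
    }

    // Joins the terms into one alternation on the given property.
    static String toTokenPattern(String property, List<String> terms) {
        return "[" + property + "=\"" + String.join("|", terms) + "\"]";
    }

    public static void main(String[] args) {
        // Toy taxonomy (assumption), keyed by parent term.
        Map<String, List<String>> children = new HashMap<>();
        children.put("illness", List.of("disease"));
        children.put("disease", List.of("respiratory_disease", "measles"));
        children.put("respiratory_disease", List.of("pneumonia"));

        System.out.println(toTokenPattern("lemma", descendants(children, "illness")));
    }
}
```

The pattern grows with the number of terms under the category, which is one reason this route is likely to be less practical for broad categories.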

danyaljj commented 8 years ago

Maybe the example that I gave was not good. As I see it, BlackLab is designed to work very well with linear annotations (e.g. NER, POS, lemma), where there are labels for each word or for spans of consecutive words.

However, I don't know how to make it work for annotations like relations between pairs of words, for example the relations that show the dependency between pairs of words: http://nlp.stanford.edu/software/stanford-dependencies.shtml Instead of having a label per word/span, the dependency annotation has labels per pair of words/spans. It would be nice to extend BlackLab to handle such relational indexing/querying.

PS. There are a lot of annotations which are relational and can be represented with trees. E.g.

jan-niestadt commented 8 years ago

Okay, I see what you mean now. This is something we've discussed occasionally, and would like to be able to do as well, but unfortunately we haven't thought of a good generic way to do it yet.

Parts of it you can do now with the XML tag search features (i.e. find sentences containing two noun phrases), but you can't search for elements that refer to each other by id, for example.

To enable that kind of thing, you would probably at least need to store XML attributes in the forward index and use those to answer queries in multiple stages. So for example: find all verbs, then find all subject-ids they are associated with (through the XML attribute forward index), then find all those subjects. Ideally you would want this to work not just with single words but groups of words referring to other groups of words as well.

Suggestions (or full implementations :-) are welcome.

danyaljj commented 8 years ago

I see. I need to first understand what BlackLab is doing; I will need to spend some time on your implementation. Also, one thing I am trying to understand is whether a Lucene-based implementation (like BlackLab) is the right way to attack this problem (a set of database operations might solve this as well; whether it would be efficient, I don't know). I will come back soon with more ideas.

jan-niestadt commented 8 years ago

Adding this functionality to BlackLab would allow you to combine it with the kinds of searches that are already supported. Of course, if you have no need for the kinds of searches BlackLab enables, another approach (more hierarchical or relational) might suit you better.

BlackLab is designed to deal with sequential streams of tokens; relationships between non-adjacent words (or groups of words) are a significant new feature to add. But as I said, I hope we can do it some day. Any help is appreciated. Good luck!

danyaljj commented 8 years ago

I took a look at Example.java and tried to come up with a solution. See if it sounds any good.

Suppose we want to index the following dependency parse:

[Image: screenshot of the dependency parse]

Following the example in the definition of Example.java, we define the input to be:

    static String[] testData = {
        "<doc>" + "<w l='the'   p='art' depIn='det0'>The</w> "
        + "<w l='quick' p='adj' depIn='amod0'>quick</w> "
        + "<w l='brown' p='adj' depIn='amod1'>brown</w> "
        + "<w l='fox'   p='nou' depIn='nsubj0' depOut='det0;amod0;amod1'>fox</w> "
        + "<w l='jump'  p='vrb' depIn='nmod0'   depOut='nsubj0'>jumps</w> "
        + "<w l='over'  p='pre' depIn='case0'>over</w> "
        + "<w l='the'   p='art' depIn='det0'>the</w> "
        + "<w l='lazy'  p='adj' depIn='amod2'>lazy</w> "
        + "<w l='dog'   p='nou' depIn='nmod0'   depOut='case0;det0;amod0'>dog</w>" + ".</doc>"
    };

Essentially, each edge in the graph gets converted into two labels: a depOut label on the token the edge leaves and a depIn label on the token it enters. Since some nodes might have more than one incoming/outgoing edge, one property might have more than one value, for example depOut='case0;det0;amod0' for dog.
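The conversion from edges to labels can be sketched in a few lines of plain Java. This is only an illustration under assumptions: `EdgeLabelSketch`, `Edge` and `label` are hypothetical names, ids are numbered per relation in the order the edges are listed, and token positions are 0-based.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EdgeLabelSketch {

    /** One dependency edge from a head token to a dependent token (0-based positions). */
    static final class Edge {
        final int head;
        final int dependent;
        final String relation;

        Edge(int head, int dependent, String relation) {
            this.head = head;
            this.dependent = dependent;
            this.relation = relation;
        }
    }

    /** Returns labels[i][0] = depIn and labels[i][1] = depOut for token i. */
    static String[][] label(int tokenCount, List<Edge> edges) {
        List<List<String>> in = new ArrayList<>();
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < tokenCount; i++) {
            in.add(new ArrayList<>());
            out.add(new ArrayList<>());
        }
        // Per-relation counter, so ids come out as det0, amod0, amod1, ...
        Map<String, Integer> seen = new HashMap<>();
        for (Edge e : edges) {
            int n = seen.merge(e.relation, 1, Integer::sum) - 1;
            String id = e.relation + n;
            out.get(e.head).add(id);     // the edge leaves the head
            in.get(e.dependent).add(id); // and enters the dependent
        }
        String[][] labels = new String[tokenCount][2];
        for (int i = 0; i < tokenCount; i++) {
            labels[i][0] = String.join(";", in.get(i));
            labels[i][1] = String.join(";", out.get(i));
        }
        return labels;
    }

    public static void main(String[] args) {
        // "The quick brown fox": fox (position 3) is the head of the other three words.
        String[] words = {"The", "quick", "brown", "fox"};
        List<Edge> edges = List.of(
                new Edge(3, 0, "det"), new Edge(3, 1, "amod"), new Edge(3, 2, "amod"));
        String[][] labels = label(words.length, edges);
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i] + " depIn='" + labels[i][0]
                    + "' depOut='" + labels[i][1] + "'");
        }
    }
}
```

With this numbering every edge id is unique within the sentence, which is what lets a depOut value be matched against exactly one depIn value later on.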

Some more changes need to be applied in the DocIndexerExample handler. We can do the following:

@Override
public void startElement(String uri, String localName, String qName,
        Attributes attributes) {
    super.startElement(uri, localName, qName, attributes);
    // Add exactly one value per property per token so positions stay aligned.
    propLemma.addValue(attributes.getValue("l"));
    propPartOfSpeech.addValue(attributes.getValue("p"));
    propPunct.addValue(consumeCharacterContent());
    // getValue returns null when the attribute is absent; index "" in that case.
    String depIn = attributes.getValue("depIn");
    propDepIn.addValue(depIn == null ? "" : depIn);
    String depOut = attributes.getValue("depOut");
    propDepOut.addValue(depOut == null ? "" : depOut);
}

It might make sense to split the values with ; and addValue each (is this correct?)

@Override
public void startElement(String uri, String localName, String qName,
        Attributes attributes) {
    super.startElement(uri, localName, qName, attributes);
    propLemma.addValue(attributes.getValue("l"));
    propPartOfSpeech.addValue(attributes.getValue("p"));
    propPunct.addValue(consumeCharacterContent());
    // getValue returns null when the attribute is absent; treat that as one empty value.
    String depIn = attributes.getValue("depIn");
    String[] valsIn = (depIn == null ? "" : depIn).split(";");
    propDepIn.addValue(valsIn[0]);
    // A position increment of 0 puts the extra values on the same token.
    for (int i = 1; i < valsIn.length; i++)
        propDepIn.addValue(valsIn[i], 0);
    String depOut = attributes.getValue("depOut");
    String[] valsOut = (depOut == null ? "" : depOut).split(";");
    propDepOut.addValue(valsOut[0]);
    for (int i = 1; i < valsOut.length; i++)
        propDepOut.addValue(valsOut[i], 0);
}

What do you think @jan-niestadt ? FYI @kavyasrinet

jan-niestadt commented 8 years ago

Yes, that would work in principle. And the way you added multiple property values at one token position is correct.

However, if you choose to split the ;-separated string and add them as separate values, right now only the first is saved in the forward index. So you'd either need to make it possible to store multiple values per token in the forward index, or you'd need to stick with the original ;-separated string as the property value.

Note also that, if every dependency in your corpus gets a unique id, the terms index for the forward index will get very large (because all terms are unique).

You would need to add TextPattern and corresponding SpanQuery/Spans classes that "follow a dependency", i.e. for the "nsubj" relation, produce all subject tokens pointed to by the input tokens. It would do this by querying the forward index for the attribute value, parsing it to find the "nsubj" relation id, and then finding tokens where depIn contains that relation id. This does mean an extra forward index query and an extra reverse index query for each input token, so it may be slow for large result sets.
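To make the two lookups concrete, here is a rough in-memory sketch. It does not use BlackLab's TextPattern, SpanQuery or Spans classes: an array stands in for the forward index, a map stands in for the reverse index, and all names (`FollowDependencySketch`, `buildReverseIndex`, `follow`, `relationName`) are hypothetical. It also assumes relation ids are unique within the data being searched and that a relation id is a relation name followed by digits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FollowDependencySketch {

    // Stand-in for the reverse index: relation id -> token positions whose depIn has it.
    static Map<String, List<Integer>> buildReverseIndex(String[] depIn) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int pos = 0; pos < depIn.length; pos++) {
            if (depIn[pos].isEmpty()) continue;
            for (String id : depIn[pos].split(";")) {
                index.computeIfAbsent(id, k -> new ArrayList<>()).add(pos);
            }
        }
        return index;
    }

    // "nsubj0" -> "nsubj": strips the trailing digits of a relation id.
    static String relationName(String id) {
        return id.replaceAll("\\d+$", "");
    }

    // For each input token: one forward lookup (its depOut string), then one reverse
    // lookup per matching relation id.
    static List<Integer> follow(List<Integer> inputTokens, String relation,
            String[] depOutForward, Map<String, List<Integer>> depInReverse) {
        List<Integer> targets = new ArrayList<>();
        for (int pos : inputTokens) {
            String value = depOutForward[pos]; // forward index stand-in
            if (value.isEmpty()) continue;
            for (String id : value.split(";")) {
                if (relationName(id).equals(relation)) {
                    targets.addAll(depInReverse.getOrDefault(id, List.of()));
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Toy data (assumption) for "The quick brown fox jumps", positions 0..4.
        String[] depIn = {"det0", "amod0", "amod1", "nsubj0", ""};
        String[] depOut = {"", "", "", "det0;amod0;amod1", "nsubj0"};
        Map<String, List<Integer>> reverse = buildReverseIndex(depIn);

        System.out.println(follow(List.of(4), "nsubj", depOut, reverse)); // [3]
        System.out.println(follow(List.of(3), "amod", depOut, reverse));  // [1, 2]
    }
}
```

The loop shape shows where the cost comes from: the work grows with the number of input tokens, because each one triggers its own forward lookup and at least one reverse lookup.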