KorAP / Krill

:mag: A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License

Sentence length restriction #22

Closed margaretha closed 8 years ago

margaretha commented 8 years ago

Due to the licenses of some resources, only a small amount of text can be shown in matches/query results. Some sentences may contain a large number of words, so a restriction on the sentence length is needed. The restriction should be sent by Kustvakt and handled by Krill during search.

Edit: the restriction should only be sent for nested spanqueries, since it won't reduce Krill workloads otherwise.

Akron commented 8 years ago

I disagree. The licenses are heterogeneous, so this restriction may not apply to the whole VC. Kustvakt may inspect the search results afterwards and alter the response accordingly.

margaretha commented 8 years ago

Since Kustvakt knows which documents in the VC have such a license, can't it divide the VC into two parts (with restriction and without restriction) and run two Krills in parallel?

Akron commented 8 years ago

There may be multiple restrictions - meaning multiple queries need to be run in parallel. But I don't know how this can be beneficial performance-wise ... Regarding your edit: Yes, if the length-restricting SpanQuery is nested (i.e. not the root query), this may speed things up, but for license restrictions it would be on the root, right? And worse, a) the statistical data would exclude valid matches and b) the restriction would mean that a SpanQuery needs to be run for all matches, while a filter afterwards only needs to run for the result set (the matches) - and it's technically trivial.
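As an illustration of the "filter afterwards" idea, here is a minimal sketch of a post-filter that runs only over the returned matches. It is not Krill or Kustvakt code: `Match`, `filter_by_length`, the per-document limit mapping and the sample values are all hypothetical names and data made up for this example. Keeping the limits per document is one way to cope with heterogeneous licenses in a single VC, and the total match count stays untouched for statistics.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical match record; not Krill's actual match class.
    doc_id: str
    start: int  # token offset of the first token in the match
    end: int    # token offset one past the last token


def filter_by_length(matches, max_tokens_by_doc):
    """Return (total, shown).

    total counts every match, so statistics are unaffected.
    shown keeps only matches within their document's length limit.
    Documents without an entry in max_tokens_by_doc are unrestricted.
    """
    shown = []
    for m in matches:
        limit = max_tokens_by_doc.get(m.doc_id)  # None means no restriction
        if limit is None or (m.end - m.start) <= limit:
            shown.append(m)
    return len(matches), shown


# Assumed sample data: only "doc-a" carries a length restriction.
matches = [Match("doc-a", 0, 4), Match("doc-a", 10, 40), Match("doc-b", 3, 60)]
limits = {"doc-a": 20}
total, shown = filter_by_length(matches, limits)
```

The filter touches only the result set, so its cost grows with the number of matches returned rather than with the size of the index.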

margaretha commented 8 years ago

I mean the license restriction on the sentence length; otherwise it would depend on what is restricted, I guess. Well, the idea is to make it more efficient; otherwise we don't need it. For the statistics, maybe it makes sense to report the number of all matches even though some are not allowed to be shown.

Akron commented 8 years ago

I am also talking about the length restriction - but there may be multiple different length restrictions in a VC.

The query would be less efficient than before, so I am :-1: . As we need all matches for the statistics, limitations don't make much sense.

However, I am :+1: for having a length restricting SpanQuery. But not for this license restricting purpose.

margaretha commented 8 years ago

So what would this length-restricting SpanQuery do?

Akron commented 8 years ago

In Poliqarp I would write it like this:

length(2,5: <base/s=s>)

It would search for all sentences and skip all spans that are shorter than 2 or longer than 5 tokens.
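To make the intended semantics concrete, here is a small sketch of such a length restriction applied to a list of spans. It is only a model of the behaviour described above, not an existing Krill or Poliqarp feature: `length_filter` is a hypothetical helper, spans are assumed to be `(start, end)` token offsets with an exclusive end, and both bounds are assumed to be inclusive.

```python
def length_filter(spans, min_len, max_len):
    """Keep spans whose token length lies in [min_len, max_len].

    Each span is a (start, end) pair of token offsets, end exclusive.
    """
    return [(s, e) for (s, e) in spans if min_len <= e - s <= max_len]


# Assumed sample sentence spans with lengths 1, 3, 5 and 6 tokens.
sentences = [(0, 1), (1, 4), (4, 9), (9, 15)]
kept = length_filter(sentences, 2, 5)
```

With the bounds 2 and 5, the one-token and six-token sentences are skipped and the other two are kept.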

margaretha commented 8 years ago

Okay, so is such a query possible in Poliqarp?

Akron commented 8 years ago

No, otherwise we would have implemented it. It may, however, be beneficial. But I guess we shouldn't think about it unless there is need for it, though it's good to have a proposal for such research questions.

margaretha commented 8 years ago

Okay. I am going to close this issue, since there seems to be no "suspended" status or anything like that.