KorAP / Krill

A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License

MultipleSpanDistanceQuery with Wildcards Bug #33

Closed margaretha closed 7 years ago

margaretha commented 7 years ago

Franck Bodmer reported that the following queries do not yield any results:

Akron commented 7 years ago

Should be fixed in https://github.com/KorAP/Krill/commit/291baad8872b4335cd483fdb501ba3b211e41c99 - @margaretha can you confirm this?

margaretha commented 7 years ago

The queries meine? /+w1:2,s0 &Erfahrung and meine+ /+w1:2,s0 &Erfahrung still do not work. See https://github.com/KorAP/Krill/blob/wildcards/src/test/java/de/ids_mannheim/korap/index/TestSampleIndex.java

Akron commented 7 years ago

Thank you! This seems to be a completely different bug. I added non-index tests in https://github.com/KorAP/Krill/commit/d273e1c7b30454b1290587fa47b50db307daa27a . The problem occurs when the rewrite results in SpanOr([]). Some queries seem to have problems with this situation. I made a fix for classes, but there is at least a problem with SpanMultiDistanceQuery left. It may be fixable in SimpleSpanQuery though.

margaretha commented 7 years ago

Hi Nils,

thanks for the test for multiple distance query. I have fixed it, but now SimpleSpans allow spans with null field because spanOr([]) has null field. I am not sure if it will be a problem later.

'+' is not a valid WildCard in KoralQuery. And it is not interpreted correctly, see https://github.com/KorAP/Koral/issues/13 .

Ah ic. I'll fix this.

Where is "meine."?

I mean meine without fullstop. It is not included in spanOr for meine?

Akron commented 7 years ago

Thank you for the fix.

I have fixed it, but now SimpleSpans allow spans with null field because spanOr([]) has null field. I am not sure if it will be a problem later.

Yes, in the classes fix I also test for field and return nothing, if no field is found.

I mean meine without fullstop. It is not included in spanOr for meine?

Following the definition of wildcards in KoralQuery, that's the correct behaviour. It's also correct following the COSMAS 2 definition.

margaretha commented 7 years ago

Following the definition of wildcards in KoralQuery, that's the correct behaviour. It's also correct following the COSMAS 2 definition.

I don't get this. Does Krill not implement normal regex, and is that why C2 quantifiers have to be rewritten into normal regex? E.g. meine+ should be rewritten into meine.? according to https://github.com/KorAP/Koral/issues/13 and thus meine should be included in the corresponding spanOr.

However, I made some tests for the wildcards and found the spanOr behavior not as expected. meine* is fine.

spans(spanOr([tokens:s:meine, tokens:s:meinem, tokens:s:meinen, tokens:s:meiner, tokens:s:meinerseits, tokens:s:meines, tokens:s:meinesgleichen, tokens:s:meinesteils, tokens:s:meinetwegen, tokens:s:meinetwillen]))@START

meine.* generates

spans(spanOr([]))

meine?

spans(spanOr([tokens:s:meinem, tokens:s:meinen, tokens:s:meiner, tokens:s:meines]))

meine.?

spans(spanOr([]))

So apparently the full stop is not needed as a wildcard symbol, and meine is missing from the results for meine? (but you said it depends on the dictionary, so maybe the dictionary is incomplete).
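The expansions listed above are consistent with wildcard semantics in which ? stands for exactly one character, * for any sequence of characters, and a full stop is an ordinary literal. The following is a minimal sketch of that behaviour, not Krill or Lucene code: it emulates the matching with java.util.regex over the ten terms quoted above (without the s: prefix), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; emulates wildcard matching with java.util.regex.
final class WildcardExpansionSketch {

    // Terms copied from the spanOr expansion of meine* quoted above.
    static final List<String> TERMS = Arrays.asList(
        "meine", "meinem", "meinen", "meiner", "meinerseits", "meines",
        "meinesgleichen", "meinesteils", "meinetwegen", "meinetwillen");

    // '?' = exactly one character, '*' = any sequence; everything else is literal.
    static Pattern wildcardToPattern(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // Returns all dictionary terms that the wildcard matches completely.
    static List<String> expand(String wildcard) {
        Pattern p = wildcardToPattern(wildcard);
        List<String> hits = new ArrayList<>();
        for (String term : TERMS) {
            if (p.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"meine*", "meine?", "meine.*", "meine.?"}) {
            System.out.println(w + " -> " + expand(w));
        }
    }
}
```

Under these assumptions, meine? yields only meinem, meinen, meiner and meines, because ? requires one additional character, while meine.* and meine.? yield nothing, because no term contains a literal full stop.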

Akron commented 7 years ago

I don't get this. Does Krill not implement normal regex, and is that why C2 quantifiers have to be rewritten into normal regex?

Krill supports Regex (type:regex in KoralQuery) and WildCards (type:wildcard). But the support for wildcards is restricted to ? and *, which corresponds to the support by DOS and POSIX (see my mail from 2017-08-18). C2 has another wildcard + and whenever this is part of a C2-Wildcard search, Koral needs to rewrite it to a regex (including all other wildcard symbols in the string). And of course it needs to change the KQ type to type:regex.

E.g. meine+ should be rewritten into meine.? according to KorAP/Koral#13 and thus meine should be included in the corresponding spanOr.

Of course it needs to have the correct type then.
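The rewrite described above could be sketched roughly as follows. This is not Koral's actual implementation: the class and method names are hypothetical, the mapping of + to .? is taken from the example quoted in this thread (meine+ to meine.?), and the mappings of ? to . and * to .* are the usual regex equivalents. Escape sequences in the C2 input are not handled.

```java
// Hypothetical helper, not part of Koral or Krill.
final class C2WildcardRewriteSketch {

    // '+' has no counterpart in type:wildcard, so its presence forces a regex.
    static boolean needsRegex(String term) {
        return term.indexOf('+') >= 0;
    }

    // Rewrites every wildcard symbol in the term, not only '+'.
    static String toRegex(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '?': sb.append('.');  break;  // exactly one character
                case '*': sb.append(".*"); break;  // any sequence of characters
                case '+': sb.append(".?"); break;  // at most one character (per this thread)
                default:
                    // Escape anything that could be a regex operator.
                    if (!Character.isLetterOrDigit(cp)) {
                        sb.append('\\');
                    }
                    sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    // The KoralQuery type has to change together with the value.
    static String koralType(String term) {
        return needsRegex(term) ? "type:regex" : "type:wildcard";
    }

    public static void main(String[] args) {
        for (String t : new String[] {"meine*", "meine?", "meine+", "m+n"}) {
            String value = needsRegex(t) ? toRegex(t) : t;
            System.out.println(t + " -> " + koralType(t) + " " + value);
        }
    }
}
```

With this sketch, meine* and meine? stay type:wildcard unchanged, while meine+ becomes type:regex with the value meine.? and m+n becomes m.?n.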

Akron commented 7 years ago

I have fixed the test suite accordingly in https://github.com/KorAP/Krill/commit/8e504ce1413dc2caa67f32c9519b8ff5075e7280 .

P.S. These tests should preferably be in the non-sample-index part.

Akron commented 7 years ago

Can this issue be closed?

margaretha commented 7 years ago

Well, I guess the problem is solved. But I am still not clear about the difference between wildcard and regex, and why we need to differentiate them in the implementation. The way I see the test you fixed,

RegexpQuery(new Term("tokens", "s:meine.*")) is practically the same as 
WildcardQuery(new Term("tokens", "s:meine*"))

so every WildcardQuery can be written in RegexpQuery, can't it?

What is p in RegexpQuery btw?

Akron commented 7 years ago

Yes, every WildCard can be rewritten to a Regex, and internally both are represented similarly (as an FSA in Lucene). We could remove Wildcards from KoralQuery support altogether - it wouldn't make a real difference.

What is p in RegexpQuery btw?

What do you mean?

margaretha commented 7 years ago

Yes, every WildCard can be rewritten to a Regex, and internally both are represented similarly (as an FSA in Lucene). We could remove Wildcards from KoralQuery support altogether - it wouldn't make a real difference.

ic, I think that is nicer for KQ.

What do you mean?

Regex"p"Query

Well, I guess it's just an abbreviation of "expression". Lucene calls it regexp instead of regex ;)

Akron commented 7 years ago

ic, I think that is nicer for KQ.

There may be some disadvantages though. E.g.

Regex"p"Query

Regex and Regexp are both common abbreviations for Regular Expressions.

margaretha commented 7 years ago

we don't define Regex in KQ, but we define Wildcards

what do you mean when we have type:regex and Wildcard + has to be rewritten into regex?

implementations for Wildcards may be implemented more efficiently than Regexes.

can be adjusted in the deserialization if it is really necessary

Akron commented 7 years ago

I mean: We use the Lucene subset of Regex, and KQ does not force any specific syntax for Regexes. It's pretty much not standardized and up to the implementers. For Wildcards we define ? and *. So an implementer may decide not to support Regex at all in the backend, or roll out their own syntax for Regex, whereas supporting Wildcards means they need to be implemented the way they are specified in KQ.

margaretha commented 7 years ago

Ic, so Wildcards are the default (higher priority) and regex is supplementary.

Nevertheless, the KQ implementation would not be straightforward in the case of mixed Wildcards such as m+n. By default (Wildcards), + would be treated as a normal character, but we would rather treat it as a regex and rewrite it to m.?n. So it seems that Wildcards are less useful than regex. Which QL would only need wildcards without regex? Even for C2 we need both.

Akron commented 7 years ago

I am closing this issue and will continue the thread in https://github.com/KorAP/Koral/issues/13 .