Closed margaretha closed 7 years ago
Should be fixed in https://github.com/KorAP/Krill/commit/291baad8872b4335cd483fdb501ba3b211e41c99 - @margaretha can you confirm this?
The queries meine? /+w1:2,s0 &Erfahrung and meine+ /+w1:2,s0 &Erfahrung still do not work. See https://github.com/KorAP/Krill/blob/wildcards/src/test/java/de/ids_mannheim/korap/index/TestSampleIndex.java
Thank you! This seems to be a completely different bug. I added non-index tests in https://github.com/KorAP/Krill/commit/d273e1c7b30454b1290587fa47b50db307daa27a . The problem occurs when the rewrite results in SpanOr([]). Some queries seem to have problems with this situation. I made a fix for classes, but there is at least one problem left with SpanMultiDistanceQuery. It may be fixable in SimpleSpanQuery though.
Hi Nils,
thanks for the test for multiple distance query. I have fixed it, but now SimpleSpans allow spans with null field because spanOr([]) has null field. I am not sure if it will be a problem later.
'+' is not a valid WildCard in KoralQuery. And it is not interpreted correctly, see https://github.com/KorAP/Koral/issues/13 .
Ah ic. I'll fix this.
Where is "meine."?
I mean meine without fullstop. It is not included in spanOr for meine?
Thank you for the fix.
> I have fixed it, but now SimpleSpans allow spans with null field because spanOr([]) has null field. I am not sure if it will be a problem later.
Yes, in the classes fix I also test for field and return nothing, if no field is found.
> I mean meine without fullstop. It is not included in spanOr for meine?
Following the definition of wildcards in KoralQuery, that's the correct behaviour. It's also correct following the COSMAS 2 definition.
> Following the definition of wildcards in KoralQuery, that's the correct behaviour. It's also correct following the COSMAS 2 definition.
I don't get this. Does Krill not implement normal regex, and is that why C2 quantifiers have to be rewritten into normal regex? E.g. meine+ should be rewritten into meine.? according to https://github.com/KorAP/Koral/issues/13, and thus meine should be included in the corresponding spanOr.
However, I made some tests for the wildcards and found the spanOr behavior not as expected.
meine* is fine:
spans(spanOr([tokens:s:meine, tokens:s:meinem, tokens:s:meinen, tokens:s:meiner, tokens:s:meinerseits, tokens:s:meines, tokens:s:meinesgleichen, tokens:s:meinesteils, tokens:s:meinetwegen, tokens:s:meinetwillen]))@START
meine.* generates
spans(spanOr([]))
meine? generates
spans(spanOr([tokens:s:meinem, tokens:s:meinen, tokens:s:meiner, tokens:s:meines]))
meine.? generates
spans(spanOr([]))
So apparently the full stop is not needed, and meine is missing from the expansion of meine?, but you said it depends on the dictionary, so maybe the dictionary is incomplete.
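To make the expansions reported above easier to compare, here is a minimal sketch that filters the ten terms from the meine* result with plain java.util.regex patterns. It is only an illustration: the class and method names are hypothetical, it does not use Lucene or Krill, and it assumes that the wildcard ? stands for exactly one character and * for any suffix, which is what the results in this thread suggest.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of Krill or Lucene.
class WildcardExpansionDemo {

    // Terms copied from the meine* expansion quoted in this thread
    // (without the "tokens:s:" prefix).
    static final List<String> DICTIONARY = List.of(
        "meine", "meinem", "meinen", "meiner", "meinerseits", "meines",
        "meinesgleichen", "meinesteils", "meinetwegen", "meinetwillen");

    // Returns all dictionary terms that match the regex as a whole.
    static List<String> expand(String regex) {
        Pattern p = Pattern.compile(regex);
        return DICTIONARY.stream()
            .filter(t -> p.matcher(t).matches())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Wildcard meine? read as regex "meine." : exactly one more character.
        System.out.println(expand("meine."));
        // C2 meine+ read as regex "meine.?" : zero or one more character.
        System.out.println(expand("meine.?"));
        // Wildcard meine* read as regex "meine.*" : any suffix.
        System.out.println(expand("meine.*"));
    }
}
```

Under these assumptions, meine is absent from the meine? expansion because the pattern requires one additional character, and it only appears once the pattern is evaluated as the regex meine.? rather than as a wildcard string.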
> I don't get this. Does Krill not implement normal regex, and is that why C2 quantifiers have to be rewritten into normal regex?
Krill supports Regex (type:regex in KoralQuery) and WildCards (type:wildcard). But the support for wildcards is restricted to ? and *, which corresponds to the support by DOS and POSIX (see my mail from 2017-08-18). C2 has another wildcard, +, and whenever this is part of a C2 wildcard search, Koral needs to rewrite it to a regex (including all other wildcard symbols in the string). And of course it needs to change the KQ type to type:regex.
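The rewrite described here could look roughly like the following sketch. It is not Koral's actual code: the class, the Rewritten holder and the method names are hypothetical, the mapping + to .? is taken from KorAP/Koral#13 as quoted in this thread, and the mappings ? to . and * to .* are assumptions based on the expansions reported here. Escaping every other non-alphanumeric character with a backslash is a simplification.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not Koral's implementation.
class C2WildcardRewrite {

    // Hypothetical holder for the KoralQuery type and the pattern string.
    static final class Rewritten {
        final String type;
        final String pattern;

        Rewritten(String type, String pattern) {
            this.type = type;
            this.pattern = pattern;
        }
    }

    // Translates a C2 wildcard string into a regex string:
    // ? -> .   * -> .*   + -> .?   all other characters match literally.
    static String toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '?': sb.append('.'); break;
                case '*': sb.append(".*"); break;
                case '+': sb.append(".?"); break;
                default:
                    // Escape anything that could be a regex metacharacter.
                    if (!Character.isLetterOrDigit(c)) sb.append('\\');
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    // Keeps type:wildcard when only ? and * occur; switches to type:regex
    // (rewriting all wildcard symbols) as soon as a + is present.
    static Rewritten rewrite(String wildcard) {
        if (wildcard.indexOf('+') >= 0) {
            return new Rewritten("type:regex", toRegex(wildcard));
        }
        return new Rewritten("type:wildcard", wildcard);
    }

    public static void main(String[] args) {
        Rewritten r = rewrite("meine+");
        System.out.println(r.type + " " + r.pattern);
        // The rewritten pattern accepts the bare form "meine" as well.
        System.out.println(Pattern.matches(r.pattern, "meine"));
    }
}
```

Note that the other wildcard symbols in the string are rewritten too once a + is found, so a mixed pattern such as me?ne+ ends up as a single regex instead of a mix of both syntaxes.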
> E.g. meine+ should be rewritten into meine.? according to KorAP/Koral#13 and thus meine should be included in the corresponding spanOr.
Of course it needs to have the correct type then.
I have fixed the test suite accordingly in https://github.com/KorAP/Krill/commit/8e504ce1413dc2caa67f32c9519b8ff5075e7280 .
P.S. These tests should preferably be in the non-sample-index part.
Can this issue be closed?
Well, I guess the problem is solved, but I am still not clear about the difference between wildcard and regex, and why we need to differentiate them in the implementation. The way I see the test you fixed, RegexpQuery(new Term("tokens", "s:meine.*")) is practically the same as WildcardQuery(new Term("tokens", "s:meine*")), so every WildcardQuery can be written as a RegexpQuery, can't it?
What is p in RegexpQuery btw?
Yes, every WildCard can be rewritten to a Regex, and internally it is represented similarly (as an FSA in Lucene). We could remove Wildcards from KoralQuery support altogether - it wouldn't make a real difference.
> What is p in RegexpQuery btw?
What do you mean?
> Yes, every WildCard can be rewritten to a Regex, and internally it is represented similarly (as an FSA in Lucene). We could remove Wildcards from KoralQuery support altogether - it wouldn't make a real difference.
ic, I think that is nicer for KQ.
> What do you mean?
Regex"p"Query
well i guess it's just an abbreviation from expression. Lucene calls it regexp instead of regex ;)
> ic, I think that is nicer for KQ.
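There may be some disadvantages though. E.g. we don't define Regex in KQ, but we define Wildcards, and implementations for Wildcards may be implemented more efficiently than Regexes.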
> Regex"p"Query
Regex and Regexp are both common abbreviations for Regular Expressions.
> we don't define Regex in KQ, but we define Wildcards
what do you mean when we have type:regex and Wildcard + has to be rewritten into regex?
> implementations for Wildcards may be implemented more efficiently than Regexes.
can be adjusted in the deserialization if it is really necessary
I mean: We use the Lucene subset of Regex and we don't force any specific syntax for Regexes in KQ. It's pretty much not standardized and up to the implementers. For Wildcards we define ? and *. So an implementer may decide not to support Regex at all in the backend, or roll out its own syntax for Regex, while supporting Wildcards means it needs to be implemented the way it's specified in KQ.
Ic, so Wildcards is default (priority higher) and regex is supplementary.
Nevertheless, the KQ implementation would not be straightforward in the case of mixed wildcards, e.g. m+n. By default (Wildcards), + would be treated as a normal character, but we would rather treat it as a regex and rewrite it to m.?n. So it seems that Wildcards are less useful than regex. Which QL would only need wildcards without regex? Even for C2 we need both.
I am closing this issue and continue the thread in https://github.com/KorAP/Koral/issues/13 .
Franck Bodmer reported that the following queries do not yield any results: