jussiarpalahti closed this issue 2 years ago
Thanks for the issue! I'm looking into it and have several leads that should help me resolve it:
There is an Elasticsearch plugin, which means that Voikko is usable in Java land.
There is the library on which the ES plugin is based, so lucene-grep could leverage it and add an additional token filter.
I'll investigate further.
It works :tada:
If you, @jussiarpalahti, are interested, you could build the lucene-grep binary yourself by running:
(export LMGREP_FEATURE_RAUDIKKO=true && bb generate-reflection-config && make build)
echo "kauppias kauppiaan" | \
./lmgrep \
--only-analyze \
--analysis='
{
"tokenizer": {"name": "standard"},
"token-filters": [
{"name": "raudikko"}
]
}
'
["kauppias","Kauppi","kauppias"]
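For scripting the same check, the invocation above could be wrapped roughly as follows. This is only a sketch: the helper names `build_analysis`, `lmgrep_argv` and `analyze` are hypothetical, the only lmgrep details used are the `--only-analyze` and `--analysis` flags and the JSON shape shown above, and it assumes that an `lmgrep` binary sits in the current directory and prints one JSON array per input line.

```python
import json
import subprocess


def build_analysis(token_filters, tokenizer=None):
    """Return the JSON string for --analysis (same shape as the command above)."""
    conf = {}
    if tokenizer is not None:
        conf["tokenizer"] = {"name": tokenizer}
    conf["token-filters"] = [{"name": name} for name in token_filters]
    return json.dumps(conf)


def lmgrep_argv(analysis, binary="./lmgrep"):
    """Build the argument list; the binary path is an assumption."""
    return [binary, "--only-analyze", f"--analysis={analysis}"]


def analyze(text, analysis, binary="./lmgrep"):
    """Pipe text to lmgrep on stdin and parse each output line as JSON.

    Assumes one JSON array of tokens per line, as in the output shown above.
    """
    proc = subprocess.run(
        lmgrep_argv(analysis, binary),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]


# Usage, once the binary has been built:
#   conf = build_analysis(["raudikko"], tokenizer="standard")
#   print(analyze("kauppias kauppiaan\n", conf))
```

Passing the arguments as a list avoids shell quoting problems with the embedded JSON.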
Hi @dainiusjocas
Thank you for getting back to this!
Unfortunately I haven't been able to get a working setup where lucene-grep builds. On Ubuntu, after installing Clojure, Babashka and GraalVM, I ran into an obscure header problem during compilation. Using a container build with the Dockerfile of this repository I can get lucene-grep built, but bb generate-reflection-config fails because git is missing. I have no familiarity with Oracle Linux (which the container uses), but it seems to lack both the yum and dnf package managers. I guess I could try to find all the necessary rpm files and install them directly, since rpm does exist there.
Could you perhaps describe how your development system is set up?
Damn, I'll fix the docker container. Would docker work for you?
Yeah, Docker would work very well. I just couldn't figure out how Oracle Linux works in there.
@jussiarpalahti Fixed the Docker build with this PR: https://github.com/dainiusjocas/lucene-grep/pull/129
Fetch the source from the latest main branch, then run this:
make build-linux-static-musl-with-docker
This command will produce a file named lmgrep.
Then run this:
echo "kauppias foo kauppiaan" | \
./lmgrep \
--only-analyze \
--analysis='
{
"token-filters": [
{"name": "raudikko"}
]
}
'
The output should be:
["kauppias","foo","Kauppi","kauppias"]
Let me know how it goes.
Hi @dainiusjocas
It works! Thank you for your work 🙇
Hi.
As per https://news.ycombinator.com/item?id=26931774, here is my note about Finnish stemming in lucene-grep.
Let me first say that Finnish support in a tool like this would probably be of value only to the small percentage of the Finnish-speaking population that would use this tool. Which probably means only me :) So consider this issue as information on my very particular use case and not a request for implementation.
That said, I did test using the English stemmer as well. I seem to recall that Lucene can do men -> man and mice -> mouse with some analyser combinations. I don't have Lucene or Solr/Elasticsearch at the moment to test with. If possible, an extended English stemmer would be nice to have.
English test as in readme:
Finnish with lucene-grep:
Results from the Finnish Voikko analyser are below. I don't have Voikko configured with Lucene/Solr/Elasticsearch at the moment, so here I'm using Python with libvoikko. Additional platforms, like Solr and Elasticsearch, can be found here: https://github.com/voikko/corevoikko/wiki
BASEFORM is what Voikko provides for stemming. Voikko gets both correct, whereas Snowball gives different base words, one of which, kaupia, doesn't exist. This has been the case for years in Lucene, if I remember correctly. Snowball does get another word right, auton -> auto, and search works with both base and stemmed words. For some reason I couldn't get kauppias to match its Snowball stem with lucene-grep.
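A minimal sketch of how such a BASEFORM lookup could look with the libvoikko Python bindings. It assumes libvoikko and a Finnish dictionary are installed; the helper names `base_forms` and `voikko_analyzer` are hypothetical and not part of libvoikko, which only supplies `Voikko` and its `analyze` method here.

```python
def base_forms(word, analyze):
    """Return the distinct BASEFORM values among a word's analyses, in order.

    `analyze` is any callable returning a list of dicts per word, which is
    the shape libvoikko's Voikko.analyze produces.
    """
    seen = []
    for analysis in analyze(word):
        base = analysis.get("BASEFORM")
        if base and base not in seen:
            seen.append(base)
    return seen


def voikko_analyzer(language="fi"):
    """Return libvoikko's analyze method for the given language."""
    # Imported lazily so base_forms can be used and tested without libvoikko.
    from libvoikko import Voikko
    return Voikko(language).analyze


# Usage, assuming libvoikko is installed:
#   analyze = voikko_analyzer()
#   for word in ("kauppias", "kauppiaan"):
#       print(word, base_forms(word, analyze))
```

A word can have more than one analysis, so the helper returns a list rather than a single stem.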
Voikko is originally C++ code. I don't know what Graal's story is with supporting libraries that aren't purely Java.