dainiusjocas / lucene-grep

Grep-like utility based on Lucene Monitor compiled with GraalVM native-image
Apache License 2.0
Extending analyzers with eg. Finnish stemmer Voikko #84

Closed jussiarpalahti closed 2 years ago

jussiarpalahti commented 3 years ago

Hi.

As promised in https://news.ycombinator.com/item?id=26931774, here is my note about Finnish stemming in lucene-grep.

Let me first say that Finnish support in a tool like this would probably be of value to only that small percentage of Finnish speaking population that would use this tool. Which probably means only me :) So consider this issue as information on my very particular use case and not a request for implementation.

That said, I did test using English stemmer as well. I seem to recall that Lucene can do men -> man and mice -> mouse with some analyser combinations. I don't have Lucene nor Solr/Elasticsearch at the moment to test with. If possible an extended English stemmer could be nice to have.

English test as in readme:

❯ echo "dogs and cats" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "English"}}'
["dog","cat"]
❯ echo "men and mice" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "English"}}'
["men","mice"]

Finnish with lucene-grep

❯ echo "kauppias" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "Finnish"}}'
["kauppias"]
❯ echo "kauppiaan" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "Finnish"}}'
["kaupia"]
❯ echo "kauppiaan" | ./lmgrep --analysis='{"analyzer": {"name": "Finnish"}}' kauppias
❯ echo "kauppiaan" | ./lmgrep --analysis='{"analyzer": {"name": "Finnish"}}' kaupia
❯ echo "kauppiaan" | ./lmgrep --analysis='{"analyzer": {"name": "Finnish"}}' kauppiaan
*STDIN*:1:kauppiaan
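
The empty results above are consistent with the query term being run through the same analyzer as the text, so that a match requires equal stems. Below is a minimal Python sketch of that idea. It assumes lmgrep analyzes the query as well, which this thread does not confirm. The `SNOWBALL_STEMS` table and the `matches` helper are hypothetical: the table holds only the two stems printed above and is not the Snowball algorithm.

```python
# Stems copied from the --only-analyze output above; this is NOT the real
# Snowball stemmer, only a lookup table used for illustration.
SNOWBALL_STEMS = {
    "kauppias": "kauppias",
    "kauppiaan": "kaupia",
}


def matches(query: str, text: str, stems: dict) -> bool:
    """Return True when every stemmed query token occurs among the stemmed text tokens.

    Assumption: the query is analyzed with the same analyzer as the text.
    Words missing from the table raise KeyError on purpose, because their
    stems are not shown anywhere in this thread.
    """
    query_tokens = {stems[word] for word in query.lower().split()}
    text_tokens = {stems[word] for word in text.lower().split()}
    return query_tokens <= text_tokens


print(matches("kauppias", "kauppiaan", SNOWBALL_STEMS))   # False: "kauppias" != "kaupia"
print(matches("kauppiaan", "kauppiaan", SNOWBALL_STEMS))  # True: both stem to "kaupia"
```

The query kaupia is left out deliberately, since its stem is not shown in the output above.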

Results from the Finnish Voikko analyser are below. I don't have Voikko configured with Lucene/Solr/Elasticsearch at the moment, so here I'm using Python with libvoikko. Integrations for other platforms, like Solr and Elasticsearch, can be found here: https://github.com/voikko/corevoikko/wiki

Voikko("fi").analyze("kauppiaan")
{'BASEFORM': 'kauppias',
  'CLASS': 'nimisana',
  'FSTOUTPUT': '[Ln][Xp]kauppias[X]kauppiaa[Sg][Ny]n',
  'NUMBER': 'singular',
  'SIJAMUOTO': 'omanto',
  'STRUCTURE': '=ppppppppp',
  'WORDBASES': '+kauppias(kauppias)'},

Voikko("fi").analyze("kauppias")
[{'BASEFORM': 'kauppias',
  'CLASS': 'nimisana',
  'FSTOUTPUT': '[Ln][Xp]kauppias[X]kauppia[Sn][Ny]s',
  'NUMBER': 'singular',
  'SIJAMUOTO': 'nimento',
  'STRUCTURE': '=pppppppp',
  'WORDBASES': '+kauppias(kauppias)'}]

BASEFORM is what Voikko provides for stemming. Voikko gets both correct, whereas Snowball gives different base words, of which one, kaupia, doesn't exist. Though this has been the case for years in Lucene if I remember correctly. Snowball does get another word, auton -> auto, correct and search works with base and stemmed words. For some reason I couldn't get kauppias to match to its Snowball stem with lucene-grep.
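
To illustrate how BASEFORM could serve as the stem, here is a small Python sketch that collects the base forms from analysis dictionaries shaped like the output above. The `base_forms` helper is hypothetical, and the sample data is abbreviated from this thread so that the snippet runs without libvoikko. With libvoikko installed, the result of `Voikko("fi").analyze(word)` could presumably be passed in instead.

```python
def base_forms(analyses):
    """Collect the distinct BASEFORM values from Voikko-style analysis dicts, in order."""
    seen = []
    for analysis in analyses:
        form = analysis.get("BASEFORM")
        if form is not None and form not in seen:
            seen.append(form)
    return seen


# Abbreviated from the analyses shown above; only the fields used here are kept.
kauppiaan = [{"BASEFORM": "kauppias", "SIJAMUOTO": "omanto"}]
kauppias = [{"BASEFORM": "kauppias", "SIJAMUOTO": "nimento"}]

print(base_forms(kauppiaan))                          # ['kauppias']
print(base_forms(kauppiaan) == base_forms(kauppias))  # True: both forms share one base form
```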

Voikko is originally C++ code. I don't know what Graal's story is with supporting libraries that aren't purely Java.

dainiusjocas commented 2 years ago

Thanks for the issue! I'm looking into it and I have several leads that would help me to resolve it.

There is an Elasticsearch plugin, which means that Voikko is usable in Java land.

There is also the library on which the ES plugin is based, so lucene-grep could leverage it and add an additional token filter.

I'll investigate it further.

dainiusjocas commented 2 years ago

It works :tada:

If you are interested, @jussiarpalahti, you could build the lucene-grep binary yourself by running:

(export LMGREP_FEATURE_RAUDIKKO=true && bb generate-reflection-config && make build)
echo "kauppias kauppiaan" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "tokenizer": {"name": "standard"},
    "token-filters": [
      {"name": "raudikko"}
    ]
  }
  '
["kauppias","Kauppi","kauppias"]

jussiarpalahti commented 2 years ago

Hi @dainiusjocas

Thank you for getting back to this!

Unfortunately, I haven't been able to get a working setup where lucene-grep builds. On Ubuntu, after installing Clojure, Babashka and GraalVM, I ran into an obscure header problem during compilation. Using a container build with the Dockerfile of this repository I can get lucene-grep built, but bb generate-reflection-config fails because git is missing. I have no familiarity with Oracle Linux (which the container uses), but it seems to lack both the yum and dnf package managers. I guess I could try to find all the necessary rpm files and install them directly, since rpm does exist there.

Could you perhaps describe how your development system is set up?

dainiusjocas commented 2 years ago

Damn, I'll fix the Docker container. Would Docker work for you?

jussiarpalahti commented 2 years ago

Yeah, Docker would work very well. I just couldn't figure out how Oracle Linux works in there.

dainiusjocas commented 2 years ago

@jussiarpalahti Fixed the Docker build with this PR https://github.com/dainiusjocas/lucene-grep/pull/129

Fetch the source from the latest main branch, then run this:

make build-linux-static-musl-with-docker

This command will produce a file named lmgrep.

Then run this:

echo "kauppias foo kauppiaan" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "token-filters": [
      {"name": "raudikko"}
    ]
  }
  '

The output should be:

["kauppias","foo","Kauppi","kauppias"]
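
Judging by the two outputs in this thread, the filter appears to emit every base form it finds for a token and to pass unknown tokens such as foo through unchanged. Here is a minimal Python sketch of that behaviour. The `BASE_FORMS` table and the `lemmatize_tokens` helper are hypothetical stand-ins, with entries inferred from the outputs here rather than taken from Raudikko itself.

```python
# Hypothetical lookup table standing in for the raudikko token filter.
# The entries are inferred from the outputs in this thread.
BASE_FORMS = {
    "kauppias": ["kauppias"],
    "kauppiaan": ["Kauppi", "kauppias"],
}


def lemmatize_tokens(tokens, base_forms):
    """Replace each token with all of its base forms; keep unknown tokens as they are."""
    out = []
    for token in tokens:
        out.extend(base_forms.get(token, [token]))
    return out


print(lemmatize_tokens("kauppias foo kauppiaan".split(), BASE_FORMS))
# ['kauppias', 'foo', 'Kauppi', 'kauppias']
```

Because both kauppias and kauppiaan yield the token kauppias in this sketch, a query for either form would find the other.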

Let me know how it goes.

jussiarpalahti commented 2 years ago

Hi @dainiusjocas

It works! Thank you for your work 🙇