danielnaber / jwordsplitter

small Java library for splitting German compound words

Adds support to collect all subwords of a given word #8

Closed Tobulus closed 7 years ago

Tobulus commented 7 years ago

Motivation:

In some cases it would be nice to collect all subwords that exist in the dictionary, not just one valid split (the longest). One example is using jwordsplitter to index data, e.g. in Elasticsearch/Lucene/Solr, when you want to use fuzzy queries. Say I want to index a document containing the word "hammerschlagbohrer". Currently I would index "hammer", "schlag" and "bohrer". That works fine with a search interface and a user who never enters misspelled words. But if a user searches for "Schlagborer" (missing an "h"), we cannot easily split the input word. To work around this, we could also index "schlagbohrer" and rely on the fuzzy matching provided by Lucene.
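To make the idea concrete, here is a minimal sketch of collecting every dictionary word contained in a compound. It is not jwordsplitter's implementation or API: the class `SubwordIndexSketch`, the method `collectSubwords`, and the four-word toy dictionary are all made up for illustration, and the naive substring scan would also pick up dictionary words that do not line up with a valid split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: NOT jwordsplitter's implementation or API.
class SubwordIndexSketch {

    // Hypothetical toy dictionary; a real one holds many thousands of entries.
    static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("hammer", "schlag", "bohrer", "schlagbohrer"));

    // Returns every dictionary word that occurs as a contiguous substring of
    // the input, ordered by start position and then by length.
    static List<String> collectSubwords(String word) {
        String lower = word.toLowerCase(Locale.GERMAN);
        List<String> result = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + 1; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (DICTIONARY.contains(candidate) && !result.contains(candidate)) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [hammer, schlag, schlagbohrer, bohrer]
        System.out.println(collectSubwords("Hammerschlagbohrer"));
    }
}
```

With "schlagbohrer" among the indexed terms, the misspelled query "Schlagborer" is a single edit away from an indexed term, so a fuzzy query can match it without splitting the query first.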

If you agree with the idea/implementation, I could add more regression tests for the new component. :)

danielnaber commented 7 years ago

Thanks, looks good. More tests and a pull request that doesn't introduce formatting changes would be nice.

Tobulus commented 7 years ago

I've added some code so that getSubwords() always tries to split a subword into smaller subwords. I came across this case while adding test cases, see "Sauerstoffflasche".
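The recursive behaviour described above could look roughly like the sketch below. It is an assumption about the general technique, not the merged code: `RecursiveSubwordSketch`, `split`, `collect`, `subwordsOf`, and the toy dictionary are hypothetical, and German linking elements (such as a joining "s") are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: NOT the code merged into jwordsplitter.
class RecursiveSubwordSketch {

    // Hypothetical toy dictionary.
    static final Set<String> DICTIONARY =
            new HashSet<>(Arrays.asList("sauerstoff", "sauer", "stoff", "flasche"));

    // Splits the word into dictionary words, trying the longest prefix first
    // and backtracking if the remainder cannot be split. maxPrefix caps the
    // length of the first part. Returns an empty list if no split exists.
    static List<String> split(String word, int maxPrefix) {
        for (int end = Math.min(maxPrefix, word.length()); end >= 1; end--) {
            String head = word.substring(0, end);
            if (!DICTIONARY.contains(head)) {
                continue;
            }
            if (end == word.length()) {
                return new ArrayList<>(Arrays.asList(head));
            }
            String tail = word.substring(end);
            List<String> rest = split(tail, tail.length());
            if (!rest.isEmpty()) {
                List<String> parts = new ArrayList<>();
                parts.add(head);
                parts.addAll(rest);
                return parts;
            }
        }
        return new ArrayList<>();
    }

    // Adds each part of the split, then tries to break that part into
    // strictly smaller pieces by forbidding the part itself as a prefix.
    static void collect(String word, int maxPrefix, Set<String> out) {
        for (String part : split(word, maxPrefix)) {
            out.add(part);
            collect(part, part.length() - 1, out);
        }
    }

    static List<String> subwordsOf(String word) {
        String lower = word.toLowerCase(Locale.GERMAN);
        Set<String> out = new LinkedHashSet<>();
        collect(lower, lower.length(), out);
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        // prints [sauerstoff, sauer, stoff, flasche]
        System.out.println(subwordsOf("Sauerstoffflasche"));
    }
}
```

A longest-match split alone would stop at "sauerstoff" and "flasche"; the recursive step is what adds "sauer" and "stoff".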

Could you increase the snapshot version?

danielnaber commented 7 years ago

Thanks, I've merged your changes.

> Could you increase the snapshot version?

I'm not sure why that would help?

Tobulus commented 7 years ago

It would be nice if the changes could be easily integrated using Maven. When do you plan to release version 4.2?

Tobulus commented 7 years ago

Actually I'm looking into some other stuff as well, so more pull requests in the near future are possible. So no stress with releasing. :)