Closed Tobulus closed 7 years ago
Thanks, looks good. More tests and a pull request that doesn't introduce formatting changes would be nice.
I've added some code so that getSubwords() always tries to split a subword into smaller subwords. This case came up while adding test cases, see "Sauerstoffflasche".
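To make the recursive splitting described above concrete, here is a minimal sketch. It is not the code from this pull request: the class `RecursiveSplitSketch`, the methods `splitFully` and `refine`, the minimum part length of 3 and the tiny dictionary are all assumptions made for illustration, and the first-pass split of "Sauerstoffflasche" into "Sauerstoff" and "flasche" is assumed as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not part of jwordsplitter's API.
class RecursiveSplitSketch {

    // Assumed lower bound so that very short fragments are never treated as words.
    static final int MIN_PART_LENGTH = 3;

    // Tries every split position from left to right. If both halves are in the
    // dictionary, each half is split again recursively. If no position works,
    // the (lower-cased) word is returned unchanged.
    static List<String> splitFully(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (int i = MIN_PART_LENGTH; i <= lower.length() - MIN_PART_LENGTH; i++) {
            String left = lower.substring(0, i);
            String right = lower.substring(i);
            if (dictionary.contains(left) && dictionary.contains(right)) {
                List<String> parts = new ArrayList<>(splitFully(left, dictionary));
                parts.addAll(splitFully(right, dictionary));
                return parts;
            }
        }
        return Collections.singletonList(lower);
    }

    // Applies splitFully to every subword of an existing split.
    static List<String> refine(List<String> subwords, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (String subword : subwords) {
            result.addAll(splitFully(subword, dictionary));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for this example only.
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("sauerstoff", "sauer", "stoff", "flasche"));
        // Assumed first-pass split of "Sauerstoffflasche".
        List<String> firstPass = Arrays.asList("Sauerstoff", "flasche");
        System.out.println(refine(firstPass, dictionary)); // prints [sauer, stoff, flasche]
    }
}
```

The sketch only accepts a split when both halves are dictionary words, which is enough to show how "sauerstoff" gets broken down further while "flasche" stays whole.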
Thanks, I've merged your changes.
Could you increase the snapshot version?
I'm not sure why that would help?
It would be nice if the changes could be easily integrated using Maven. When do you plan to release version 4.2?
Actually I'm looking into some other stuff as well, so more pull requests in the near future are possible. So no stress with releasing. :)
Motivation:
In some cases it would be nice to collect all subwords that exist in the dictionary, not just one valid split (the longest). One example is using jwordsplitter to index data, e.g. in Elasticsearch/Lucene/Solr, when you want to use fuzzy queries. Say I want to index a document containing the word "hammerschlagbohrer". Currently I would index "hammer", "schlag" and "bohrer". That works fine with a search interface and a user who never enters misspelled words. But if a user searches for "Schlagborer" (missing an "h"), we cannot easily split the input word. To work around this, we could also index "schlagbohrer" and rely on the fuzzy matching provided by Lucene.
If you agree with the idea/implementation, I could add more regression tests for the new component. :)
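As a rough illustration of the motivation above, the following sketch collects every substring of a compound that is itself a dictionary word, so that "schlagbohrer" would be indexed alongside "hammer", "schlag" and "bohrer". It is a brute-force simplification and not the implementation proposed here: the class `AllSubwordsSketch`, the method `collectAll`, the minimum length of 3 and the toy dictionary are assumptions, and unlike a real splitter it does not check that the substrings line up with a valid split of the whole word.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not part of jwordsplitter's API.
class AllSubwordsSketch {

    // Assumed lower bound on subword length.
    static final int MIN_LENGTH = 3;

    // Returns every substring of `word` (lower-cased) that is in the dictionary,
    // ordered by start position and then by length.
    static Set<String> collectAll(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start + MIN_LENGTH <= lower.length(); start++) {
            for (int end = start + MIN_LENGTH; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for this example only.
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("hammer", "schlag", "bohrer", "schlagbohrer"));
        // prints [hammer, schlag, schlagbohrer, bohrer]
        System.out.println(collectAll("Hammerschlagbohrer", dictionary));
    }
}
```

With "schlagbohrer" in the index, a fuzzy query for the misspelled "Schlagborer" has a term within a small edit distance to match against, which is the effect the motivation describes.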