
jWordSplitter 4.8-SNAPSHOT

Copyright 2004-2007 Sven Abels
Copyright 2007-2023 Daniel Naber
Source code licensed under Apache License, Version 2.0 (see below)

This Java library splits German compound words into smaller parts. For example, "Erhebungsfehler" is split into "Erhebung" and "fehler". It is designed mainly for German, but it can work with any language as long as a dictionary and a class extending AbstractWordSplitter are provided. So far, only German is supported, and a German dictionary is included in the JAR. It works best for nouns, although it also handles some adjectives (e.g. "knallgelb" -> knall + gelb) and verbs (e.g. "zurückrudern" -> zurück + rudern).
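To illustrate the general idea of dictionary-based compound splitting, here is a minimal sketch in plain Java. It is not the algorithm jWordSplitter uses: the class name ToyCompoundSplitter, the minimum part length of 3, and the handling of a single linking "s" are all assumptions made for this example, and the tiny word set stands in for a real dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy dictionary-based splitter; hypothetical, not part of jWordSplitter. */
final class ToyCompoundSplitter {

    // Assumption for this sketch: parts shorter than this are never accepted.
    private static final int MIN_PART_LENGTH = 3;

    private final Set<String> dictionary;

    /** @param lowercaseWords known words, all in lowercase */
    ToyCompoundSplitter(Set<String> lowercaseWords) {
        this.dictionary = new HashSet<>(lowercaseWords);
    }

    /** Returns the parts of the word, or the word itself if no split is found. */
    List<String> split(String word) {
        List<String> parts = parse(word);
        return parts == null ? Collections.singletonList(word) : parts;
    }

    /** Covers 'rest' completely with dictionary words, or returns null. */
    private List<String> parse(String rest) {
        if (rest.isEmpty()) {
            return new ArrayList<>();
        }
        // Try the longest possible head first.
        for (int end = rest.length(); end >= MIN_PART_LENGTH; end--) {
            String head = rest.substring(0, end);
            if (!dictionary.contains(head.toLowerCase(Locale.ROOT))) {
                continue;
            }
            // First try the tail as-is, then skip a linking "s" if there is one.
            List<String> tail = parse(rest.substring(end));
            if (tail == null && end < rest.length() - 1 && rest.charAt(end) == 's') {
                tail = parse(rest.substring(end + 1));
            }
            if (tail != null) {
                List<String> parts = new ArrayList<>();
                parts.add(head);
                parts.addAll(tail);
                return parts;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("erhebung", "fehler", "knall", "gelb"));
        ToyCompoundSplitter splitter = new ToyCompoundSplitter(words);
        System.out.println(splitter.split("Erhebungsfehler")); // prints: [Erhebung, fehler]
        System.out.println(splitter.split("knallgelb"));       // prints: [knall, gelb]
    }
}
```

Because the longest head is tried first, a word that is itself in the dictionary comes back unsplit. A production splitter needs a much larger dictionary and more careful handling of linking elements and exceptions.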

Alternatives to this library might be compound-splitter or Lucene's DictionaryCompoundWordTokenFilter. You might also be interested in this German morphology dictionary.

Usage from Java

With Maven, use this dependency:

<dependency>
    <groupId>de.danielnaber</groupId>
    <artifactId>jwordsplitter</artifactId>
    <version>4.7</version>
</dependency>

Example usage:

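import de.danielnaber.jwordsplitter.*;
import java.util.List;

AbstractWordSplitter splitter = new GermanWordSplitter(true);  // true: hide interfix characters such as the linking "s"
List<String> parts = splitter.splitWord("Versuchsreihe");
System.out.println(parts);    // prints: [Versuch, reihe]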

Usage from command line

To split a list of words (one word per line), use this command:

java -jar jwordsplitter-x.y.jar <filename>
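If you would rather process such a word list from your own Java code, the following sketch shows one way to do it. It is an illustration only and does not reproduce what the command-line tool does internally: the class WordListSplitter, the method splitFile, the UTF-8 encoding, and the "word -> part + part" output format are all assumptions. The splitting function is passed in as a parameter, so the sketch has no dependency on the library itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper for splitting a one-word-per-line file; not part of jWordSplitter. */
final class WordListSplitter {

    /**
     * Reads the file (assumed to be UTF-8), skips blank lines, and returns
     * one formatted line per word, e.g. "knallgelb -> knall + gelb".
     */
    static List<String> splitFile(Path wordList, Function<String, List<String>> splitWord)
            throws IOException {
        List<String> result = new ArrayList<>();
        for (String line : Files.readAllLines(wordList, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (word.isEmpty()) {
                continue; // ignore empty lines in the input
            }
            result.add(word + " -> " + String.join(" + ", splitWord.apply(word)));
        }
        return result;
    }
}
```

With the library on the classpath, the splitter from the Java example could be passed in as `splitter::splitWord`.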

Data location

To access the German dictionary from the JAR file, unzip the JAR. The dictionary is at de/danielnaber/jwordsplitter/wordsGerman.txt.

Notes about the algorithm

Building

Use build.sh to create the dictionary from the text files in resources.

Changelog

See CHANGES.md. If you need the old project history (for example to access tags that got lost when moving to git), check it out from SVN at https://sourceforge.net/p/jwordsplitter/code/HEAD/tree/

License

The source code of this project is licensed under the Apache License, Version 2.0. The integrated dictionary (wordsGerman.txt) is a subset of Morphy with additions from LanguageTool and is licensed under Creative Commons Attribution-ShareAlike 4.0.