danielnaber / jwordsplitter

small Java library for splitting German compound words

All words in main dictionary considered non-compound(?) #20

Closed Tedwinder closed 6 years ago

Tedwinder commented 6 years ago

This may or may not be an issue, but I'm trying to build a dictionary of German compound words for a machine learning project. My thought was that if I ran the program on the entire included dictionary, I'd get a large list of compound words, which I could detect by the presence of commas in the output. But it doesn't split any of them, even though many are obviously compound words (e.g. "abfallwirtschaftsamt"). Is this intentional, i.e. because words in the source dictionary are by definition not considered compound? If so, perhaps I could create a large list by running the program on an even larger dictionary.

danielnaber commented 6 years ago

It's more or less intentional, and I'm aware that it's not always correct for all use cases. But see https://github.com/danielnaber/jwordsplitter/blob/master/CHANGES.md#2017-09-10-42 for a new method that will get shorter words (getSubWords()).
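To illustrate the behaviour discussed above, here is a minimal, self-contained sketch. It is not jwordsplitter's code or API: the class `ToySplitter`, its four-word dictionary, and the methods `split` and `subWords` are all hypothetical, and the handling of the interfix "s" is a simplification. It only shows why a splitter that treats every dictionary entry as a known word returns such entries unsplit, and how a sub-word style lookup could still break them up by ignoring the entry for the whole word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration only; NOT the jwordsplitter implementation or API.
class ToySplitter {
    // Hypothetical mini dictionary; it contains the compound itself,
    // just like the included dictionary mentioned in the question.
    static final Set<String> DICT =
            Set.of("abfall", "wirtschaft", "amt", "abfallwirtschaftsamt");

    // A word found in the dictionary is treated as known and returned unsplit.
    static List<String> split(String word) {
        String w = word.toLowerCase();
        if (DICT.contains(w)) {
            return List.of(w);
        }
        List<String> parts = splitIntoParts(w, null);
        return parts == null ? List.of(w) : parts;
    }

    // Splits known words too, by ignoring the dictionary entry for the whole word.
    static List<String> subWords(String word) {
        String w = word.toLowerCase();
        List<String> parts = splitIntoParts(w, w);
        return parts == null ? List.of(w) : parts;
    }

    // Tries the longest dictionary prefix first, then recurses on the rest.
    // An interfix "s" between two parts is skipped (assumed simplification).
    // Returns null if the input cannot be fully covered by dictionary words.
    private static List<String> splitIntoParts(String rest, String excluded) {
        if (rest.isEmpty()) {
            return new ArrayList<>();
        }
        for (int end = rest.length(); end >= 1; end--) {
            String head = rest.substring(0, end);
            if (!DICT.contains(head) || head.equals(excluded)) {
                continue;
            }
            String tail = rest.substring(end);
            List<String> tailParts = splitIntoParts(tail, excluded);
            if (tailParts == null && tail.startsWith("s")) {
                tailParts = splitIntoParts(tail.substring(1), excluded);
            }
            if (tailParts != null) {
                List<String> result = new ArrayList<>();
                result.add(head);
                result.addAll(tailParts);
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(split("abfallwirtschaftsamt"));    // [abfallwirtschaftsamt]
        System.out.println(subWords("abfallwirtschaftsamt")); // [abfall, wirtschaft, amt]
        System.out.println(split("Abfallamt"));               // [abfall, amt]
    }
}
```

In this toy model, running `split` over the dictionary itself can never produce a split, which matches what the question observed. For the real library, the CHANGES.md entry linked above is the place to check what `getSubWords()` actually returns.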