dennlinger / summaries

A toolkit for summarization analysis and aspect-based summarizers
MIT License

Incorporate GermaNet word splits #8

Closed dennlinger closed 1 year ago

dennlinger commented 2 years ago

To compute more meaningful ROUGE scores for German, it might be necessary to split compound words and to improve lemmatization/stemming. For that purpose, there is the GermaNet list of split compounds, which has over 100,000 entries available.

The GermaNet compound list is available for academic research only, so it might make sense to first look elsewhere for alternatives, ideally ones that also permit commercial use. In particular, relying on the list would probably also prevent us from licensing this toolkit under MIT or Apache...
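
To illustrate why compound splitting matters here, below is a minimal sketch of a unigram-overlap (ROUGE-1 style) F1 score computed with and without splitting. The `TOY_SPLITS` dictionary, the whitespace tokenizer, and the function names are assumptions made for this example only; they are neither the toolkit's implementation nor an excerpt from the GermaNet list.

```python
from collections import Counter

# Hypothetical toy lookup; a real setup would load a compound list
# or call a dedicated splitter instead.
TOY_SPLITS = {
    "autobahnraststätte": ["autobahn", "raststätte"],
}


def tokenize(text, splits=None):
    """Lowercase, split on whitespace, and optionally split known compounds."""
    tokens = text.lower().split()
    if splits is None:
        return tokens
    expanded = []
    for token in tokens:
        # Unknown tokens are kept unchanged.
        expanded.extend(splits.get(token, [token]))
    return expanded


def rouge1_f1(reference, candidate, splits=None):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(tokenize(reference, splits))
    cand_counts = Counter(tokenize(candidate, splits))
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "die autobahnraststätte ist geschlossen"
candidate = "die raststätte an der autobahn ist geschlossen"

score_plain = rouge1_f1(reference, candidate)
score_split = rouge1_f1(reference, candidate, splits=TOY_SPLITS)
print(f"without splitting: {score_plain:.3f}")
print(f"with splitting:    {score_split:.3f}")
```

In this toy pair, the candidate paraphrases the compound with its parts, so the unsplit score penalizes a summary that is semantically close to the reference.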

dennlinger commented 2 years ago

An alternative approach could be this library: https://github.com/dtuggener/CharSplit. It is also licensed under MIT, which is better for us, but we would still have to check how well it works.
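
One way to check how well a splitter works would be to measure exact-match accuracy against a small set of hand-annotated splits. The sketch below assumes a splitter can be wrapped as a callable that returns a `(first, second)` pair; `naive_midpoint_splitter`, `TOY_GOLD`, and `split_accuracy` are hypothetical placeholders for illustration, and no particular CharSplit API is assumed.

```python
from typing import Callable, Iterable, Tuple

Split = Tuple[str, str]

# Hypothetical hand-annotated examples; a real evaluation would need far more.
TOY_GOLD = [
    ("Haustür", ("Haus", "Tür")),
    ("Autobahn", ("Auto", "Bahn")),
    ("Apfelbaum", ("Apfel", "Baum")),
]


def naive_midpoint_splitter(word: str) -> Split:
    """Placeholder baseline: cut the word in the middle."""
    middle = len(word) // 2
    return word[:middle], word[middle:]


def split_accuracy(
    splitter: Callable[[str], Split],
    gold: Iterable[Tuple[str, Split]],
) -> float:
    """Fraction of compounds whose predicted split matches the gold split.

    Comparison is case-insensitive, since splitters differ in how they
    capitalize the second part.
    """
    examples = list(gold)
    if not examples:
        return 0.0
    correct = 0
    for compound, (gold_first, gold_second) in examples:
        pred_first, pred_second = splitter(compound)
        if (pred_first.lower(), pred_second.lower()) == (
            gold_first.lower(),
            gold_second.lower(),
        ):
            correct += 1
    return correct / len(examples)


accuracy = split_accuracy(naive_midpoint_splitter, TOY_GOLD)
print(f"midpoint baseline accuracy: {accuracy:.2f}")
```

Any candidate library could be plugged in by wrapping its best-scoring split in a function with the same signature and comparing it against such a baseline.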