Markup hyphenated tokens

dpriskorn / odsc

Project that aims to sentenize all the open data of Riksdagen and other sources to create an easily linkable dataset of sentences that can be referred to from Wikidata lexemes and other resources.

GNU General Public License v3.0


Markup hyphenated tokens #8

Open dpriskorn opened 10 months ago

dpriskorn commented 10 months ago

The Riksdagen open data often contains hyphenated words, which end up like this in our rawtoken table:

These are mostly garbage and should be handled somehow. Maybe we can train an AI to recognize, based on lexemes, which of them are valid?
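
Before training anything, a plain lookup against known lexeme forms might already separate valid hyphenated tokens from garbage. The sketch below is only an illustration of that idea, not the project's code: `KNOWN_FORMS`, `classify_token`, and the label names are hypothetical, and the small hard-coded set stands in for forms that would presumably be loaded from Wikidata lexemes.

```python
# Hypothetical stand-in for lexeme forms; a real run would load these
# from Wikidata lexemes or a similar source (assumption).
KNOWN_FORMS = {"e-post", "riksdag", "ledamot", "tv", "program"}


def classify_token(token, known_forms=KNOWN_FORMS):
    """Label a raw token using simple lookups against known forms.

    Returns one of (hypothetical labels):
      "not_hyphenated": the token contains no hyphen
      "known":          the hyphenated token itself is a known form
      "line_break":     removing the hyphens yields a known form, so the
                        hyphen probably came from a line break
      "compound":       every hyphen-separated part is a known form
      "unknown":        none of the above; a candidate for manual review
    """
    lowered = token.lower()
    if "-" not in lowered:
        return "not_hyphenated"
    if lowered in known_forms:
        return "known"
    if lowered.replace("-", "") in known_forms:
        return "line_break"
    parts = lowered.split("-")
    # An empty part (leading or trailing hyphen) never matches a known form.
    if all(part in known_forms for part in parts):
        return "compound"
    return "unknown"


if __name__ == "__main__":
    for raw in ["E-post", "riks-dag", "tv-program", "-och", "riksdag"]:
        print(raw, classify_token(raw))
```

Tokens labeled `unknown` could then serve as the pool to inspect by hand or to feed into a trained model later, if the lookup turns out to be too coarse.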