LAHTeR / htr-quality-classifier

Detect quality of (digitized) text.
GNU General Public License v3.0

Use "INT Historische Woordenlijst" #14

Open carschno opened 7 months ago

carschno commented 7 months ago

INT provides a historical word list: https://taalmaterialen.ivdnt.org/download/tstc-int-historische-woordenlijst/

This could replace the historic dictionary we currently use, which was generated from the Iceberg project ground truth.

The list includes word frequencies. Low-frequency words may have to be removed, with the cutoff controlled by a frequency threshold parameter.
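A minimal sketch of such a frequency filter is shown below. The file layout is an assumption: the sketch expects one tab-separated `word<TAB>frequency` pair per line, which may not match the actual INT download. The function name `load_frequent_words`, the parameter `min_frequency`, and the sample counts are all hypothetical and only serve as illustration.

```python
import csv
import io
from typing import Set, TextIO


def load_frequent_words(lines: TextIO, min_frequency: int = 5) -> Set[str]:
    """Return the words whose frequency is at least min_frequency.

    Assumes (hypothetically) tab-separated lines of the form: word<TAB>frequency.
    Lines that do not fit this layout, such as a header row, are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    words: Set[str] = set()
    for row in reader:
        if len(row) < 2:
            continue  # not a word/frequency pair
        word = row[0].strip()
        try:
            frequency = int(row[1].strip())
        except ValueError:
            continue  # frequency column is not an integer
        if word and frequency >= min_frequency:
            words.add(word)
    return words


# Made-up sample data, only to show the call; not taken from the INT list.
sample = io.StringIO("ende\t120\nvanden\t45\nxqz\t1\n")
print(sorted(load_frequent_words(sample, min_frequency=5)))
```

Keeping the threshold as a parameter makes it possible to tune it against the classifier's results instead of fixing a cutoff in advance.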

The following historic word list is more recent and still being updated: https://ivdnt.org/woordenboeken/woordenboek-der-nederlandsche-taal/ It is unclear, however, whether it can be downloaded as well. Related Slack discussion