ArchivesPortalEuropeFoundation / Topic-Detection

Using machine learning approaches for automatic topic detection in a multilingual environment

Add a language detector first #92

Open fedenanni opened 2 years ago

fedenanni commented 2 years ago

First split the text into sentences, then detect the language of each sentence, then run the tool accordingly.

fedenanni commented 2 years ago

Currently the tool expects the input to be in one of the selected languages (en, de, etc.). We could add a sentence tokeniser and language detection to the tool itself, but it would be easier to do this before passing the text to the tool. So, first:

  1. split the text into sentences
  2. pass to the tool only sentences in a single language
  3. aggregate the results
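
A minimal sketch of the three steps above, to make the proposal concrete. Everything here is an assumption for illustration: the regex sentence splitter, the toy stopword-based `detect_language` (a stand-in for a real language detector), the `classify(text, lang)` callable standing in for the tool, and aggregation by simply counting the returned topic labels. None of these names come from the repository.

```python
import re
from collections import Counter, defaultdict
from typing import Callable, Dict, List

# Toy stopword lists: a placeholder for a real language detector (assumption).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to", "was"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "von", "wurde"},
}


def split_sentences(text: str) -> List[str]:
    """Step 1: naive split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def detect_language(sentence: str) -> str:
    """Guess the language by counting stopword hits; 'unknown' if none match."""
    tokens = re.findall(r"\w+", sentence.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def group_by_language(text: str) -> Dict[str, List[str]]:
    """Step 2: bucket sentences so each bucket holds a single language."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sentence in split_sentences(text):
        groups[detect_language(sentence)].append(sentence)
    return dict(groups)


def detect_topics(text: str, classify: Callable[[str, str], List[str]]) -> Counter:
    """Step 3: run the (hypothetical) tool per language and count the labels."""
    totals: Counter = Counter()
    for lang, sentences in group_by_language(text).items():
        if lang == "unknown":
            continue  # skip sentences the detector could not assign
        totals.update(classify(" ".join(sentences), lang))
    return totals
```

Called as `detect_topics(text, classify)`, where `classify` wraps whatever entry point the tool exposes for a given language. Keeping the detector as a separate function means it can be swapped for a proper library without touching the grouping or aggregation.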