HarikalarKutusu / cv-tbox-dataset-compiler

GNU Affero General Public License v3.0

[PR] Complete rework for changes in CV v17.0 #34

Closed HarikalarKutusu closed 5 months ago

HarikalarKutusu commented 5 months ago

Major Changes:

Known issues:

  1. The new validated_sentences.tsv contains bugs, and we opened several issues on GitHub for them (see 1, 2 and 3; the first one is critical). I worked around some of them in code, but not all.
  2. For earlier releases (<v17.0), sentence_ids can only be recovered by matching on the sentence text, but CorporaCreator pre-processed those sentences, so the text may differ from the original. As a result I could not recover the whole text corpus for these releases yet; I need to re-implement this in the code.
  3. Anything between v14.0 and v16.1 will of course be incomplete, as anything entered through the web interface/write is not there.
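
To illustrate the matching problem in item 2, here is a minimal sketch of looking up sentence_ids by sentence text. It is not the code in this PR. The column names `sentence_id` and `sentence`, the helper names, and the `normalize` function are assumptions for illustration; in particular, `normalize` only applies Unicode NFC and collapses whitespace, and does not reproduce CorporaCreator's pre-processing.

```python
import csv
import io
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: Unicode NFC plus whitespace collapsing.
    This is NOT CorporaCreator's actual pre-processing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def build_sentence_index(tsv_file) -> dict[str, str]:
    """Map normalized sentence text to sentence_id.
    Assumes a tab-separated file with 'sentence_id' and 'sentence' columns."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    index: dict[str, str] = {}
    for row in reader:
        # Keep the first id seen when two rows normalize to the same text.
        index.setdefault(normalize(row["sentence"]), row["sentence_id"])
    return index


def match_sentences(sentences, index):
    """Split sentences into those with a known id and those without one."""
    matched: dict[str, str] = {}
    unmatched: list[str] = []
    for sentence in sentences:
        sentence_id = index.get(normalize(sentence))
        if sentence_id is None:
            unmatched.append(sentence)
        else:
            matched[sentence] = sentence_id
    return matched, unmatched


# Tiny in-memory stand-in for validated_sentences.tsv (made-up rows).
SAMPLE_TSV = "sentence_id\tsentence\nid-1\tHello world.\nid-2\tGood  morning!\n"

index = build_sentence_index(io.StringIO(SAMPLE_TSV))
matched, unmatched = match_sentences(
    ["Hello world.", "Good morning!", "Hello, world"], index
)
```

Sentences whose text was altered beyond what the normalization covers (the third one here) end up in `unmatched`, which is the gap described in item 2.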