HarikalarKutusu / cv-tbox-dataset-compiler

GNU Affero General Public License v3.0

Feat/Add detailed validated & splits text-corpus analysis #33

Closed HarikalarKutusu closed 8 months ago

HarikalarKutusu commented 8 months ago

This intermediate PR mainly analyzes the text corpus of some of the other buckets (validated) and of the training splits (train/dev/test), per language and per version, and saves the results under each language's directory as a separate $<lc>_<ver>_tc_stats file.

It also removes the list/array-to-string encoding from the files.
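
The PR text does not show the before/after formats, so the following is only a sketch of the general idea. It assumes a hypothetical `|` separator for the old string encoding and JSON as the replacement; neither is confirmed by the PR.

```python
import json

counts = [3, 0, 7]  # made-up sample values

# Before (assumed): the list is flattened into one delimited string cell,
# so every reader has to split and re-parse it.
encoded = "|".join(str(n) for n in counts)
decoded = [int(part) for part in encoded.split("|")]

# After (assumed): the list is written as a native array and read back as-is.
record = json.dumps({"counts": counts})
restored = json.loads(record)["counts"]
```

With a native array, consumers no longer need to know the separator or the element type to recover the values.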

HarikalarKutusu commented 6 months ago

This closes #25