[x] [DONE in Analyzer v0.13.0] The $text_corpus_stats.* data is rather large, so it needs to be divided by language. The analyzer only works per language-version of the dataset anyway.
[x] [DONE in Analyzer v0.13.0] Although new text corpus data is not exported to the git repo, older versions are still in its history and can be retrieved with git tools. So we can get the data at a specific commit and analyze it. This way we can see how it changes over time.
[x] [DONE in Analyzer v0.13.0] Add more analysis
[x] Grapheme distribution
[x] Phoneme distribution (if supported)
[x] Analyze text corpus usage in the buckets/splits with respect to the text corpus extracted above
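As an illustration of the grapheme distribution item, here is a minimal sketch that counts graphemes over a list of sentences. It is not the analyzer's actual code: the function names are hypothetical, and the grapheme splitting is a rough approximation that only attaches combining marks to the preceding character (it does not implement full Unicode grapheme cluster segmentation).

```python
import unicodedata
from collections import Counter


def split_graphemes(text: str) -> list[str]:
    """Approximate grapheme split: glue combining marks onto the previous character."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # combining mark belongs to the preceding base character
        else:
            clusters.append(ch)
    return clusters


def grapheme_distribution(sentences: list[str]) -> Counter:
    """Count non-whitespace graphemes over all sentences (hypothetical helper)."""
    counts: Counter = Counter()
    for sentence in sentences:
        # NFC normalization so precomposed and decomposed forms are counted together
        normalized = unicodedata.normalize("NFC", sentence)
        counts.update(g for g in split_graphemes(normalized) if not g.isspace())
    return counts
```

The resulting Counter can be sorted with `most_common()` to get a frequency table per language.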
Fix multiple problems/needs at once:
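Reading an older text corpus file at a specific commit can be sketched roughly as below. This is only an illustration under assumptions: the function names, the repository directory, and the file path are hypothetical and not taken from the analyzer. It relies on the standard `git log` and `git show <commit>:<path>` commands and needs git to be installed.

```python
import subprocess


def build_git_show_args(commit: str, path: str) -> list[str]:
    """Build the argument list for `git show <commit>:<path>`."""
    return ["git", "show", f"{commit}:{path}"]


def commits_touching(repo_dir: str, path: str) -> list[str]:
    """Return hashes of commits that changed `path`, newest first."""
    result = subprocess.run(
        ["git", "log", "--format=%H", "--", path],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def read_file_at_commit(repo_dir: str, commit: str, path: str) -> bytes:
    """Return the raw content of `path` as it was at `commit`."""
    result = subprocess.run(
        build_git_show_args(commit, path),
        cwd=repo_dir, check=True, capture_output=True,
    )
    return result.stdout
```

Looping over `commits_touching(...)` and calling `read_file_at_commit(...)` for each hash gives one snapshot per commit, which can then be analyzed to follow the changes over time.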