Major Changes:

We now use the new validated_sentences.tsv file and its sentence_id field. We cache the whole/newest text-corpora with some pre-calculations; for previous versions we use cached files that only hold sentence_id's to reach the data.
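A minimal sketch of how such a cache can be wired up with pandas; the file names, directory layout and the parquet format are assumptions for illustration, not the project's actual implementation:

```python
import pandas as pd

# Hypothetical layout: the newest text-corpus is cached whole, with room for
# pre-calculated columns to be added to it.
corpus = pd.read_csv("validated_sentences.tsv", sep="\t", engine="python")
corpus.to_parquet("cache/text_corpus_latest.parquet")

# A previous version's cache file only keeps sentence_id's ...
old_ids = pd.read_parquet("cache/previous_version_sentence_ids.parquet")

# ... and reaches the rest of the data by joining back to the cached corpus.
old_corpus = old_ids.merge(corpus, on="sentence_id", how="left")
```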
We order the files by size, descending, which improves multi-processing utilization: the largest jobs start first, so workers are less likely to sit idle near the end of a run.
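A small sketch of the idea, assuming per-file worker jobs and a standard multiprocessing.Pool (the directory layout and the process_file function are hypothetical):

```python
from multiprocessing import Pool
from pathlib import Path

def process_file(tsv_path: Path) -> str:
    # Placeholder for the real per-file analysis work.
    return tsv_path.name

if __name__ == "__main__":
    # Largest files first, so the longest-running jobs are scheduled early.
    files = sorted(
        Path("data").glob("**/*.tsv"),
        key=lambda p: p.stat().st_size,
        reverse=True,
    )
    with Pool() as pool:
        results = pool.map(process_file, files)
```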
We started to use pyarrow dtypes and defined many new type structures for further improvements. We don't use pandas indexes yet though; that still needs some testing. We also continue to use the "python" engine, as the "pyarrow" engine is not yet ready for all our use cases.
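A rough sketch of what pyarrow-backed dtypes combined with the "python" parser engine can look like; the column set and the exact dtype map are assumptions, not the project's real definitions:

```python
import pandas as pd
import pyarrow as pa

# Assumed columns; the project defines many more such type structures.
SENTENCE_DTYPES = {
    "sentence_id": pd.ArrowDtype(pa.string()),
    "sentence": pd.ArrowDtype(pa.string()),
    "sentence_domain": pd.ArrowDtype(pa.string()),
}

df = pd.read_csv(
    "validated_sentences.tsv",
    sep="\t",
    engine="python",        # the "pyarrow" engine is not yet ready for all use cases
    dtype=SENTENCE_DTYPES,
)
# No index is set yet; e.g. df.set_index("sentence_id") still needs testing.
```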
We added simple sentence_domain statistics; we will add more as data accumulates.
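For illustration, the simplest form such a statistic could take - a per-domain sentence count over the text-corpus (column names assumed):

```python
import pandas as pd

corpus = pd.read_csv("validated_sentences.tsv", sep="\t", engine="python")

# Sentences per domain; rows without a domain get their own bucket.
domain_counts = corpus["sentence_domain"].fillna("<no domain>").value_counts()
print(domain_counts)
```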
One major change in my text-corpus analysis: previously we used all sentences as they appear in the recording buckets/splits, i.e. what is in the voice corpus. Now I use unique sentences, regardless of how many times each one is recorded - in other words, I now use the text-corpus itself. This seems more logical.
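To make the difference concrete, a hedged sketch of the two ways of counting, assuming a validated.tsv (voice corpus) and a validated_sentences.tsv (text-corpus) that both carry sentence_id:

```python
import pandas as pd

voice = pd.read_csv("validated.tsv", sep="\t", engine="python")           # recordings
text = pd.read_csv("validated_sentences.tsv", sep="\t", engine="python")  # text-corpus

# Old approach: every recording counts, so a sentence recorded N times
# contributes N to the bucket/split statistics.
recorded_count = len(voice)

# New approach: each sentence counts once, however many times it was recorded.
unique_count = text["sentence_id"].nunique()
```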
Known issues:
There are bugs in the new validated_sentences.tsv, and we opened several issues on GitHub (see 1, 2 and 3 - the first one is critical). I tried to remedy them in code to some extent, but not all of them.
For the former releases (<v17.0), we can only get sentence_id's through the sentence text, but the sentences were pre-processed by CorporaCreator, so they may have changed. As a result, I could not get the whole text-corpus for these releases for now; I need to re-implement this in the code.
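A hedged sketch of what that re-implementation would have to do: map the pre-processed sentences of an old release back to sentence_id's by matching on (roughly normalized) sentence text against the newest text-corpus. The normalization below is only a stand-in; CorporaCreator's real pre-processing differs, which is exactly why some sentences will not match.

```python
import pandas as pd

old = pd.read_csv("old_release/validated.tsv", sep="\t", engine="python")  # no sentence_id
latest = pd.read_csv("validated_sentences.tsv", sep="\t", engine="python")

def normalize(s: pd.Series) -> pd.Series:
    # Very rough stand-in for CorporaCreator's pre-processing.
    return s.str.strip().str.replace(r"\s+", " ", regex=True)

old["key"] = normalize(old["sentence"])
latest["key"] = normalize(latest["sentence"])

mapped = old.merge(
    latest.drop_duplicates("key")[["key", "sentence_id"]],
    on="key", how="left",
)
unmatched = mapped["sentence_id"].isna().sum()  # sentences changed by pre-processing
```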
And of course, anything between v14.0 and v16.1 will be incomplete (as sentences entered through the web interface/write are not there).