The main changes in this PR are in get_distributions.py (and everything that uses it). In particular:
use cached properties whenever possible; they are a convenient alternative to initialising a placeholder attribute on self for an expensive value and computing it lazily when first needed. The code is much shorter and cleaner while being just as fast
do not define individual column numbers; instead keep a list of column names and look up the position in that list whenever needed. While theoretically slightly slower, this makes the code much shorter, more readable and less error-prone
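For reviewers, a minimal sketch of the two patterns above. It assumes `functools.cached_property` (Python 3.8+); the decorator actually used in get_distributions.py may differ, and the class, the property and the column names here are hypothetical, made up for illustration only.

```python
from functools import cached_property

# Hypothetical column layout; the real names live in get_distributions.py.
COLUMNS = ["token", "lemma", "pos", "frequency"]


class Distribution:
    """Illustrative stand-in, not the class from this PR."""

    def __init__(self, rows):
        self.rows = rows

    @cached_property
    def total_frequency(self):
        # Computed on first access, then stored on the instance,
        # so there is no placeholder attribute to set up in __init__.
        idx = COLUMNS.index("frequency")  # position looked up by name
        return sum(row[idx] for row in self.rows)


dist = Distribution([("cats", "cat", "NOUN", 2), ("ran", "run", "VERB", 3)])
first = dist.total_frequency   # triggers the computation
second = dist.total_frequency  # served from the cached value
```

With this layout, adding or reordering a column only means editing `COLUMNS`; no hard-coded index has to be updated elsewhere.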
Other changes are the following:
start using numpy arrays internally in a few functions. They provide a much better interface for all sorts of calculations we might need: they are faster and come with many operations already defined
add a couple of analysis scripts. One of them requires pandas, which has not been added as a core dependency.
remove the last statistical test on the corpus; it had been kept back because it was trying to read some corrupted files and we wanted to see why they were corrupted. Simply reprocessing all the files removed the error. This closes #63
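Since pandas is not a core dependency, the script that needs it could check for it up front and fail with a clear message. This is only a sketch of one possible guard, not what the script in this PR does; the helper name `missing_dependencies` is hypothetical.

```python
import importlib.util
import sys


def missing_dependencies(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_or_exit(names):
    """Exit with a readable message if an optional dependency is missing."""
    missing = missing_dependencies(names)
    if missing:
        sys.exit("This script needs extra packages: " + ", ".join(missing))


# In the analysis script one would call check_or_exit(["pandas"])
# before the `import pandas` line.
```

This keeps the core package installable without pandas while still telling the user what to install when they run the analysis script.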