The data iterator in `data/iterators.py` currently partitions the dataset in a very naive way: it reverses the list of `.csv` files inside the dataset folder, takes the top `floor(max_count / batch_size) * batch_size` datapoints of each `.csv`, and yields them in batches.
This is incorrect for two reasons: we instead need to yield a total of `max_count` datapoints across all `.csv` files, and the training and testing sets currently overlap (they should form a partition of the entire dataset, with the exception of the occasional datapoints that can't fill a batch of size `batch_size`). The current behavior also causes the following error on validation jobs with `batch_size > 1`: `ValueError: array split does not result in an equal division`.
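A minimal sketch of the intended behavior, assuming a flat folder of `.csv` files (the function name `iterate_batches` and its signature are hypothetical, not the actual API in `data/iterators.py`): yield at most `max_count` datapoints *in total* across all files, rounded down to a multiple of `batch_size` so every yielded batch is full.

```python
import csv
import os

def iterate_batches(dataset_dir, batch_size, max_count):
    """Hypothetical sketch: yield batches of exactly batch_size rows,
    totaling at most floor(max_count / batch_size) * batch_size rows
    across ALL .csv files in dataset_dir (not per file)."""
    # Drop the tail that can't fill a complete batch.
    remaining = (max_count // batch_size) * batch_size
    batch = []
    for name in sorted(os.listdir(dataset_dir)):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(dataset_dir, name), newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the mandatory header line
            for row in reader:
                if remaining == 0:
                    return
                batch.append(row)
                remaining -= 1
                if len(batch) == batch_size:
                    yield batch
                    batch = []
```

Because the global `remaining` counter is decremented across files, a train/test split can then be built as a true partition, e.g. by handing the first `k` batches to training and the rest to testing, rather than re-reading overlapping slices of each file.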
Also, `count_datapoints()` should take into account the requirement that every `.csv` has a header line (i.e., the count could start at -1 per file), and this requirement should be enforced somewhere in the Dataset Manager (could be a separate issue/PR).
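The header-line adjustment can be sketched as follows (a hypothetical standalone version, assuming `count_datapoints()` takes a dataset folder path; the real signature may differ):

```python
import os

def count_datapoints(dataset_dir):
    """Hypothetical sketch: count datapoints across all .csv files,
    starting each file's count at -1 so the required header line
    is excluded from the total."""
    total = 0
    for name in os.listdir(dataset_dir):
        if not name.endswith(".csv"):
            continue
        with open(os.path.join(dataset_dir, name)) as f:
            # -1 accounts for the mandatory header line.
            total += sum(1 for _ in f) - 1
    return total
```

Enforcing the "every `.csv` starts with a header" invariant (e.g. at ingestion time in the Dataset Manager) would keep this count from silently going negative on a malformed file.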