Closed: cjdoris closed this 3 years ago
The column names and types are no longer part of the `ARFFReader` type, and rows are now of type `ARFFRow` instead of `NamedTuple`. This puts vastly less strain on the compiler.
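To illustrate the idea (this is a hypothetical sketch, not the package's actual `ARFFRow` definition): a `NamedTuple` row bakes the column names and types into its Julia type, so every distinct schema forces fresh compilation, whereas a plain struct holding shared names plus untyped values has one concrete type for all schemas:

```julia
# Hypothetical row type: its Julia type does not depend on the column schema,
# unlike a NamedTuple, so row-iterating code compiles once for all datasets.
struct Row
    names::Vector{Symbol}   # shared across all rows of a dataset
    values::Vector{Any}     # one entry per column
end

# Column lookup by name, analogous to `row.colname` on a NamedTuple.
function Base.getproperty(r::Row, name::Symbol)
    name in (:names, :values) && return getfield(r, name)
    i = findfirst(==(name), getfield(r, :names))
    i === nothing && error("no column $name")
    return getfield(r, :values)[i]
end

cols = [:a, :b]
r = Row(cols, Any[1, "x"])
# typeof(r) is just Row, whatever the schema.
```

The trade-off is that field access becomes type-unstable, so per-element work is slower, but compilation cost no longer scales with the number of columns.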
For example, dataset 42087 from openml.org (200MB, 1.5M rows, 13 columns) takes 12 seconds to load. Dataset 42762 from openml.org (18MB, 100 rows, 28k columns) takes 1.5 seconds to load, but converting it to a DataFrame takes 40 seconds, 99% of which is compilation, so there is still excessive specialization going on somewhere (outside this package).
When the dataset contains many columns, parsing grinds to a halt. (See #4)
I assume this is because we produce type-stable code for iterating rows, and that this per-schema specialization is killing the compiler.
Investigate. Perhaps introduce a threshold above which we use type-unstable code.
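The threshold idea could be sketched roughly like this (entirely hypothetical: the cutoff constant and function name are assumptions, not anything from the package): below the cutoff, build type-stable `NamedTuple` rows, which are fast but cost one specialization per schema; above it, fall back to an untyped container so the compiler sees a single row type regardless of column count.

```julia
# Assumed cutoff; the right value would need benchmarking.
const MAXCOLS = 100

# Hypothetical helper: pick a row representation based on column count.
function makerow(names::Vector{Symbol}, values::Vector)
    if length(names) <= MAXCOLS
        # Type-stable: schema is encoded in the NamedTuple's type.
        return NamedTuple{Tuple(names)}(Tuple(values))
    else
        # Type-unstable fallback: one concrete type for all wide schemas.
        return Dict{Symbol,Any}(zip(names, values))
    end
end

small = makerow([:x, :y], [1, 2])                          # NamedTuple
wide  = makerow(Symbol.("c" .* string.(1:200)), collect(1:200))  # Dict
```

A caller still gets name-based column access either way; only the wide path gives up type stability in exchange for bounded compile time.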