cjdoris / ARFFFiles.jl

Load and save ARFF files
MIT License

Faster parsing with thousands of columns #8

Closed. cjdoris closed this issue 3 years ago.

cjdoris commented 3 years ago

When the dataset contains many columns, parsing grinds to a halt. (See #4)

I assume this is because we generate type-stable code for iterating rows, and with thousands of columns that overwhelms the compiler.

Investigate. Perhaps introduce a threshold above which we use type-unstable code.
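A rough sketch of what such a threshold could look like. The names `MAX_TYPED_COLUMNS` and `make_row` and the cutoff value are hypothetical and are not part of ARFFFiles.jl; the fallback to a `Dict` is just one possible type-unstable representation.

```julia
# Hypothetical cutoff: the value 100 is an assumption, not a measured optimum.
const MAX_TYPED_COLUMNS = 100

# Build a row from column names and values.
# Narrow schemas get a NamedTuple (names and types live in the row type);
# wide schemas get a Dict{Symbol,Any} (one type for every schema).
function make_row(colnames::Vector{Symbol}, values::Vector)
    if length(colnames) <= MAX_TYPED_COLUMNS
        return NamedTuple{Tuple(colnames)}(Tuple(values))
    else
        return Dict{Symbol,Any}(zip(colnames, values))
    end
end

narrow = make_row([:a, :b], Any[1, "x"])
wide = make_row([Symbol("x", i) for i in 1:101], Any[i for i in 1:101])
```

Note that `make_row` itself is type-unstable, since its return type depends on a runtime value, so callers would need a function barrier to benefit from the typed branch.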

cjdoris commented 3 years ago

The column names and types are no longer part of the ARFFReader type, and rows are now of type ARFFRow instead of NamedTuple. This puts vastly less strain on the compiler.
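To illustrate the difference, here is a minimal sketch. `SketchRow` is a hypothetical stand-in and does not reflect how `ARFFRow` is actually implemented; it only shows a row type whose concrete type does not depend on the schema, in contrast to `NamedTuple`.

```julia
# Hypothetical illustration only: not the package's ARFFRow.
struct SketchRow
    colnames::Vector{Symbol}   # shared by all rows
    index::Dict{Symbol,Int}    # column name => position, shared by all rows
    values::Vector{Any}        # this row's values (deliberately untyped)
end

# Resolve `row.name` at run time through the shared index.
Base.getproperty(r::SketchRow, name::Symbol) =
    getfield(r, :values)[getfield(r, :index)[name]]
Base.propertynames(r::SketchRow) = getfield(r, :colnames)

ncols = 5_000
colnames = [Symbol("x", i) for i in 1:ncols]
index = Dict(n => i for (i, n) in enumerate(colnames))

# One concrete type, however many columns there are.
row = SketchRow(colnames, index, Any[Float64(i) for i in 1:ncols])

# A NamedTuple instead carries every name and element type in its type,
# so each distinct schema is a new type for the compiler to specialize on.
small = (x1 = 1.0, x2 = "a")
```

The trade-off is that field access on `SketchRow` returns `Any`, so downstream code that needs speed has to reintroduce type information itself, for example through a function barrier.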

For example, dataset 42087 from openml.org (200MB, 1.5M rows, 13 columns) takes 12 seconds to load.

Dataset 42762 from openml.org (18MB, 100 rows, 28k columns) takes 1.5 seconds to load. Converting it to a DataFrame takes 40 seconds, 99% of which is compilation, so there is still excessive specialization going on somewhere (outside this package).