Closed (jbrea closed this issue 3 years ago)
Thanks for the issue. There are three things going on here:
As for the speed difference with CSV, I'm not a parsing expert. I'm happy to get within that ballpark!
I've fixed the above issues now. There's a new release, v1.2.0, making its way through the registrator.
Supporting the sparse format (present in dataset 394) will take a bit more work.
How does the speed compare with CSV.jl now? I think the new parser might be slower on datasets with a smallish number of columns.
Great, thanks! That was quick :)
The parser is still slower than CSV, but very often speed doesn't matter too much for these kinds of tasks. If you want to benchmark, run:
using Pkg; Pkg.add("MLJOpenML")
using MLJOpenML  # bring the package into scope before calling it
MLJOpenML.load(1218)
As soon as ARFFFiles is superior to my hacky parser, I intend to use it in MLJOpenML.
Challenge accepted 😄
FYI I'm registering v1.3.0 which now supports the sparse format and is super fast (about 10% slower than CSV.jl for small numbers of columns, and about 2x faster on thousands of columns).
Amazing!!! Thanks a lot for this!
Hi
Cool package! I tried to use it to download OpenML datasets but ran into some issues.
The following works fine for many datasets
but it fails, e.g., for dataset IDs 379, 394, 554, 42087, and 42904, and it takes ages for 42762 (I aborted eventually).
Also, for dataset 1218, for example, it is roughly 3x faster on my machine to load the data section with the CSV package
instead of using ARFFFiles.
P.S. See here for my hacky ARFF reader.