cjdoris / ARFFFiles.jl

Load and save ARFF files
MIT License

Issues with loading ARFF files from OpenML #4

Closed jbrea closed 3 years ago

jbrea commented 3 years ago

Hi

Cool package! I tried to use it to download openml datasets but ran into some issues.

The following works fine for many datasets

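using HTTP, JSON, ARFFFiles

# Download an OpenML dataset by ID and parse it with ARFFFiles.
function load_openml(id)
    # Fetch the dataset description, which contains the URL of the ARFF file.
    r = JSON.parse(String(HTTP.request("GET", "https://www.openml.org/api/v1/json/data/$id").body))
    arff_file = HTTP.request("GET", r["data_set_description"]["url"])
    # Parse the downloaded bytes as ARFF.
    ARFFFiles.load(IOBuffer(arff_file.body))
end
load_openml(61)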

but it fails for some others, e.g. dataset IDs 379, 394, 554, 42087 and 42904, and it takes ages for 42762 (I eventually aborted it).

Also, for example for data set 1218 it is roughly 3x faster on my machine to load the data section with the CSV package

CSV.File(io; comment = "%", missingstring = "?", quotechar = ''', escapechar = '\\')

instead of using ARFFFiles.
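
To make the comparison concrete, here is a minimal sketch of reading only the data section with CSV.jl. The helper name read_arff_data_with_csv is hypothetical, and the sketch assumes a dense (non-sparse) ARFF file whose data section starts right after a line beginning with @data. It ignores the attribute declarations, so columns get CSV.jl's default names (Column1, Column2, ...) rather than the ARFF attribute names.

```julia
using CSV

# Hypothetical helper (not part of ARFFFiles or CSV.jl): skip the ARFF header
# and hand the data section to CSV.jl.
function read_arff_data_with_csv(path::AbstractString)
    open(path) do io
        # Consume header lines up to and including the "@data" marker.
        for line in eachline(io)
            startswith(lowercase(strip(line)), "@data") && break
        end
        # Copy the remainder into memory, because the file is closed
        # when this do-block returns.
        data = IOBuffer(read(io))
        CSV.File(data; header = false, comment = "%", missingstring = "?",
                 quotechar = '\'', escapechar = '\\')
    end
end
```

Because the header is skipped rather than parsed, this does not recover attribute types or nominal levels, which is part of what a full ARFF reader has to do.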

ps. see here for my hacky ARFF reader.

cjdoris commented 3 years ago

Thanks for the issue. There are three things going on here:

cjdoris commented 3 years ago

As for the speed difference with CSV, I'm not a parsing expert. I'm happy to get within that ballpark!

cjdoris commented 3 years ago

I've fixed the above issues now. There's a new release v1.2.0 making its way through the registrator.

Supporting the sparse format (present in dataset 394) will take a bit more work.

How does the speed compare with CSV.jl now? I think the new parser might be slower on datasets with a smallish number of columns.

jbrea commented 3 years ago

Great, thanks! That was quick :)

The parser is still slower than CSV, but very often speed doesn't matter too much for these kinds of tasks. If you want to benchmark, run:

using Pkg; Pkg.add("MLJOpenML")
using MLJOpenML  # load the package so that MLJOpenML.load is defined
MLJOpenML.load(1218)

As soon as ARFFFiles is superior to my hacky parser, I intend to use it in MLJOpenML.

cjdoris commented 3 years ago

Challenge accepted 😄

cjdoris commented 3 years ago

FYI I'm registering v1.3.0 which now supports the sparse format and is super fast (about 10% slower than CSV.jl for small numbers of columns, and about 2x faster on thousands of columns).

jbrea commented 3 years ago

Amazing!!! Thanks a lot for this!