Issues with loading ARFF files from OpenML

cjdoris / ARFFFiles.jl

Load and save ARFF files

MIT License

Issues with loading ARFF files from OpenML #4

Closed jbrea closed 3 years ago

jbrea commented 3 years ago

Cool package! I tried to use it to download openml datasets but ran into some issues.

The following works fine for many datasets

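using HTTP, JSON, ARFFFiles
# Download an OpenML dataset by ID and parse its ARFF file.
function load_openml(id)
    # Fetch the dataset description (JSON), which contains the ARFF file URL.
    r = JSON.parse(String(HTTP.request("GET", "https://www.openml.org/api/v1/json/data/$id").body))
    arff_file = HTTP.request("GET", r["data_set_description"]["url"])
    # Parse the downloaded bytes directly from memory.
    ARFFFiles.load(IOBuffer(arff_file.body))
end
load_openml(61)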

but it fails for others, e.g. dataset IDs 379, 394, 554, 42087 and 42904, and it takes ages for 42762 (I eventually aborted).

Also, for example for data set 1218 it is roughly 3x faster on my machine to load the data section with the CSV package

CSV.File(io; comment = "%", missingstring = "?", quotechar = '\'', escapechar = '\\')

instead of using ARFFFiles.
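For readers who want to try the CSV route, the call above needs an `io` that is already positioned at the data section. Here is a minimal sketch of one way to get there, using only Base Julia. The helper name `skip_arff_header!` and the toy file are assumptions made for illustration; this is not how ARFFFiles.jl or MLJOpenML does it.

```julia
# Hypothetical helper: consume header lines up to and including "@data".
# Afterwards `io` is positioned at the first line of the data section.
function skip_arff_header!(io::IO)
    header = String[]
    while !eof(io)
        line = strip(readline(io))
        # ARFF keywords are case-insensitive.
        startswith(lowercase(line), "@data") && return header
        # Keep declarations; skip blank lines and '%' comments.
        (isempty(line) || startswith(line, "%")) || push!(header, line)
    end
    error("no @data line found")
end

# Toy ARFF content, made up for this example.
io = IOBuffer("""
% toy file
@relation toy
@attribute x numeric
@attribute y {a,b}
@data
1.5,a
?,b
""")

header = skip_arff_header!(io)   # the @relation and @attribute lines
data_text = read(io, String)     # everything after @data
```

In real use, `io` would be handed to the `CSV.File` call instead of being read into a string. The data section has no header row, so column names and types would still have to be taken from the `@attribute` lines.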

ps. see here for my hacky ARFF reader.

cjdoris commented 3 years ago

Thanks for the issue. There are three things going on here:

Some of these files contain non-ASCII characters. According to the ARFF description, ARFF is an ASCII file format, so strictly speaking these files are malformed. But I guess I could allow UTF-8 encoded files; this would require rewriting the parser.
Some of these files use tabs instead of spaces to delimit words in the header, which I could allow.
The file that takes forever to load is probably slow because it has thousands of columns. I use a generated function to parse lines in a type-stable fashion, and I guess this kills the compiler when there are so many fields. I could probably have a threshold above which we stop caring about type-stability.
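The threshold idea in the last point can be sketched as follows. This is only an illustration, not the ARFFFiles.jl implementation: it uses ordinary dispatch on a tuple type rather than a generated function, and the cutoff `MAX_TYPED_COLUMNS` and all function names are assumptions.

```julia
# Assumed cutoff, chosen for illustration only.
const MAX_TYPED_COLUMNS = 100

# Parse one field according to its column type.
parse_field(::Type{Float64}, s) = parse(Float64, s)
parse_field(::Type{String}, s) = String(s)

# Typed path: the column types are part of the tuple type `T`, so the compiler
# specializes this method once per schema. That is cheap for a few columns
# and expensive for thousands.
function parse_typed(::Type{T}, fields) where {T<:Tuple}
    ntuple(i -> parse_field(fieldtype(T, i), fields[i]), Val(fieldcount(T)))
end

# Untyped path: one generic loop, no per-schema specialization.
function parse_untyped(types::Vector{DataType}, fields)
    Any[parse_field(t, f) for (t, f) in zip(types, fields)]
end

# Pick the path based on the number of columns.
function parse_row(types::Vector{DataType}, line::AbstractString)
    fields = split(line, ',')
    if length(types) <= MAX_TYPED_COLUMNS
        parse_typed(Tuple{types...}, fields)
    else
        parse_untyped(types, fields)
    end
end

narrow = parse_row([Float64, String], "1.5,a")
wide = parse_row(fill(Float64, 200), join(fill("1", 200), ","))
```

Here `narrow` comes back as a concretely typed tuple, while `wide` is a plain `Vector{Any}`: the wide case gives up type information in exchange for not compiling a 200-field specialization.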

cjdoris commented 3 years ago

As for the speed difference with CSV, I'm not a parsing expert. I'm happy to get within that ballpark!

cjdoris commented 3 years ago

I've fixed the above issues now. There's a new release v1.2.0 making its way through the registrator.

Supporting the sparse format (present in dataset 394) will take a bit more work.
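For context, a sparse ARFF row lists only the non-zero entries as `{index value, ...}`, with 0-based indices; omitted entries are zero, not missing. A minimal sketch of expanding such a row is below. The helper name is hypothetical, and the sketch deliberately ignores quoted values that contain commas or spaces, which a real parser would need to handle.

```julia
# Hypothetical helper: expand one sparse ARFF row such as "{1 X, 3 Y}" into
# a dense vector of `ncols` string fields. Indices are 0-based in ARFF and
# omitted entries stand for "0".
function expand_sparse_row(line::AbstractString, ncols::Integer)
    dense = fill("0", ncols)
    body = strip(strip(line), ['{', '}'])
    for entry in split(body, ','; keepempty = false)
        # Each entry is "<index> <value>"; split on the first whitespace run.
        idx, value = split(strip(entry); limit = 2)
        dense[parse(Int, idx) + 1] = strip(value)   # shift to 1-based indexing
    end
    dense
end

row = expand_sparse_row("{1 X, 3 Y}", 5)
empty_row = expand_sparse_row("{}", 3)
```

Building a dense vector is only for illustration; with thousands of mostly-zero columns, storing the `(index, value)` pairs directly would be the more natural choice.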

How does the speed compare with CSV.jl now? I think the new parser might be slower on datasets with a smallish number of columns.

jbrea commented 3 years ago

Great, thanks! That was quick :)

The parser is still slower than CSV, but very often speed doesn't matter too much for these kinds of tasks. If you want to benchmark, run:

using Pkg; Pkg.add("MLJOpenML")
using MLJOpenML
MLJOpenML.load(1218)

As soon as ARFFFiles is superior to my hacky parser, I intend to use it in MLJOpenML.

cjdoris commented 3 years ago

Challenge accepted 😄

cjdoris commented 3 years ago

FYI I'm registering v1.3.0 which now supports the sparse format and is super fast (about 10% slower than CSV.jl for small numbers of columns, and about 2x faster on thousands of columns).

jbrea commented 3 years ago

Amazing!!! Thanks a lot for this!