JuliaML / MLDatasets.jl

Utility package for accessing common Machine Learning datasets in Julia
https://juliaml.github.io/MLDatasets.jl/stable
MIT License

Cannot download MNIST #57

Closed HTofi closed 3 years ago

HTofi commented 3 years ago

Hello,

I've been trying to use the package for the first time:

julia> using MLDatasets

julia> MNIST.traindata(1)

The function tries to download the data from http://yann.lecun.com/exdb/mnist/, but it seems the website is down.

johnnychen94 commented 3 years ago

Interesting, this might be an issue with your local network; it works fine here in China, with or without a proxy.

guyvdbroeck commented 3 years ago

This has been an issue for about a week now, since LeCun's website went down for a while. It's back up now, but downloads still appear to fail. Our unit tests have been failing because of it (https://github.com/Juice-jl/LogicCircuits.jl/runs/2100587498?check_suite_focus=true). If you are not seeing the issue, you might have the dataset cached somewhere locally.
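One way to check the caching guess is a minimal Julia sketch that lists which of the four MNIST archives are already present in a folder. It assumes the per-user folder ~/.julia/datadeps/MNIST that the download prompt quoted later in this thread points at; DataDeps may also look in other locations, and the helper name missing_mnist_files is made up for illustration.

```julia
# File names taken from the MNIST download prompt quoted in this thread.
mnist_files = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

# Hypothetical helper (not part of MLDatasets or DataDeps): returns the
# expected file names that are absent from `dir`.
function missing_mnist_files(dir::AbstractString, files = mnist_files)
    return [name for name in files if !isfile(joinpath(dir, name))]
end

# Assumed cache folder; adjust it if your datadeps live elsewhere.
cache_dir = joinpath(homedir(), ".julia", "datadeps", "MNIST")
not_cached = missing_mnist_files(cache_dir)
if isempty(not_cached)
    println("All four MNIST files are cached in ", cache_dir)
else
    println("Not cached in ", cache_dir, ": ", join(not_cached, ", "))
end
```

If nothing is reported as missing, MNIST.traindata() is probably reading the local copies, which would explain why the failure does not show up on that machine.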

brechetp commented 3 years ago

Same issue here; a (hacky) workaround would be to fetch the images from the Web Archive.

First dev the MLDatasets package (] dev MLDatasets), then change the URL in the register call (at the end of the file ~/.julia/dev/MLDatasets/src/MNIST/MNIST.jl) to:

"https://web.archive.org/web/20160828233817/http://yann.lecun.com/exdb/mnist/"

After that, running using MLDatasets; MNIST.traindata() finds and downloads the files. You can "undev" MLDatasets afterwards with ] free MLDatasets.
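To make the effect of that edit concrete, here is a small Julia sketch of the URLs the download would then request. It assumes the package forms each URL by appending the file name to the base URL, which matches the URL list in the download prompt quoted later in this thread; the function name mnist_urls is hypothetical and not part of MLDatasets.

```julia
# Base URL from the workaround above (a Web Archive snapshot of the MNIST page).
archive_base = "https://web.archive.org/web/20160828233817/http://yann.lecun.com/exdb/mnist/"

# Hypothetical helper: append each of the four MNIST file names to a base URL.
# Assumption: the base URL ends with a slash, as it does in the workaround.
function mnist_urls(base::AbstractString)
    files = [
        "train-images-idx3-ubyte.gz",
        "train-labels-idx1-ubyte.gz",
        "t10k-images-idx3-ubyte.gz",
        "t10k-labels-idx1-ubyte.gz",
    ]
    return [base * f for f in files]
end

# Print the four archived URLs, one per line.
foreach(println, mnist_urls(archive_base))
```

Opening one of the printed URLs in a browser is a quick way to confirm that the snapshot still serves the file before editing the package source.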

CarloLucibello commented 3 years ago

If anyone knows a more reliable provider of MNIST, we can just change the URL in MLDatasets to that.

CarloLucibello commented 3 years ago

I can manually download the file from LeCun's website now, but MLDatasets still errors. This is weird, because the URLs have not changed.

julia> MNIST.traindata(1)
This program has requested access to the data dependency MNIST.
which is not currently installed. It can be installed automatically, and you will not see this message again.

Dataset: THE MNIST DATABASE of handwritten digits
Authors: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
Website: http://yann.lecun.com/exdb/mnist/

[LeCun et al., 1998a]
    Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
    "Gradient-based learning applied to document recognition."
    Proceedings of the IEEE, 86(11):2278-2324, November 1998

The files are available for download at the offical
website linked above. Note that using the data
responsibly and respecting copyright remains your
responsibility. The authors of MNIST aren't really
explicit about any terms of use, so please read the
website to make sure you want to download the
dataset.

Do you want to download the dataset from ["http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"] to "/home/carlo/.julia/datadeps/MNIST"?
[y/n]
y
ERROR: HTTP.ExceptionRequest.StatusError(503, "GET", "/exdb/mnist/train-images-idx3-ubyte.gz", HTTP.Messages.Response:
"""
HTTP/1.1 503 Service Unavailable
Date: Thu, 18 Mar 2021 12:38:49 GMT
Content-Type: text/html; charset=iso-8859-1
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=d29b97018afbffae7850db427044a63511616071128; expires=Sat, 17-Apr-21 12:38:48 GMT; path=/; domain=.lecun.com; HttpOnly; SameSite=Lax
CF-Cache-Status: DYNAMIC
cf-request-id: 08e6f18f48000032c4ec894000000001
Report-To: {"max_age":604800,"group":"cf-nel","endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report?s=zjCWPPwVy03sTPJggLy6leUBvyQbiSotTdyT0Ix82YV%2FLgN5oTb%2B5R6QRy5EtD1UEqI3aoa3%2B1iGaLMb7%2FEGssZp7U%2FYVcgHiHGmfwEWFQ%3D%3D"}]}
NEL: {"max_age":604800,"report_to":"cf-nel"}
Server: cloudflare
CF-RAY: 631e852baaab32c4-CDG
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400

""")
Stacktrace:
  [1] (::Base.var"#839#841")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#839#841", itr::Vector{Any})
    @ Base ./abstractarray.jl:2141
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{String})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] async_usemap(f::DataDeps.var"#12#13"{typeof(DataDeps.fetch_default), String}, c::Vector{String}; ntasks::Int64, batch_size::Nothing)
    @ Base ./asyncmap.jl:103
  [6] #asyncmap#823
    @ ./asyncmap.jl:81 [inlined]
  [7] asyncmap
    @ ./asyncmap.jl:81 [inlined]
  [8] run_fetch
    @ ~/.julia/packages/DataDeps/ooWXe/src/resolution_automatic.jl:104 [inlined]
  [9] download(datadep::DataDeps.DataDep{String, Vector{String}, typeof(DataDeps.fetch_default), typeof(identity)}, localdir::String; remotepath::Vector{String}, i_accept_the_terms_of_use::Nothing, skip_checksum::Bool)
    @ DataDeps ~/.julia/packages/DataDeps/ooWXe/src/resolution_automatic.jl:78
 [10] download
    @ ~/.julia/packages/DataDeps/ooWXe/src/resolution_automatic.jl:70 [inlined]
 [11] handle_missing
    @ ~/.julia/packages/DataDeps/ooWXe/src/resolution_automatic.jl:10 [inlined]
 [12] _resolve(datadep::DataDeps.DataDep{String, Vector{String}, typeof(DataDeps.fetch_default), typeof(identity)}, calling_filepath::String)
    @ DataDeps ~/.julia/packages/DataDeps/ooWXe/src/resolution.jl:83
 [13] resolve(datadep::DataDeps.DataDep{String, Vector{String}, typeof(DataDeps.fetch_default), typeof(identity)}, inner_filepath::String, calling_filepath::String)
    @ DataDeps ~/.julia/packages/DataDeps/ooWXe/src/resolution.jl:29
 [14] resolve(datadep_name::String, inner_filepath::String, calling_filepath::String)
    @ DataDeps ~/.julia/packages/DataDeps/ooWXe/src/resolution.jl:54
 [15] resolve
    @ ~/.julia/packages/DataDeps/ooWXe/src/resolution.jl:73 [inlined]
 [16] #2
    @ ~/.julia/packages/MLDatasets/y8COP/src/download.jl:17 [inlined]
 [17] withenv(f::MLDatasets.var"#2#3"{String, Nothing}, keyvals::Pair{String, String})
    @ Base ./env.jl:161
 [18] with_accept
    @ ~/.julia/packages/MLDatasets/y8COP/src/download.jl:10 [inlined]
 [19] #datadir#1
    @ ~/.julia/packages/MLDatasets/y8COP/src/download.jl:14 [inlined]
 [20] datadir
    @ ~/.julia/packages/MLDatasets/y8COP/src/download.jl:14 [inlined]
 [21] datafile(depname::String, filename::String, dir::Nothing; recurse::Bool, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ MLDatasets ~/.julia/packages/MLDatasets/y8COP/src/download.jl:32
 [22] datafile
    @ ~/.julia/packages/MLDatasets/y8COP/src/download.jl:32 [inlined]
 [23] #traintensor#2
    @ ~/.julia/packages/MLDatasets/y8COP/src/MNIST/interface.jl:49 [inlined]
 [24] #traindata#10
    @ ~/.julia/packages/MLDatasets/y8COP/src/MNIST/interface.jl:221 [inlined]
 [25] #traindata#11
    @ ~/.julia/packages/MLDatasets/y8COP/src/MNIST/interface.jl:225 [inlined]
 [26] traindata(args::Int64)
    @ MLDatasets.MNIST ~/.julia/packages/MLDatasets/y8COP/src/MNIST/interface.jl:225
 [27] top-level scope
    @ REPL[2]:1

queensferryme commented 3 years ago

The same issue here.

johnnychen94 commented 3 years ago

One way to solve this issue (and potential similar issues in the future) is to introduce Artifacts. We can make a fake release, upload the datasets, and then bind the artifacts. The datasets available in this package are usually small enough, so storing them on Julialang's storage server should be okay.

I'm currently busy with my GSoC project, so I won't get to this soon, but if anyone is interested in it, I can offer help. An example of this can be found in TestImages.jl.

Update:

I've uploaded the MNIST datasets to https://github.com/JuliaML/MLDatasets.jl/releases/tag/v0.6.0-datasets. Anyone who hits this issue and just wants a quick fix can manually download these four files into ~/.julia/datadeps/MNIST and remove the mnist_ prefix from the file names; MLDatasets will then skip the download.

.julia/datadeps/
└── MNIST
    ├── t10k-images-idx3-ubyte.gz
    ├── t10k-labels-idx1-ubyte.gz
    ├── train-images-idx3-ubyte.gz
    └── train-labels-idx1-ubyte.gz
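
The renaming step can be scripted. Below is a minimal Julia sketch, assuming the four files were already downloaded by hand into the target folder with the mnist_ prefix still in their names; strip_mnist_prefix is a hypothetical helper, not part of MLDatasets.

```julia
# Hypothetical helper (not part of MLDatasets): rename every file in `dir`
# whose name starts with `prefix` to the same name without the prefix.
# Returns the new names, sorted.
function strip_mnist_prefix(dir::AbstractString; prefix::AbstractString = "mnist_")
    renamed = String[]
    for name in readdir(dir)
        # Skip files without the prefix, and a file named exactly like the prefix.
        if !startswith(name, prefix) || name == prefix
            continue
        end
        newname = replace(name, prefix => ""; count = 1)
        # mv throws if the target name already exists, so nothing is overwritten.
        mv(joinpath(dir, name), joinpath(dir, newname))
        push!(renamed, newname)
    end
    return sort(renamed)
end

# Example call for the folder named above (left commented out on purpose):
# strip_mnist_prefix(joinpath(homedir(), ".julia", "datadeps", "MNIST"))
```

After the rename, the folder should match the layout shown above, and files that never had the prefix are left untouched.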

(Maybe I should upload only one artifact per dataset?)

johnnychen94 commented 3 years ago

I haven't tried it, but one could alternatively use the EMNIST MNIST dataset, which is available in MLDatasets@0.5.5. For example:

- using MLDatasets: MNIST
+ using MLDatasets.EMNIST: MNIST

It's not identical to MNIST, but according to the EMNIST paper:

the EMNIST MNIST dataset is intended to exactly match the size and specifications of the original MNIST dataset. It is intended to be a drop-in replacement for the original MNIST dataset containing digits created through the conversion process outlined in Section II-A.

CarloLucibello commented 3 years ago

I found a reliable mirror!

audreyyeoCH commented 5 months ago

@CarloLucibello can you share the steps you followed?