JuliaML / MLDatasets.jl

Utility package for accessing common Machine Learning datasets in Julia
https://juliaml.github.io/MLDatasets.jl/stable
MIT License

Add Medical Decathlon Datasets #47

Open Dale-Black opened 3 years ago

Dale-Black commented 3 years ago

I am new to Julia and I am working on a medical imaging research project. I would like to add the medical decathlon datasets (http://medicaldecathlon.com) (https://arxiv.org/pdf/1902.09063.pdf) to this repo as I think it would be a great way for me to learn what's going on and it would likely benefit the entire Julia community. I will definitely need help in this endeavor though so please let me know if that is something of interest to the contributors of this project.

Dale-Black commented 3 years ago

For reference, project Monai (https://monai.io) has this functionality and they have already prepared a public dropbox location (https://github.com/Project-MONAI/MONAI/blob/master/monai/apps/datasets.py)

johnnychen94 commented 3 years ago

At a glance, this dataset seems to be >50 GB hosted on Google Drive; I'm afraid that's not suitable for this repo.

Tokazama commented 3 years ago

It seems odd to me that MONAI created a Dropbox for this when the downloads can be accessed via a link to a public Google Drive account. I mentioned Artifacts on Slack. Using an Artifacts.toml in a package basically allows you to have a version-controlled script for downloading stuff.
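
For anyone unfamiliar with the mechanism, here is a minimal sketch of what binding one Decathlon task as a lazy artifact could look like. The task name, URL, and hashes below are placeholders I picked for illustration, not the real Decathlon values:

```julia
using Pkg.Artifacts

# Hypothetical sketch: record one Decathlon task as a lazy artifact entry
# in the package's Artifacts.toml. All hashes and the URL are placeholders.
artifacts_toml = joinpath(@__DIR__, "Artifacts.toml")

bind_artifact!(artifacts_toml, "Task01_BrainTumour",
               Base.SHA1("0123456789abcdef0123456789abcdef01234567");  # placeholder git-tree-sha1
               download_info = [(
                   "https://example.com/Task01_BrainTumour.tar.gz",     # placeholder URL
                   "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"  # placeholder sha256
               )],
               lazy = true)  # only fetched the first time it is requested

# Later, `artifact"Task01_BrainTumour"` resolves to the local path of the
# downloaded, content-verified directory.
```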

johnnychen94 commented 3 years ago

It's still not a good practice/experience to host an artifact of over 1 GB. Besides, the whole Julia ecosystem currently produces <558 GB of data across all of its artifacts; adding this dataset as artifacts would increase the disk pressure by roughly 1/10, and I don't think we should advertise this "solution".

Ref: check the Julia item for the storage in https://mirrors.bfsu.edu.cn/status/#server-status

Tokazama commented 3 years ago

I didn't know we hosted the download associated with each artifact. Why do we download the URL associated with an artifact? According to the Pkg documentation, the user still receives the download from the URL in the artifact.

Dale-Black commented 3 years ago

I think I had that wrong. MONAI hosts the dataset on AWS. My best guess as to why they chose to do this is that Google Drive has a daily download limit (which I often ran into when previously downloading it from the original Google Drive link).

johnnychen94 commented 3 years ago

> I didn't know we hosted the download associated with each artifact. Why do we download the URL associated with an artifact? According to the Pkg documentation, the user still receives the download from the URL in the artifact.

I'm not sure I understand you correctly. Generally, we download artifacts from Pkg servers, which are backed by storage servers hosted by julialang.org. So no, we don't download the dataset from the original URL unless the download from the Pkg server fails.

Currently, a copy of every artifact that provides a URL is kept on all of those storage servers, and there isn't a hard size limit on it. That means that if we listed the dataset in Artifacts.toml with a URL provided, we would effectively be running a stress test on the storage servers....

cc: @staticfloat

johnnychen94 commented 3 years ago

FYI, I think dvc is a better tool for managing large datasets and experiments. It works with any language.

staticfloat commented 3 years ago

> According to the Pkg documentation, the user still receives the download from the URL in the artifact.

It's a little complex; the Artifacts.toml contains the source URLs that your client can download from, but the client will first attempt to download from a Pkg server, because those are generally closer and higher-performance.

> Currently, a copy of every artifact that provides a URL is kept on all of those storage servers, and there isn't a hard size limit on it.

If users want to create 50 GB artifacts, they are more than welcome to, but we will probably prevent them from being cached on the Pkg servers. :) That would then cause the downloads to fall back to the original location, so that's totally fine.

That being said, I also suggest DataDeps.jl as a natural solution for these very large datasets. That should make it easier for your package to download the dataset directly from the origin server, no matter what format it's in.
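
If you go that route, a minimal DataDeps.jl registration could look roughly like the sketch below. The URL, checksum, and dataset name are placeholders for illustration, not the real Decathlon values:

```julia
using DataDeps

# Hypothetical sketch of registering one Decathlon task with DataDeps.jl.
# The URL and sha256 checksum are placeholders, not the real values.
register(DataDep(
    "MedicalDecathlon_Task01",
    """
    Medical Segmentation Decathlon, Task01 (BrainTumour).
    See http://medicaldecathlon.com for license and citation details.
    """,
    "https://example.com/Task01_BrainTumour.tar",  # placeholder download URL
    "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef";  # placeholder sha256
    post_fetch_method = unpack  # DataDeps helper that extracts the archive after download
))

# First use triggers the download (with a prompt) and returns the local directory:
dir = datadep"MedicalDecathlon_Task01"
```

With this approach the data is fetched straight from wherever it is hosted (Google Drive, AWS, etc.), so nothing has to pass through the Pkg/storage servers.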