gaelforget opened this issue 1 year ago
Regarding data uploads.
See https://github.com/gdcc/Dataverse.jl/issues/13 , https://github.com/IQSS/dataverse/issues/9298
We'd like to document how much data gets downloaded, since users with limited internet access may want to know.
In juliageo.ipynb all data gets downloaded, which amounts to about 100MB. This is almost entirely the geometries from GADM, which only need to be downloaded once. A Manifest.toml is also uploaded for reproducibility.
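For what it's worth, a minimal sketch of how a user could recreate the environment from that uploaded Manifest.toml (assuming the Project.toml / Manifest.toml pair sits next to the notebook):

```julia
# Minimal sketch: recreate the exact package environment recorded in the
# uploaded Manifest.toml (assumes Project.toml and Manifest.toml are in the
# same folder as the notebook).
using Pkg

Pkg.activate(@__DIR__)   # use the environment shipped with the notebook
Pkg.instantiate()        # fetch the exact package versions from the Manifest
Pkg.status()             # optional: print the resolved versions for the record
```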
Btw, I anticipate a nice set of file types will end up in the JuliaEO dataverse repo.
It likely would be useful to have a few example files left out of the zip archives, e.g. to demo previewing and maybe lazy-access functionality.
Looping in @pdurbin @atrisovic @felixcremer @rafaqz @Alexander-Barth just in case
Quoting from @pdurbin on a different platform, about the auto-unzip functionality for uploads:
you have to double zip. This is how I do it for my dataset:
zip -r primary-data.zip primary-data -x '**/.*' -x '**/__MACOSX'
zip -r outer.zip primary-data.zip -x '**/.*' -x '**/__MACOSX'
For supporting notebooks that require data from Dataverse, the simplest thing I could think of would be to have one tar.gz or zip file associated with the notebook folder in GitHub, e.g. Data_Visualizations_with_Makie.tar.gz or some zip version.
I think this makes life easy on the self-guided user.
Are there strong preferences between tar.gz and zip?
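Either way, as a rough sketch of what fetching and unpacking one of these per-notebook archives might look like on the user side (the URL and file names are placeholders, and CodecZlib is a third-party dependency):

```julia
# Sketch only: fetch a hypothetical per-notebook archive and unpack it.
# The URL and file names are placeholders, not actual dataset entries.
using Downloads, Tar
using CodecZlib: GzipDecompressorStream   # third-party package

url = "https://example.org/Data_Visualizations_with_Makie.tar.gz"  # placeholder
archive = Downloads.download(url)

# .tar.gz case: decompress on the fly and extract into a "data" folder
# (Tar.extract requires the destination to be new or empty).
open(archive) do io
    Tar.extract(GzipDecompressorStream(io), joinpath(@__DIR__, "data"))
end

# For a .zip archive one could instead shell out, e.g. run(`unzip archive.zip -d data`).
```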
Supposedly we can add software on the Dataverse or Julia side to extract metadata from tar.gz files and preview them in the UI, without downloading the whole file.
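On the Julia side, just listing a tar.gz's contents without extracting it is already doable with the standard Tar library plus a gzip decompressor; a rough sketch, with a placeholder file name:

```julia
# Sketch: list the contents of a (placeholder) tar.gz archive without extracting
# it, e.g. to drive a simple preview of what a per-notebook archive contains.
using Tar
using CodecZlib: GzipDecompressorStream   # third-party package

entries = open("Data_Visualizations_with_Makie.tar.gz") do io
    Tar.list(GzipDecompressorStream(io))   # returns a vector of Tar.Header
end

for h in entries
    println(h.path, "  (", h.size, " bytes)")
end
```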
It would also be good to grab the Dockerfile as a preview of a Docker image like DockerImage.tar.gz, I think.
sessions that used custom Docker images
We'd like to collect these too and upload them to the Dataverse as well if possible. The central one that's meant to support running all notebooks (ideally) has already been posted there.
@gaelforget @visr last week we enabled a Binder button on Harvard Dataverse:
It looks like this on the "Global Workshop on Earth Observation with Julia 2023" dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OYBLGK
I'm scared to click it, though, because the Docker image is 1.7 GB! 😅 Binder will try to download all the files in the dataset, including that giant one.
Binder supports defining your own Dockerfile (an alternative to uploading the image itself): https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html
Is this something you'd like to try?
For more on Binder from the Dataverse perspective: https://guides.dataverse.org/en/5.12.1/admin/integrations.html#binder
For Land_Cover_Classification_of_Earth_Observation_images the files used, depending on user choice, are:
For RF_classification_using_marida the files used are: Bands_Indices-S2 and Train_Test-Datasets.
A couple questions I have :
@EmanuelCastanho
Yes, Train_Test-Datasets is a subset created by me from the original MARIDA dataset. According to their MIT License and Creative Commons Attribution 4.0 International licenses, I think it is fine to include this subset in the Dataverse.
Please include this citation to the original dataset, if possible: Kikaki K, Kakogeorgiou I, Mikeli P, Raitsos DE, Karantzalos K (2022) MARIDA: A benchmark for Marine Debris detection from Sentinel-2 remote sensing data. PLoS ONE 17(1): e0262247. https://doi.org/10.1371/journal.pone.0262247
Bands_Indices-S2 can be included in the Dataverse.
Done. Thanks!
Dataverse: https://doi.org/10.7910/DVN/OYBLGK
Zenodo: https://doi.org/10.5281/zenodo.8113073
automated data downloads have now been implemented for some notebooks
To enable future reproducibility of the notebooks by as many users as possible, here is what we are envisioning.
We are using Dataverse for this. It has great support for metadata, which helps refer to all relevant data sources that should be credited.
sessions that used an external data set shared at the workshop
We'd like to collect one zipped file per session and push it to the Dataverse repo as soon as possible.
This will create a DOI and a permanent archive. It will also allow for automatic download and browsing via Dataverse.jl.
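To illustrate, a rough sketch of what the automatic-download side could look like, going directly through the Dataverse native and data-access HTTP APIs rather than Dataverse.jl's own interface (the choice of file below is arbitrary, and JSON3 is just one way to parse the response):

```julia
# Sketch only: list the files of the workshop dataset via the Dataverse native API
# and download one of them via the data access API. This bypasses Dataverse.jl and
# talks to the HTTP API directly; error handling is omitted.
using Downloads
import JSON3   # third-party package, used here for JSON parsing

server = "https://dataverse.harvard.edu"
pid = "doi:10.7910/DVN/OYBLGK"   # the workshop dataset

# Dataset metadata, including the list of files in the latest version
meta_file = Downloads.download("$server/api/datasets/:persistentId/?persistentId=$pid")
meta = JSON3.read(read(meta_file, String))

files = meta.data.latestVersion.files
for f in files
    println(f.dataFile.id, "  ", f.dataFile.filename)
end

# Download one file by its numeric id (picked arbitrarily for this sketch)
id = files[1].dataFile.id
Downloads.download("$server/api/access/datafile/$id", files[1].dataFile.filename)
```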
sessions that only downloaded data automatically
We'd like to document how much data gets downloaded, since users with limited internet access may want to know.
It could also be a good idea to provide a zipped file to archive on Dataverse as a backup, if possible.
sessions that used custom Docker images
We'd like to collect these too and upload them to the Dataverse as well if possible. The central one that's meant to support running all notebooks (ideally) has already been posted there.
If you prefer to maintain and archive things your own way, that's totally fine too. We'd like a DOI though, so we can refer to it and rely on it in the future.
Non-text or large files in the repo
Ideally, one may want to avoid putting non-text files in the repo -- these cannot be diff'ed with git in a practical way, and they increase the time to download or clone the repo.
For ipynb files, for example, one can instead provide jupytext versions and point to a rendered version elsewhere (e.g. the GitHub page for the repo, in a separate branch). pdf, png, etc. files can all be put elsewhere too in order to keep the repo small; that would likely be needed for the whole repo to stay under, say, 100MB.
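For example, a hedged sketch of keeping only a jupytext text version in the repo (assuming the jupytext command-line tool is installed; the notebook name is a placeholder and the exact flags may need adjusting):

```julia
# Sketch: pair a notebook with a plain-text jupytext version so only text files
# live in the repo. Assumes the jupytext command-line tool is installed; the
# notebook name is a placeholder and the flags may need adjusting.
notebook = "Data_Visualizations_with_Makie.ipynb"

# Convert the notebook to a script in its own language (a .jl file for a Julia
# notebook), which diffs cleanly under git:
run(`jupytext --to script $notebook`)

# And, when needed, regenerate the .ipynb from the text version:
run(`jupytext --to notebook $(replace(notebook, ".ipynb" => ".jl"))`)
```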