AIRCentre / JuliaEO

Repository (data, code, etc.) for the workshop: JuliaEO - Earth Observation with Julia

Files used in notebooks are not all in the Dataverse yet #36

Open gaelforget opened 1 year ago

gaelforget commented 1 year ago

To enable future reproducibility of the notebooks by as many users as possible, here is what we are envisioning.

We are using Dataverse for this. It has great support for metadata, which helps reference all relevant data sources that should be credited.

sessions that used an external data set shared at the workshop

We'd like to collect one zipped file per session and push it to the Dataverse repo as soon as possible.

This will create a DOI and permanent archive. It will also allow for automatic download and browsing via Dataverse.jl
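As a rough sketch of what that automatic access can look like (this is not Dataverse.jl itself, just Julia's Downloads standard library plus JSON3.jl talking to the Dataverse native and data access web APIs; the JSON field names reflect my understanding of that API, and the chosen file is arbitrary):

```julia
# List the files archived under the dataset DOI and fetch one of them.
# Assumes the dataset is published on Harvard Dataverse; JSON3.jl must be added first.
using Downloads, JSON3

server = "https://dataverse.harvard.edu"
doi = "doi:10.7910/DVN/OYBLGK"

buf = IOBuffer()
Downloads.download("$server/api/datasets/:persistentId/?persistentId=$doi", buf)
meta = JSON3.read(take!(buf))

files = meta.data.latestVersion.files    # one entry per archived file
id = files[1].dataFile.id                # numeric id of the first file
name = files[1].dataFile.filename
Downloads.download("$server/api/access/datafile/$id", name)
```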

sessions that only downloaded data automatically

We'd like to document how much data gets downloaded, since users with limited internet access may want to know.
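(A trivial way to measure that, for a notebook that writes everything under one folder; `datadir` below is just a placeholder:)

```julia
# Sum the sizes of everything a session downloaded, to report it in the README.
datadir = "data"  # placeholder: whatever folder the notebook downloads into
nbytes = sum(filesize(joinpath(root, f))
             for (root, _, files) in walkdir(datadir) for f in files; init=0)
println("downloaded ≈ ", round(nbytes / 2^20; digits=1), " MB")
```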

It could also be a good idea to provide a zipped file to archive at Dataverse as a backup, if possible.

sessions that used custom Docker images

We'd like to collect these too and upload them to the Dataverse as well if possible. The central one that's meant to support running all notebooks (ideally) has already been posted there.
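(For reference, exporting an image to a single compressed file can be a one-liner; a sketch assuming Docker and gzip are on the PATH, with a hypothetical image name:)

```julia
# Save a custom Docker image as DockerImage.tar.gz, ready to upload.
# "juliaeo-session:latest" is a placeholder name.
run(pipeline(`docker save juliaeo-session:latest`, `gzip`, "DockerImage.tar.gz"))
```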

If you prefer to maintain and archive it your own way, that's totally fine too. We'd like a DOI though, so we can refer to it and rely on it in the future.

Non-text or large files in the repo

Ideally, one may want to avoid putting non-text files in the repo -- these cannot be diff'ed with git in a practical way, and they increase the time to download or clone the repo.

For ipynb files, for example, one can instead provide jupytext versions and point to a rendered version elsewhere (e.g. the GitHub page for the repo, in a separate branch).
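A sketch of that, assuming the jupytext CLI is installed (e.g. `pip install jupytext`): batch-convert every notebook to a plain-text markdown version that diffs cleanly, so the `.ipynb` itself can live outside the repo.

```julia
# Convert each notebook under the repo to a paired .md file via jupytext.
for (root, _, files) in walkdir(".")
    for f in filter(endswith(".ipynb"), files)
        run(`jupytext --to md $(joinpath(root, f))`)
    end
end
```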

pdf, png, etc. files can all be put elsewhere too in order to keep the repo small. That would likely be needed for the whole repo to stay under, say, 100 MB.

gaelforget commented 1 year ago

Regarding data uploads.

See https://github.com/gdcc/Dataverse.jl/issues/13 , https://github.com/IQSS/dataverse/issues/9298

visr commented 1 year ago

> We'd like to document how much data gets downloaded, since users with limited internet access may want to know.

In juliageo.ipynb all data gets downloaded automatically, about 100 MB in total. This is almost entirely the geometries from GADM, which only need to be downloaded once. A Manifest.toml is also uploaded for reproducibility.
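(For self-guided users, restoring that pinned environment before running the notebook is just the usual two calls, assuming the Manifest.toml and its Project.toml sit next to the notebook:)

```julia
using Pkg
Pkg.activate(@__DIR__)   # folder holding Project.toml + Manifest.toml
Pkg.instantiate()        # installs the exact versions recorded in the Manifest
```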

gaelforget commented 1 year ago

Btw, I anticipate a nice set of file types will end up in the JuliaEO dataverse repo.

It would likely be useful to have a few example files outside of the zip archives, to demo previewing and maybe lazy-access functionality, for example.

Looping in @pdurbin @atrisovic @felixcremer @rafaqz @Alexander-Barth just in case

gaelforget commented 1 year ago

Quoting from @pdurbin on a different platform, about the auto-unzip functionality for uploads:

> you have to double zip. This is how I do it for my dataset:

```sh
zip -r primary-data.zip primary-data -x '**/.*' -x '**/__MACOSX'
zip -r outer.zip primary-data.zip -x '**/.*' -x '**/__MACOSX'
```

gaelforget commented 1 year ago

For supporting notebooks that require data from Dataverse, the simplest thing I could think of would be to have one tar.gz or zip file associated with the notebook folder in GitHub.

e.g., Data_Visualizations_with_Makie.tar.gz or a zip equivalent.

I think this makes life easy for the self-guided user.

Are there strong preferences between tar.gz and zip?
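For what it's worth, the tar.gz route is easy to script on the Julia side with Tar.jl and CodecZlib.jl; a minimal sketch reusing the example folder name above (a zip could be produced similarly, e.g. with ZipFile.jl):

```julia
using Tar, CodecZlib

# Pack the per-notebook data folder into a gzip-compressed tarball.
tar_gz = open("Data_Visualizations_with_Makie.tar.gz", write=true)
tar = GzipCompressorStream(tar_gz)
Tar.create("Data_Visualizations_with_Makie", tar)
close(tar)   # flushes the compressor and closes the file
```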

Supposedly we can add software on the Dataverse or Julia side to extract metadata from the tar.gz and preview it in the UI, without downloading the whole file.

It would also be good to include the Dockerfile as a preview of a Docker image like DockerImage.tar.gz, I think.

pdurbin commented 1 year ago

> sessions that used custom Docker images
>
> We'd like to collect these too and upload them to the Dataverse as well if possible. The central one that's meant to support running all notebooks (ideally) has already been posted there.

@gaelforget @visr last week we enabled a Binder button on Harvard Dataverse:

It looks like this on the "Global Workshop on Earth Observation with Julia 2023" dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OYBLGK

(screenshot: the Binder button on the dataset page, 2023-02-01)

I'm scared to click it, though, because the Docker image is 1.7 GB! 😅 Binder will try to download all the files in the dataset, including that giant one.

Binder supports defining your own Dockerfile (an alternative to uploading the image itself): https://mybinder.readthedocs.io/en/latest/tutorials/dockerfile.html

Is this something you'd like to try?

For more on Binder from the Dataverse perspective: https://guides.dataverse.org/en/5.12.1/admin/integrations.html#binder

gaelforget commented 1 year ago

For Land_Cover_Classification_of_Earth_Observation_images the files used, depending on user choice, are:

gaelforget commented 1 year ago

For RF_classification_using_marida the files used are:

A couple of questions I have:

@EmanuelCastanho

EmanuelCastanho commented 1 year ago

Yes, Train_Test-Datasets is a subset created by me from the original MARIDA dataset. According to their MIT Licence and Creative Commons Attribution 4.0 International, I think it is fine to include this subset in the Dataverse.

Please include this citation to the original dataset, if possible: Kikaki K, Kakogeorgiou I, Mikeli P, Raitsos DE, Karantzalos K (2022) MARIDA: A benchmark for Marine Debris detection from Sentinel-2 remote sensing data. PLoS ONE 17(1): e0262247. https://doi.org/10.1371/journal.pone.0262247

Bands_Indices-S2 can be included in the Dataverse.

@gaelforget

gaelforget commented 1 year ago

> Yes, Train_Test-Datasets is a subset created by me from the original MARIDA dataset. According to their MIT Licence and Creative Commons Attribution 4.0 International, I think it is fine to include this subset in the Dataverse.
>
> Please include this citation to the original dataset, if possible: Kikaki K, Kakogeorgiou I, Mikeli P, Raitsos DE, Karantzalos K (2022) MARIDA: A benchmark for Marine Debris detection from Sentinel-2 remote sensing data. PLoS ONE 17(1): e0262247. https://doi.org/10.1371/journal.pone.0262247
>
> Bands_Indices-S2 can be included in the Dataverse.
>
> @gaelforget

Done. Thanks!

Dataverse: https://doi.org/10.7910/DVN/OYBLGK
Zenodo: https://doi.org/10.5281/zenodo.8113073

gaelforget commented 1 year ago

Automated data downloads have now been implemented for some notebooks:

#43 #44 #45
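For anyone following along, a generic sketch of what such a first-run download can look like (the datafile id below is a placeholder; the real ids are listed on the dataset page or in its JSON metadata):

```julia
using Downloads, Tar, CodecZlib

# Fetch a session archive from the Dataverse data access API and unpack it
# next to the notebook, but only if it is not already there.
url = "https://dataverse.harvard.edu/api/access/datafile/0000000"  # placeholder id
if !isdir("data")
    archive = Downloads.download(url)
    open(archive) do io
        Tar.extract(GzipDecompressorStream(io), "data")
    end
end
```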