VIDA-NYU / reprozip

ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.
https://www.reprozip.org/
BSD 3-Clause "New" or "Revised" License

Script download and install of data sets #221

Open bmcfee opened 8 years ago

bmcfee commented 8 years ago

It would be useful to have a way in the config file to automate downloading data sets or other resources without including them in the rpz file.

remram44 commented 8 years ago

Should this be in the RPZ file or in a "wrapper" that indicates how to combine RPZ file and data files? The same RPZ file and associated data might live in different storages and each of them might want to link to their own copy of the data.

bmcfee commented 8 years ago

I don't have strong opinions about this, but here's one way you might think about it.

Python's setuptools lets you specify optional "extra" dependencies through the extras_require directive. This is often used for things not core to the package's functionality, such as testing or documentation-building, where you might still want to have explicit dependencies in place. For example, I have the following in one of my setup.py scripts:

    extras_require={
        'docs': ['numpydoc', 'sphinx!=1.3.1', 'sphinx_rtd_theme',
                 'matplotlib >= 1.5'],
        'numba': ['numba >= 0.25'],
        'display': ['matplotlib >= 1.5'],
    }

You then install them by saying pip install packagename[docs,numba] (or whatever you want to name them).

I'm not sure how this would play out in RPZ land. I could imagine a simple interface where you can install a bare-bones rpz using the normal install procedure, but if you want the heavy-weight optional dependencies (e.g., datasets hosted on S3 or something), you can install those with an extra flag, like pip does. These would be treated as external dependencies and not bundled within the rpz, so you'd need some part of the config that specifies how to collect them.
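
To make that concrete, here is a rough sketch of what named groups of external dependencies could look like, by analogy with extras_require. Everything in it is hypothetical: ReproZip has no such config section, and EXTERNAL_DEPENDENCIES, select_external, the group names, and the example.org URLs are made up for illustration.

```python
# Hypothetical sketch: none of these names exist in ReproZip.
# Each group maps to files that are NOT bundled in the rpz and must be fetched.
EXTERNAL_DEPENDENCIES = {
    "data": [
        {"path": "data/train.h5", "url": "https://example.org/train.h5"},
    ],
    "models": [
        {"path": "models/baseline.pkl", "url": "https://example.org/baseline.pkl"},
    ],
}


def select_external(requested, declared=EXTERNAL_DEPENDENCIES):
    """Return the entries for the requested groups, in the order requested."""
    unknown = [name for name in requested if name not in declared]
    if unknown:
        raise KeyError("unknown external group(s): " + ", ".join(unknown))
    selected = []
    for name in requested:
        selected.extend(declared[name])
    return selected
```

Requesting no groups yields an empty list, which corresponds to the bare-bones install; asking for ["data", "models"] would be the equivalent of pip's packagename[data,models].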

remram44 commented 8 years ago

I actually have a reprounzip[all] for plugins 😉

The difference here is that setup.py lists dependencies by name and not location, and that is my issue here. Optionally the RPZ file could identify these missing input files by hash, but putting the location into the RPZ package (more or less meant to be immutable) can be discussed.
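
Identifying a missing input by hash rather than by location could be checked roughly as below. This is only a sketch: it assumes SHA-256 as the hash, and sha256_of and matches_expected are hypothetical helper names, not part of ReproZip.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_sha256):
    """True if the file at `path` is the input recorded in the package.

    Where the file came from does not matter, only its content.
    """
    return sha256_of(path) == expected_sha256.lower()
```

With this approach the package stays immutable, and any storage can supply its own copy of the data as long as the content hash matches.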

@VickySteeves comments?

fchirigati commented 8 years ago

@bmcfee I like the idea of adding "external dependencies" to ReproZip, and this is particularly useful for big datasets that do not need to be packed. I think the specification of these datasets/resources could come in the RPZ file (whoever packed the application informs ReproZip how to obtain these datasets/resources), but it should also be possible for the user, while unpacking, to associate these resources with their own copy of the data. The unpackers could then automatically download the datasets while setting up the environment, if users choose to do so via a flag.
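
The precedence described here (the user's own copy first, then the packer's location if downloading is enabled) could be sketched as follows. The function plan_inputs and its parameters are hypothetical and do not correspond to any existing reprounzip option.

```python
def plan_inputs(packed_sources, user_overrides=None, download=False):
    """Decide, per external input, where it should come from.

    packed_sources: name -> URL recorded by whoever packed the experiment.
    user_overrides: name -> local path supplied while unpacking (optional).
    download: whether the user opted in to automatic downloads.
    """
    user_overrides = user_overrides or {}
    plan = {}
    for name, url in packed_sources.items():
        if name in user_overrides:
            # The user's own copy always wins over the packed location.
            plan[name] = ("local", user_overrides[name])
        elif download:
            plan[name] = ("download", url)
        else:
            # Nothing is fetched unless the user asked for it.
            plan[name] = ("missing", None)
    return plan
```

An unpacker could then report the "missing" entries to the user instead of failing silently when the experiment runs.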

remram44 commented 7 years ago

Also related: #220

nuest commented 4 years ago

Have you discussed the definition of remote file resources further?

ResearchObjects do this, AFAIK, but I wonder if there is a useful shared "specification" here, e.g. a YAML file that tells a platform "I need access to these remote sources", and then the platform can say if it can manage that (in a performant way). IMO this would be useful for o2r's reproducibility service and Binder's, too. I've thought about this because with remote sensing data, you won't just put 1PB of data into a research compendium (see also this poster).
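
As a rough illustration of that "I need these remote sources / can the platform manage them" exchange: the sketch below uses JSON rather than YAML only so that it runs with the standard library. The manifest layout, its field names, and unsupported_sources are all assumptions, not an existing specification.

```python
import json
from urllib.parse import urlsplit

# Hypothetical manifest shipped alongside a compendium or package.
MANIFEST = json.loads("""
{
  "remote_sources": [
    {"url": "https://example.org/scene-001.tif", "size_bytes": 2000000000},
    {"url": "s3://example-bucket/archive/", "size_bytes": 1000000000000000}
  ]
}
""")


def unsupported_sources(manifest, schemes, max_bytes):
    """Return the URLs a platform cannot serve, given its schemes and size limit."""
    rejected = []
    for source in manifest["remote_sources"]:
        scheme = urlsplit(source["url"]).scheme
        if scheme not in schemes or source["size_bytes"] > max_bytes:
            rejected.append(source["url"])
    return rejected
```

A platform that only speaks https and caps inputs at 10 GB would reject the 1 PB S3 entry here and accept the other one.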

If you're interested in a discussion, I can dig up more. IIRC this was also discussed in the Open Science Infrastructure Working Group calls.