ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

Updates from CRN sprint and some questions #6

Open satra opened 8 years ago

satra commented 8 years ago

For the CRN sprint at Stanford, people worked on packaging applications using Docker: https://github.com/BIDS-Apps. I was working on nipypelines, basically to take our existing production workflows and make them available as apps. As part of this exercise I took our resting-state workflow, which requires ANTS, FreeSurfer, FSL, nipy, nipype, and a couple of mindboggle files. It turned out the Docker image for this was large, so I started looking into running reprozip inside a Docker container to run the application script and capture the needed files, which can then be reproduced in a different Docker container.

For repeatability, reprozip is great. However, it also captures things that are not necessary for redistributing the application/script.

So I went through the exercise of minimizing my application footprint (deleting input files). Uncompressed, my filesystem footprint was 444M.

So I can put this into a Docker image (like philcryer/min-wheezy), test it, squash it, and then upload it.

Basic workflow:

  1. docker pull some image
  2. start it up and install any necessary packages
  3. install reprozip
  4. install your application/service
  5. make the entry point for the app run under reprozip trace or not, depending on a flag (see the sketch below)
  6. run the application on as many data-analysis pathways as you like
  7. with reprounzip, unpack and remove the data files that are not required (I could have done this with a better reprozip config, ignoring certain locations, which is what I will do in my next run)
  8. repackage the application onto a minimal Docker image and distribute it
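
To make the flag-dependent entry point concrete, here is a minimal sketch in Python, assuming a hypothetical run_workflow.py as the application command (reprozip trace takes --continue to append several runs to the same trace):

#!/usr/bin/env python3
# Hypothetical entry-point wrapper: run the app directly, or under
# "reprozip trace" when --trace is passed. run_workflow.py is a placeholder,
# not the actual nipypelines invocation.
import subprocess
import sys

def main(argv):
    trace = "--trace" in argv
    app_args = [a for a in argv if a != "--trace"]
    cmd = ["python", "run_workflow.py"] + app_args   # placeholder app command
    if trace:
        # --continue appends to an existing trace, so several data-analysis
        # pathways can be captured into a single pack
        cmd = ["reprozip", "trace", "--continue"] + cmd
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))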

Questions:

  1. This brings up the question of how we repackage this when software and data change. This is where I think the meta-model can be useful, but it needs to capture a few things: where a particular library, binary, or file came from, and how to reproduce that entity. Can we formalize the above approach in a script that can regenerate the process automatically?
  2. Based on the simple_workflow exercise that @dnkennedy and I have been doing, we know that containers/software will need to be created across different OSes and tested across different environments. The one issue I have with the containerization philosophy is that it takes numerical-computing variability out of the picture; unless the underlying software is really well tested (which it is not), we only solve a superficial problem (repeatability), not the deeper problem of generalizability (across data, software, environments).
  3. This separation, with containers as Apps for CRN versus ReproNim providing not just apps but also validation of those apps across platforms, may be a nice way for the two projects to collaborate.

Perhaps we can discuss this at the hangout tomorrow (I'm happy to join).

jbpoline commented 8 years ago

Hey,

One thing I am wondering about: say someone uses one of these app(lication)s, can you trace what was installed and easily investigate what is in the installed package? If I have a Docker container, I can list the installed packages and look at their versions; can you still do this after reprozip?
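
A minimal sketch of the kind of inspection being asked about, assuming the config.yml written by reprozip trace lists detected packages as name/version entries and keeps unattributed paths under other_files (the exact layout may differ across reprozip versions):

# Minimal sketch: list what reprozip attributed to distribution packages.
import yaml  # pip install pyyaml

with open("config.yml") as f:        # written by "reprozip trace"
    cfg = yaml.safe_load(f)

for pkg in cfg.get("packages") or []:
    print(pkg.get("name"), pkg.get("version"), sep="\t")

print(len(cfg.get("other_files") or []), "files not attributed to any package")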

Talk soon! (on Monday, maybe not tomorrow :) JB


satra commented 8 years ago

@jbpoline - that's the intent behind question 1 for discussion: where a particular library, binary, or file came from and how to reproduce that entity. There are certain approaches, but the complexity of doing this is in the details of the framework.

Ryan (from TACC) was mentioning a company (I think Black Duck) that can link the hash of any file to the source tree when available (that technology is also likely proprietary).

Yes, Monday indeed - my brain still doesn't know what time zone it is in!

yarikoptic commented 8 years ago

Great observations and summary, thanks @satra. That would indeed be worth discussing at the hangout. Regarding

how to repackage this when software and data change.

if the content comes from Debian packages, which reprozip tracks and versions, we can just apt-get install --reinstall pkg[=desired_version] and then, after such an upgrade, rerun the computation(s) while tracing them again.

where a particular library or binary or file came from

Indeed, every file should have such provenance information. It is easy to obtain when Debian packages are the source. If we establish a layered specification of multiple possible distributions (as we discussed on some Google doc ;) debian -> conda -> pip) and identify them per "package", it should be achievable. E.g., we could elaborate on top of the reprozip format to first describe those distributions in greater detail (rather than just by name), and then for each package describe its origin:

distributions:
  - name: debian-1
    origin: debian
    suite: jessie
    date: Fri, 05 Aug 2016 14:07:52 UTC
    components: main non-free contrib
    architectures: amd64, i386
    # ... maybe more, like apt servers linked to priorities, additional repos with updates, etc.
  - name: neurodebian-1
    origin: NeuroDebian
    suite: jessie
    date: Fri, 05 Aug 2016 14:07:52 UTC
    components: main non-free contrib
    architectures: amd64
  - name: neurodebian-2
    origin: NeuroDebian
    suite: data
    date: Fri, 05 Aug 2016 14:07:52 UTC
    components: main non-free contrib
    parent: debian-1
  - name: conda-1
    parent: debian-1
    # ... whatever is relevant to describe

packages:
  - name: "fsl-5.0-core"
    version: "5.0.9-2~nd12.04+1"
    distribution: neurodebian-1
    # ... maybe more, such as apt-priority: 500
    # ... all the file listings if desired, from reprozip, etc.
  - name: fsl-mni152-templates
    version: "5.0.7-2"
    distribution: neurodebian-2
  - name: networkx
    version: "1.11"
    distribution: conda-1

Then we should have enough information to, if needed, first reconstruct the environment from ground zero, relying on the original distribution and the date when the trace was taken (from snapshot repositories, or on a distribution's ability, as with pip and probably conda as well, to provide multiple versions of the same package). But we could then allow overrides, e.g. by providing a complementary specification such as

packages:
  - name: "fsl-5.0-core"
    version:

which would reset the version to be undefined so that the most recent available one (or some other specified one) is taken... or even change the release/suite to generate a similar environment based on another Debian release:

distributions:
  - name: debian-1
    origin: debian
    suite: stretch
    date:

(We probably need to think about providing global overrides somehow, e.g. in this case to reset all package versions to be undefined.)
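
To sketch how such an override could be applied mechanically (in Python, with hypothetical file names environment.yml and override.yml, and the spec layout proposed above rather than any existing reprozip format):

# Rough sketch: overlay a partial override spec onto the base spec traced
# from the original environment, then emit apt-get lines for the packages
# that come from Debian-style distributions.
import copy
import yaml

def merge_specs(base, override):
    """Overlay override entries onto base, matching items by name."""
    merged = copy.deepcopy(base)
    for section in ("distributions", "packages"):
        items = merged.setdefault(section, [])
        index = {item["name"]: item for item in items}
        for item in override.get(section, []):
            if item["name"] in index:
                index[item["name"]].update(item)  # e.g. reset "version" to null
            else:
                items.append(item)
    return merged

def apt_lines(spec):
    """Yield install commands; an empty version means 'most recent available'."""
    debianish = {d["name"] for d in spec.get("distributions", [])
                 if str(d.get("origin", "")).lower() in ("debian", "neurodebian")}
    for pkg in spec.get("packages", []):
        if pkg.get("distribution") in debianish:
            version = pkg.get("version")
            pin = "=" + version if version else ""
            yield "apt-get install --reinstall -y " + pkg["name"] + pin

with open("environment.yml") as f:   # hypothetical file names
    base = yaml.safe_load(f)
with open("override.yml") as f:
    override = yaml.safe_load(f)

for line in apt_lines(merge_specs(base, override)):
    print(line)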

yarikoptic commented 8 years ago

@satra : "ryan (from TACC) was mentioning a company (i think blackduck), which can link the hash of any file to the source tree when available". FWIW, I hope to get the same done for data(set)s we cover by DataLad: https://github.com/datalad/datalad/pull/436 Work is ongoing. How useful though association of source files to source tree? in our case we need 'binaries'. In case of source tree, I think at least for sources present in Debian, we could use the DB behind http://sources.debian.net for which I believe they do collect md5sums of all the files. But, probably what is needed really not sources, but binaries. Not sure if we have it at that level of detail in Debian anywhere right away.