MRtrix3 / containers

Developer tools for generating minified containers for MRtrix3

Resolving PRs #4 and #11 (#12)

Closed · Lestropie closed this 4 years ago

Lestropie commented 4 years ago

Given the complexity of the similarities and differences between my #4 and @kaczmarj's #11, I thought I'd create a new dedicated thread where we can disentangle them as best we can and figure out the right strategy for merge resolution. It might also make things cleaner for anyone else wanting to bring themselves up to date and take part in the conversation.

Admittedly I ended up doing a lot more work here than I'd intended, given @kaczmarj's prior offer to contribute: I'm eager to sort out comprehensive CI testing of MRtrix3 Python scripts (MRtrix3/mrtrix3#2143), and it's my intention for the solution here to be applicable there also. I've also ended up adding some other bits and bobs I needed along the way, which aren't strictly tied to the title of #4, but which I merged into that branch nevertheless to prevent divergence in my own work.

So here we go:

  1. Intermediate build:

    • #4 uploads a container with minified dependencies to DockerHub, to be used as the base for building the actual containers intended for use (the MRtrix3 Docker container for users, and the container used by GitHub Actions for CI testing of MRtrix3 Python scripts);

    • #11 generates multiple tarballs of the minified dependencies and hosts them online, to be downloaded directly during the construction of those containers.

    Overall here, I think I'm in favour of the first approach: the base container can follow its own version tagging, it's transparent which versions of the different software dependencies are embedded within it, and I think managing two containers on DockerHub should be easier long-term than managing one container plus external downloads hosted through some other means. (Both options are sketched just after this list.)

  2. Parallel build:

    I wasn't aware of this functionality (BuildKit building independent stages of a multi-stage Dockerfile in parallel) until now. It's not super urgent, given that the base container should only need to be rebuilt very irregularly, but if the capability is there and working it probably makes sense to make use of it. Having said that, I don't think that making use of it is predicated on adopting the totality of #11: it should be possible to patch that specific capability into the contents of #4?

  3. Dockerfile envvars:

    I think I prefer having the dependencies of the different software packages specified as individual environment variables, as in #4, just to partition them out and make them easier to manage over time.

  4. Extra dependencies:

    #4 adds ACPCdetect via #10, and FreeSurferColorLUT.txt via #7; #11 also includes the latter. Need to make sure that these are not lost during the resolution process.

  5. FSL installation:

    #4 installs FSL via the installation script; #11 does a direct tarball download & extraction. The "general recommendation" from FMRIB seems to be to use the installation script, but I don't know how necessary it is to follow that advice in the context of container construction.

    There's also some added trickery in master, which I carried over from MRtrix3_connectome, for cases where the Python components of FSL don't work after the installation script and a second installer needs to be run (per FMRIB instructions); I don't know whether / how installing via a tarball gets around this. More recently, though, I've found that that script branch is no longer necessary; I don't know whether it was changes to the FSL installer script or to FSL itself that rectified this.

  6. FSL configuration:

    #4 runs the FSL configuration script as part of the entrypoint, whereas #11 seems to avoid it, instead setting the requisite environment variables manually. The former feels a bit fragile to me: if someone overrides the entrypoint when running the container, for instance, the config script won't be run. So if the latter works, I'm happy to go with that approach. (See the ENV sketch after this list.)

  7. mrview capability:

    #11 additionally installs the dependencies for running GUI applications, and provides instructions for how to execute such applications using Docker. These should however be relatively easy to add as a standalone update. (An example invocation appears after this list.)
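
To make the contrast in Point 1 concrete, here's a minimal sketch of the two approaches; the image names, tags, and URL below are hypothetical placeholders rather than anything taken from either PR:

```dockerfile
# Approach of #4: start the final build from a minified base image
# hosted on DockerHub ("mrtrix3/base" and its tag are hypothetical).
FROM mrtrix3/base:20.04

# Approach of #11: start from a stock image and pull each minified
# dependency in as its own tarball (URL is a hypothetical placeholder).
FROM ubuntu:20.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates curl \
    && curl -fsSL https://example.org/fsl-minified.tar.gz | tar -xz -C /opt
```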

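Regarding Point 6, the entrypoint-free alternative amounts to baking the environment into the image at build time; a minimal sketch, assuming FSL is installed to /opt/fsl (these are the core variables that ${FSLDIR}/etc/fslconf/fsl.sh would otherwise set at runtime):

```dockerfile
# Bake the FSL environment into the image so that no configuration
# script needs to be sourced by an entrypoint at container startup.
ENV FSLDIR=/opt/fsl \
    FSLOUTPUTTYPE=NIFTI_GZ \
    PATH=/opt/fsl/bin:$PATH
```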

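And for Point 7, running mrview from a container on a Linux host typically comes down to forwarding the X11 socket; something along these lines (the image name is a hypothetical placeholder, and loosening xhost has security implications that any instructions should note):

```sh
# Permit local connections to the host X server, then run mrview
# with the host display forwarded into the container.
xhost +local:
docker run --rm \
    -e DISPLAY="$DISPLAY" \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    mrtrix3/mrtrix3 mrview
```
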
So reflecting on it all, I think the approach that I want to take is to merge #4, and then create individual PRs for the individual additions that #11 provides over and above #4 that we want to utilise, being:

  1. Parallelisation of base container build

  2. Getting mrview working

  3. Setting multiple envvars in a single Docker instruction to reduce the number of layers (see the example below this list)

  4. Possibly avoiding the FSL setup script at container entry
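
For item 3, the difference is just whether each variable gets its own Dockerfile instruction or they all share one; the variable values here are hypothetical:

```dockerfile
# One instruction (and hence one image layer) per variable...
ENV ANTSPATH=/opt/ants/bin
ENV ARTHOME=/opt/art

# ...versus a single instruction covering all of them:
ENV ANTSPATH=/opt/ants/bin \
    ARTHOME=/opt/art
```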

In parallel to this, I'm continuing to work on MRtrix3/mrtrix3#2143, but I'll be able to verify my progress on that a bit better if it's agreed to use the approach of a base container on DockerHub.

But all of the above is open to alternative proposals. While I've used Docker a bit, it's not an area of expertise for me, so if there's good justification for doing anything differently I'm happy to hear it.

Lestropie commented 4 years ago

So I've been experimenting with the prospect of going the other way around: merging #11, and then making the changes I wish to keep from #4 on top of that.

The main one for me was Point 1 (intermediate container vs. tarballs). After attempting to modify #11 to do the former, I've managed to convince myself to stick with the latter. The distinction between build requirements and runtime requirements is clear, which means less cleanup at the start of the second build stage, and it means that updates to just one software dependency can be reflected in an update of just the tarball of that dependency. As long as there's some discipline around ensuring that file names match the external software versions, it should be fine.
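
As a sketch of what that naming discipline might look like on the Dockerfile side (the URL scheme and filename are hypothetical placeholders for wherever the OSF project ends up hosting things, and curl is assumed to be available in the build stage):

```dockerfile
# The dependency version is encoded in the tarball filename, so bumping
# one dependency means uploading and referencing one new file.
ARG FSL_VERSION=6.0.4
RUN curl -fsSL "https://example.org/fsl-${FSL_VERSION}-minified.tar.gz" \
    | tar -xz -C /opt
```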

Rather than a Dropbox account, I've made an OSF project, so that the purpose and contents are transparent. @kaczmarj: If you can confirm that this is indeed your OSF account, I'll add you.

So the new plan is to do minor tweaks to #11 (mostly to use the final tarball paths), merge, and then I'll make separate PRs for other stuff that got bundled into #4.


For Point 2: when initially mucking around with BuildKit, building the full container sequentially was actually faster for me: the time saved through parallelisation was more than offset by the time taken to merge the contents of /opt/fsl across layers (though admittedly this was on a RAID1). Bumping MAKE_JOBS from 2 to 4 on an 8-thread system made it fractionally faster than the sequential build. It's nevertheless a nice framework purely for the modularity it brings to the build from a conceptual perspective.
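
For anyone else trying this: BuildKit is enabled client-side via an environment variable, after which stages of a multi-stage Dockerfile that don't depend on one another are built concurrently. Assuming MAKE_JOBS is exposed as a build argument of the Dockerfile, the invocation is roughly:

```sh
# Enable BuildKit so that independent build stages run in parallel;
# MAKE_JOBS (assumed to be an ARG in the Dockerfile) sets the
# make -j parallelism within each stage.
DOCKER_BUILDKIT=1 docker build --build-arg MAKE_JOBS=4 -t mrtrix3 .
```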

kaczmarj commented 4 years ago

> Rather than a Dropbox account, I've made an OSF project, so that the purpose and contents are transparent. @kaczmarj: If you can confirm that this is indeed your OSF account, I'll add you.

Yes, I confirm this is me. Thank you for asking.

And thank you for going through all of the work of trying out these various options. I didn't originally consider your idea of having a minified container serve as the base for other images, but as you wrote, I think having minified dependencies exist as separate tarballs online results in a cleaner, more modular, and more maintainable codebase. Thank you also for the updates to #11.

Lestropie commented 4 years ago

Closed through combination of #11 and #13.

It was definitely informative for me to go through the process of implementing and experimenting with both approaches in order to encounter the differences rather than hypothesising about them. Your choices made more sense the longer I spent on it. Cheers for your contributions here! :+1: