cytomining / pycytominer

Python package for processing image-based profiling data
https://pycytominer.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Docker container #214

Closed: bethac07 closed this issue 5 months ago

bethac07 commented 2 years ago

I made a proof-of-principle one here for my own purposes, but if you want, it would be less than an hour or two of work to set this up as a GH Action inside this repo so that tag:latest is built and pushed every time there is a commit to master and tag:version is pushed with every new official pycytominer version. You may also decide you want the Python version more (or less!) pinned, or even matrixed.

Probably also makes sense to make a cytomining Docker org, if Docker images are a thing you want.
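
For concreteness, a minimal sketch of the workflow Beth describes, assuming a cytomining/pycytominer Docker Hub repository, DOCKERHUB_USERNAME/DOCKERHUB_TOKEN repository secrets, and a Dockerfile at the repo root (none of which exist in this repo yet):

```yaml
# .github/workflows/docker-publish.yml (hypothetical)
name: docker-publish

on:
  push:
    branches: [master]   # builds and pushes tag:latest
  release:
    types: [published]   # builds and pushes tag:<version>

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      # tag:latest tracks the default branch; tag:<version> tracks releases
      - id: meta
        uses: docker/metadata-action@v5
        with:
          images: cytomining/pycytominer
          tags: |
            type=raw,value=latest,enable={{is_default_branch}}
            type=semver,pattern={{version}}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
```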

d33bs commented 10 months ago

Related notes and thoughts here: cytomining/pycytominer-Docker#2

d33bs commented 9 months ago

A quick update here: I'm going to begin working towards addressing this issue through the creation of a Cytomining organization on Docker Hub (as @bethac07 originally suggested). We can seek support through the Docker-Sponsored Open Source (DSOS) program, which will help avoid the monetary costs normally associated with Docker Hub organizations. Creating this organization would be a first step here, providing a "place" for Cytomining images to be uploaded and used by others. Once we have an image registry location in place, we could add a Dockerfile and related code within this repository to create a Cytomining/Pycytominer image there.

Please don't hesitate to let me know if you have any questions / concerns!

CC @gwaybio, @kenibrewer

kenibrewer commented 9 months ago

@d33bs Another option that we might consider is using the biocontainers organization and tooling. There are some advantages to that route beyond just the fact that we wouldn't need to stand up our own organization. AWS, for example, caches all the biocontainer images locally within its Elastic Container Registry to provide fast access to those images for people running workflows on AWS.

d33bs commented 9 months ago

Thanks @kenibrewer! I took a look at Biocontainers and thought it looked promising. As a general advantage, the Biocontainers path might be more directly recognized by the bioinformatics community, and faster timeline-wise in terms of finding a place for a Pycytominer image to land (rather than waiting for sponsorship/approval from Docker, which is an unknown). I'm generally unsure of where Pycytominer audiences obtain or implement their containerized work, so I would trust you or others for guidance on the most valuable way to share Cytomining images.

Double checking: would it be beneficial to consider both Biocontainers and a Docker Hub organization as image registry locations for Pycytominer? Or should we opt for one / the other to help consolidate and reduce sprawl? I noticed that CellProfiler has taken a multi-pronged approach through the biocontainers/cellprofiler and cellprofiler/cellprofiler locations (would welcome any thoughts you have on this too @bethac07)!

Brainstorming: if we do decide to go with the Biocontainers approach, I wonder if there's a way we can organize these under the Cytomining umbrella somehow, such that when one searches for "Cytomining", all projects under the organization that are available as images would appear.

bethac07 commented 9 months ago

Hey folks, sorry, it's been a hectic EOY.

I wasn't involved in the original decision to set up our own Docker images for CellProfiler. My recollection (it was 2016 or 2017) is that it had to do with "we need these right now for DCP". I'm not sure whether there was already a biocontainer for it at that point or not, or at what point they decided to make one (but it looks like the only tag in there is about a year old; why they don't just use ours, 🤷‍♀️). We don't control the biocontainers version in any way, though if they wanted to work together, I'm sure we'd be happy to. I'm all for less sprawl, though I don't always love the idea of us not controlling our own distribution as we like (I have zero idea what the biocontainers process is; it's on the to-do list, but several dozen entries down).

FWIW, I was worried getting CellProfiler set up as a Docker org in the new system would be painful, but we were formally approved about an hour after we initially got the notice that we would need to apply as an OSS project (based on timestamps of me bitching about it in Slack), so the application-plus-waiting-to-find-out time was indeed short. So yes, I'd likely try to do it. It doesn't HURT to squat on the name and keep someone else from taking it, if for no other reason than that.

d33bs commented 9 months ago

Thanks @bethac07 - you bring up some great points! Based on these, I'm thinking Pycytominer could benefit from being deployed through both Biocontainers and a sponsored Docker org; I've diagrammed the why/how below for clarity and possible eventual documentation additions:

```mermaid
---
title: Pycytominer Container Image Distributions
---
flowchart LR
  pycytominer["Pycytominer\nrelease"]
  subgraph containers["Container Images"]
    docker_hub["Docker Hub:\ncytomining/pycytominer"]
    subgraph biocontainers
      direction LR
      biocontainers_docker["Docker Hub:\nbiocontainers/pycytominer"]
      biocontainers_quay["Quay:\nbiocontainers/pycytominer"]
    end
  end

  pycytominer --> |automated release| docker_hub
  pycytominer --> |manual release| biocontainers
```

```mermaid
---
title: Pycytominer Distributions
---
flowchart LR
  pycytominer["Pycytominer\nrelease"]
  subgraph distributions["Distributions"]
    subgraph automated_release["Automated releases"]
      direction LR
      docker_hub["Docker Hub:\ncytomining/pycytominer"]
      pypi["PyPI:\npycytominer"]
    end
    subgraph manual_release["Manual releases"]
      direction LR
      conda_forge["Conda forge:\npycytominer"]
      subgraph Biocontainers["Biocontainers"]
        direction LR
        biocontainers_docker["Docker Hub:\nbiocontainers/pycytominer"]
        biocontainers_quay["Quay:\nbiocontainers/pycytominer"]
      end
    end
  end

  pycytominer --> |automated release| automated_release
  pycytominer --> |manual release| manual_release
```

kenibrewer commented 8 months ago

@d33bs Your plan looks really good to me. You bring up excellent points about the value of preventing typosquatting and having a faster automated release process.

kenibrewer commented 8 months ago

One thing that we'll want to be sure to include in our process is making sure that we have Dependabot alerts turned on for the Docker builds, and that we have a GitHub Action set up to re-run the build process on a weekly (?) basis.

Our containers aren't intended to be used in any scenario with a public IP address, but I'd like to keep practicing good security hygiene with our distributions regardless.

d33bs commented 8 months ago

Thanks @kenibrewer! Regarding re-running the build process on a scheduled basis: I can see benefits in keeping the container image up to date and practicing good DevSecOps. It might also add complexity, especially if we deviate the process from other releases (for example, do we overwrite the container image releases or keep them consistent for stable use?). Thinking about this more made me wonder how Docker container image builds might parallel what's performed through PyPI package installations. If we set a standard for Docker container image builds, would it then follow that we should do the same for PyPI builds/releases (setting a regular schedule for PyPI releases)? If we plan not to do this for PyPI, should we stick with the same for Docker container images?

kenibrewer commented 8 months ago

@d33bs A bit of helpful context. For version control, version tags are generally supposed to be immutable: if you request v1.2.3 of a piece of software from GitHub, you always get the same source code. Technically, repo owners have the ability to delete/update tags, but they're not supposed to.

Docker image tags, however, are generally presumed to be mutable. You get a pinned version of the software and a patched OS to run it securely. So if you're being extra cautious about reproducibility, you often pin your pipeline to a particular image hash instead of the v1.2.3 version tag.

Because we're operating in the scientific software context, there could definitely be an argument for only building our Docker images once and never updating the tags. However, I'm more in favor of the mutable-tags approach for Docker images because 1) we've got a robust testing suite and 2) people can still request particular hashes if they want to be extra careful.
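
To make the hash-pinning idea concrete, a minimal sketch of pinning a CI job to an image digest rather than a tag; the image name and digest below are placeholders, not a real published image:

```yaml
jobs:
  profile:
    runs-on: ubuntu-latest
    container:
      # A tag like :1.2.3 can be rebuilt, but a @sha256 digest is immutable;
      # the digest below is an illustrative placeholder, not a real hash.
      image: cytomining/pycytominer@sha256:0000000000000000000000000000000000000000000000000000000000000000
    steps:
      - run: python -c "import pycytominer; print(pycytominer.__version__)"
```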

bethac07 commented 8 months ago

@kenibrewer I think the major thing to think about here is what we're proposing to do re: security/package updates.

> Docker image tags, however, are generally presumed to be mutable. You get a pinned version of the software and a patched OS to run it securely

I don't know that I necessarily agree with this. If a Docker container is a unit of reproducibility, which is certainly an idea we want to push (as you point out later), I would say we want to "lock" that container and not build it again; maybe not in a strictly-PyPI sense of "once you push a tag, you can't fix it even one minute later", but within, say, a few hours. In theory, yes, when CVEs are discovered it's good to rebuild, but how far back? One version, two, etc.? We still see people recommending 10-year-old versions of CellProfiler (really), so rebuilding everything is not a trivial idea.

I think I'd propose this:

What do folks think?

d33bs commented 8 months ago

Thanks @kenibrewer and @bethac07 for your thoughts here! Mostly my concern is for reproducibility: that if one uses a specific image one day, it will still be usable the next day (without "yanking the floor from underneath them"). I didn't realize one could reference Docker Hub image hashes directly and that these are retained indefinitely. This seems like a good strategy to pair with rewriting the tagged Docker deployments. A side benefit is that images pulled this way appear to retain their created-on datetime stamp. A possible risk of depending on this is that I couldn't find a way to list the available hashes for a particular image:tag on Docker Hub, meaning that if a hash is never recorded somewhere, we might lose the reference forever. I don't currently know of a way around this; is there, for example, a git log-style Docker command/API endpoint?

I like the idea of a :{commithash} Docker image per PR merge, as this would open up avenues for quick iteration. That said, I think :latest should be considered "the latest stable full release that aligns with PyPI". Thoughts?
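
If that scheme were adopted, the metadata-action tags block from the earlier workflow sketch could express it roughly as follows (an assumption about future config, not settled project behavior):

```yaml
          tags: |
            type=sha,format=long
            type=semver,pattern={{version}}
            type=raw,value=latest,enable=${{ github.event_name == 'release' }}
```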

Beth brought up a good point about when to stop updating previous releases. My personal thought here is that we should focus on forward-facing updates where we can, relying on the dynamic semver capabilities now available to pycytominer, and that part of the user's responsibility will be to keep up to date with the versions they require. Schedule-wise, I think once a week for re-releasing the latest version would make sense (Dependabot could help here too). The rebuild/push could be a GitHub Actions job that can also be manually triggered should critical CVEs arise.

kenibrewer commented 8 months ago

I really like @bethac07's proposal in its entirety. I think it allows us to serve the needs of both reproducibility- and security-focused users very elegantly.

@d33bs I think Dependabot will work well for alerting us to CVEs in our images. For the regularly scheduled builds, we can use the schedule GitHub Actions trigger.
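
For reference, the schedule trigger looks roughly like this; the weekly cadence matches the discussion above, but the exact cron line is a placeholder:

```yaml
on:
  schedule:
    - cron: "0 6 * * 1"  # weekly rebuild, Mondays 06:00 UTC (cadence TBD)
  workflow_dispatch:     # allows manual re-runs, e.g. for critical CVEs
```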

bethac07 commented 8 months ago

Thanks @kenibrewer :)

@d33bs Agree that a :latest is a good idea. I think I'd be tempted to have it be the absolute latest build (and if we want a latest release, make a :latest_release), but I don't feel incredibly strongly about the matter. I guess it would depend a bit on how frequent we think the release schedule will be: if it's substantially longer than "every 5-8ish PRs", I'd argue more strongly for an absolute latest (but even then, I don't much care).

This was the least-worst answer I could find for getting all tags with a quick SO search, but there may be something more elegant.
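
On the earlier question of a git log-style way to enumerate hashes: the (unofficial) Docker Hub v2 API lists tags along with per-architecture digests, which a job could snapshot for provenance. A hedged sketch as a workflow step; the response field names are an assumption based on the current API shape:

```yaml
      - name: Snapshot tag-to-digest mappings
        run: |
          # Page through /tags for more than 100 entries; the jq field names
          # (.results[].name, .results[].images[].digest) are assumptions.
          curl -s "https://hub.docker.com/v2/repositories/cytomining/pycytominer/tags?page_size=100" \
            | jq -r '.results[] | [.name, (.images[0].digest // "n/a")] | @tsv'
```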

gwaybio commented 7 months ago

Providing an update on the DSOS program, to which I applied last month. See email below:

> Hello,
>
> Your application was rejected for lack of documentation on your hub organization, https://hub.docker.com/u/cytomining. The below specifications will have to be met. Your project repositories on Docker Hub must have documentation that meets the recommended community standards. We recommend a detailed project description on your Docker Hub pages that includes a link to your project in its respective source code repository and contributing guidelines.
>
> Consider the following repository overview best practices:
>
> - Describe what the image is, the features it offers, and why it should be used; you can include examples of usage or the team behind the project.
> - Explain how to get started with running a container using the image; you can include a minimal example of how to use the image in a Dockerfile.
> - List the key image variants and tags to use them, as well as use cases for the variants.
> - Link to documentation or support sites, communities, or mailing lists for additional resources.
> - Provide contact information for the image maintainers.
> - Include the license for the image and where to find more details if needed.
>
> You will need to reapply after those changes to your Docker Hub account are met.
>
> Regards, Docker Support

@d33bs - do you have access to the organization? We should start completing these items and I will reapply once complete. Thanks!

gwaybio commented 7 months ago

Another followup email:

> Hi Gregory Way,
>
> During our review of your application for Cytomining, we determined that while your project meets most of the program requirements, there is a lack of documentation in one or more of your repositories on Docker Hub. As stated in the Qualification Criteria section of our webpage, we encourage the authors of open-source projects to include documentation that meets the recommended community standards. This means a detailed project description on your Docker Hub repository pages that includes a link to your project source code, licensing information, and a general overview. Projects lacking this information might not receive the Docker Sponsored Open Source badge for their images on Docker Hub.
>
> If you would still like to be a part of the Docker-Sponsored Open Source program, we invite you to re-submit an application once you have updated the documentation for your project. If you feel we have made a mistake in our review of your application, please feel free to contact us at opensource@docker.com.
>
> Thank you!

d33bs commented 7 months ago

Thanks @gwaybio! It sounds like Docker would like there to be at least one image release under the Cytomining org on Docker Hub that includes all the required items in order to achieve approval. That being said, maybe we can continue toward completing #362 as a first step, following up with the necessary documentation (perhaps implemented through something like https://github.com/peter-evans/dockerhub-description).
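
A minimal sketch of the peter-evans/dockerhub-description step d33bs links, which syncs a repository README to the Docker Hub overview; the version pin and secret names are assumptions:

```yaml
      - uses: peter-evans/dockerhub-description@v4
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
          repository: cytomining/pycytominer
          readme-filepath: ./README.md
          short-description: "Python package for processing image-based profiling data"
```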