DEpt-metagenom / containerized_genomics_tools

This repository hosts the files necessary to build and test containers for genomics tools
1 stars 2 forks source link

Fastp added #3

Open gpetho opened 1 month ago

gpetho commented 1 month ago

I have added the fastp tool to the repository.

@laczkol @zradai Please check whether this is organised according to what you have requested apart from the fact that the .sif file has not been moved to the ZFS directory yet. The repository has a Dockerfile, an Apptainer .def file, a Makefile that pulls, builds and pushes the Docker image to and from the institute Docker Hub, contains a build_apptainer goal to build the Apptainer image obviously, and a test goal to test both images by running the tool on an input file and comparing the output to the reference output included in the repository.

Is there anything missing apart from a readme, which will be added later? If it is correct like this, the same pattern will be followed for all other tools.

laczkol commented 1 month ago

I could pull and run the container with both Docker and Apptainer without any problems. The repository also contains the .sif file, which is unusual. I do not really understand the purpose of Apptainer.def. The already existing image could simply be pulled from DockerHub without needing this definition file and without sudo permissions. I would change apptainer build $(SIF_FILE) $(APPTAINER_DEF) to apptainer pull $(SIF_FILE) $(DOCKERHUB) with DOCKERHUB set to docker://deptmetagenom/fastp:0.23.4 in the Makefile. It could also be also convenient later to set the latest tag in DockerHub so that the latest image is always pulled by default and the tag (version) does not have to be set when the image is pulled with Apptainer. At the moment, of course, this is not really a problem as there is only one version in the DockerHub repository, but it might be a practice we should take up. When running the container shell, fastp is not available in the $PATH, which could be a minor inconvenience. The entry point is set correctly, so we should remember to use fastp with this container by running docker run fastp or apptainer run fastp.sif, or start fastp by specifying its path when using apptainer exec (i.e. apptainer exec fastp.sif /app/fastp). I am describing this because although setting the entry point is the correct way, it is not always followed and fastp not being available in the$PATH could lead to some confusion. I am just asking out of curiosity: is there a particular reason why you used Alpine as the base image? Are there any advantages to using this base aside from it being very lightweight?

gpetho commented 1 month ago

Thank you for reviewing this. So basically there is nothing else needed, we can proceed

The repository also contains the .sif file, which is unusual.

Yes, well, it's small, so it doesn't hurt to put it in the repository. I'm keeping it there and will be putting the .sif files in the other repositories as well as long as they are so small.

I do not really understand the purpose of Apptainer.def. The already existing image could simply be pulled from DockerHub without needing this definition file and without sudo permissions. I would change apptainer build $(SIF_FILE) $(APPTAINER_DEF) to apptainer pull $(SIF_FILE) $(DOCKERHUB) with DOCKERHUB set to docker://deptmetagenom/fastp:0.23.4 in the Makefile.

ChatGPT told me to do it like this, and it worked, so I saw no reason to change it. The reason to use a .def file might have something to do with the fact that the entrypoint (%runscript) is set in it. Maybe it would work like this if I just pulled the Docker image with apptainer, and it would simply adopt the Docker image's entrypoint, I haven't tried.

When running the container shell, fastp is not available in the $PATH, which could be a minor inconvenience. The entry point is set correctly, so we should remember to use fastp with this container by running docker run fastp or apptainer run fastp.sif, or start fastp by specifying its path when using apptainer exec (i.e. apptainer exec fastp.sif /app/fastp). I am describing this because although setting the entry point is the correct way, it is not always followed and fastp not being available in the$PATH could lead to some confusion.

As I said, I will be adding a readme which will explain exactly this, i.e. that the Docker and Apptainer containers need to be run like this, furthermore adding information on mounting the host directory that contains the input and where the output will be saved.

I am just asking out of curiosity: is there a particular reason why you used Alpine as the base image? Are there any advantages to using this base aside from it being very lightweight?

No, that's the (only) reason why everybody uses alpine as the base image. The image should be as small as possible.

laczkol commented 1 month ago

ChatGPT told me to do it like this, and it worked, so I saw no reason to change it. The reason to use a .def file might have something to do with the fact that the entrypoint (%runscript) is set in it. Maybe it would work like this if I just pulled the Docker image with apptainer, and it would simply adopt the Docker image's entrypoint, I haven't tried.

I have and the entrypoint is "adopted", the Docker --> Apptainer works flawlessly here. There is no need to set the entrypoint again. If using the .def file is not absolutely necessary and the conversion works I would definitely just pull the Docker image.

gpetho commented 1 month ago

I have and the entrypoint is "adopted", the Docker --> Apptainer works flawlessly here. There is no need to set the entrypoint again.

Right, then there is no point in building the Apptainer image from .def, just pulling it does the same. Ultimately it doesn't matter, we get the exact same result either way, but I'll change it to apptainer pull as requested.

This is ChatGPT's much more detailed answer to your original question:

The key difference between pulling a Docker image with apptainer pull docker://mydockerimage:latest and building an Apptainer image using a minimal .def file like the one you provided lies in flexibility, control, and customization. Here’s a breakdown of both approaches and whether building an image offers any benefits:

Pulling a Docker Image with Apptainer

When you use the command apptainer pull docker://mydockerimage:latest, Apptainer directly converts the Docker image into an Apptainer .sif image. The result is an Apptainer container that mirrors the Docker image, including its layers, entrypoint, environment variables, and any default commands.

Pros:

  • Simplicity: It’s fast and straightforward. You get the Docker image as-is, with no modifications required.
  • Automatic conversion: Apptainer handles the conversion for you, so there’s no need to define anything in a .def file.
  • Minimal overhead: This method doesn’t require extra steps to configure or set up; it just grabs the Docker image and creates the Apptainer .sif file.

Cons:

  • Lack of customization: You can’t modify or customize the Docker image before converting it to Apptainer. For example, if you want to change environment variables, add files, or run setup commands, this approach doesn't allow that easily.

Building an Apptainer Image with a .def File

Using a .def file with Apptainer gives you more control and customization over the image build process. Your .def file specifies how the Apptainer image is constructed from the Docker base image, and it can include additional steps during the %post and %runscript phases.

Here’s what happens:

  • Bootstrap: The From command pulls the base Docker image (mydockerimage:latest).
  • Customization: You can add extra setup or configuration commands in the %post section (such as installing additional software, modifying the environment, etc.).
  • Custom run behavior: You can specify custom commands or scripts that run when the container is executed via the %runscript section.

Pros:

  • Customization: You have complete control over how the Apptainer image is built. You can:
    • Add or modify files.
    • Install additional dependencies.
    • Change entry points or default behavior.
  • Flexibility: If you need to make changes to the Docker image, even minor adjustments like setting environment variables or adding configuration files, this method allows for that.
  • Pre-build tasks: You can automate tasks like setting permissions, downloading datasets, or other pre-processing steps during the build process.

Cons:

  • More complexity: It requires a bit more effort since you’re defining your image build process.
  • No real benefit for identical images: If you don’t need to make any changes to the Docker image (i.e., you want the Apptainer image to run exactly like the Docker image), this method may not offer significant advantages.

Key Differences

  1. Customization:

    • Pulling a Docker image gives you no opportunity to change or customize anything.
    • Using a .def file allows for full customization of the image, both during the build and run phases.
  2. Entrypoint and Run Behavior:

    • When you pull the Docker image, Apptainer respects the Docker image’s entrypoint and run configuration by default.
    • In the .def file, you can override or modify the entrypoint with the %runscript section, but you don’t have to. You can leave the image’s behavior the same by simply using exec "$@" in the %runscript (i.e., forwarding the same behavior as the Docker container).
  3. Pre/Post Setup:

    • Pulling doesn’t allow for pre/post setup actions like installing extra packages or modifying configurations.
    • The .def file enables you to add any setup logic during the %post phase.

Conclusion: Which to Use?

  • Use apptainer pull docker://... if you just want to convert a Docker image to Apptainer format quickly and don’t need any modifications.
  • Use a .def file if:
    • You need to customize the image.
    • You want to add extra dependencies or configuration.
    • You require more control over the build process.

If you don’t need to customize or modify the Docker image in any way, there’s no practical benefit to using the .def file over a simple apptainer pull. However, if you foresee needing any flexibility down the line, the .def approach gives you the ability to modify the image as necessary.

gpetho commented 1 month ago

@laczkol Commit https://github.com/DEpt-metagenom/containerized_genomics_tools/commit/a387040ff0060b105736f16c2a32b37c733dc2ad should have solved the problems that you indicated. Please check whether anything is missing or incorrect.

laczkol commented 1 month ago

I can see that you added the target push as a prerequisite for build_apptainer in the Makefile. Since the prerequisite for push is build, I assume you added this to ensure that a new Docker image is built and pushed to the DockerHub repository. This makes perfect sense and ensures consistency. We could think about having a build_apptainer_only target (like build_apptainer was implemented before the last commit), so that we don't get an error if a host doesn't have Docker and the only goal is to use Apptainer. Since the .sif file can be easily copied and is also included in the repo, I don't know how important this use case is, probably not very much or it doesn't make sense to include it as a separate target in the Makefile. Other than that I've no further comments, I think the structure and Makefile are fine. The README is detailed enough for me. I'd wait for @zradai's comment before finalizing it.

gpetho commented 1 month ago

I can see that you added the target push as a prerequisite for build_apptainer in the Makefile. Since the prerequisite for push is build, I assume you added this to ensure that a new Docker image is built and pushed to the DockerHub repository. This makes perfect sense and ensures consistency.

Yes, I agree that it was a smart idea to implement the workflow like this. I didn't think about it myself, it was @albertbokor-unideb who suggested this solution to me when we discussed my original Makefile, so I have to give him credit for it.

We could think about having a build_apptainer_only target (like build_apptainer was implemented before the last commit), so that we don't get an error if a host doesn't have Docker and the only goal is to use Apptainer. Since the .sif file can be easily copied and is also included in the repo, I don't know how important this use case is, probably not very much or it doesn't make sense to include it as a separate target in the Makefile.

I think this is definitely a good suggestion for tools where the .sif file is too large to be included in the repository. I didn't think of the situation that the user might not have Docker installed when I wrote the Makefile.

As an alternative to the make build_apptainer_only target, which achieves what we need, of course, but sounds kind of lame to me, I think it might be better to modify the build_apptainer target so that it runs make push normally, except when it is called like this: make build_apptainer PUSH=false, in which case the push is skipped. This just requires the addition of a simple condition in the build_apptainer target.

@zradai We're done with this part, it's your turn, please review fastp and let me know whether you are happy with all further tools being structured like this or whether you want anything changed. I suppose you probably are happy with it since you basically just wanted the .sif file, which is there (and will be moved to the specified directory next), the rest was our idea.