fmalmeida / bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
https://bacannot.readthedocs.io/en/latest/
GNU General Public License v3.0

re-think modules organization and structure #36

Closed · fmalmeida closed this 2 years ago

fmalmeida commented 2 years ago

It would be nice if the pipeline were also capable of running with Singularity and Conda instead of only Docker, which would be closer to what is found in nf-core modules. To do that, the following should be addressed:


These changes are related to, and would also affect, the following issues, making them possible or easier to implement:
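For context, a minimal sketch of the kind of profile setup this would enable (a sketch only, not the pipeline's actual config; the scopes used are standard Nextflow options):

```groovy
// nextflow.config -- sketch of the three execution modes this issue asks for
profiles {
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled    = true
        singularity.autoMounts = true
    }
    conda {
        // per-module conda directives would live in a dedicated config file
        includeConfig 'conf/conda.config'
    }
}
```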

abhi18av commented 2 years ago

Adding this here, since the question above will decide whether the following information is relevant or not.

https://github.com/fmalmeida/bacannot/blob/0cd2c214f09334eafb79bb7b691210a0d29a2547/main.nf#L111

If we continue to use custom modules, then the allocation of module-level parameters could be simplified by using addParams( params.MODULE_NAME ) in the module import statement, as shown here:

https://github.com/mtb-bioinformatics/mtbseq-nf/blob/a459f5d95432f088a3e4d94cab415f88d22916a1/workflows/batch_analysis.nf#L4

And then all the pipeline parameters can be collected within the config file itself, as shown here:

https://github.com/mtb-bioinformatics/mtbseq-nf/blob/a459f5d95432f088a3e4d94cab415f88d22916a1/conf/global_params.config#L137
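For illustration, a minimal sketch of this pattern; the module and parameter names below are hypothetical, not taken from bacannot:

```groovy
// workflows/batch_analysis.nf -- pass a parameter map at import time
include { ANNOTATION } from '../modules/annotation' addParams( params.ANNOTATION )
```

```groovy
// conf/global_params.config -- module-scoped parameters collected in one place
params {
    ANNOTATION {
        kingdom = 'Bacteria'
        evalue  = '1e-6'
    }
}
```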

fmalmeida commented 2 years ago

Hi @abhi18av,

Thank you for pointing that out. However, I believe this one is not relevant, because I don't actually use module-level parameters; I only use the parameters that are set globally for the workflow.

This was a mistake: I thought I had to declare the parameters I wanted to use, but it turns out I didn't. These declarations have been removed in the new 3.x version of the pipeline.

fmalmeida commented 2 years ago

Note on the issue:

I will try to address this issue in branch https://github.com/fmalmeida/bacannot/tree/remodeling.

Steps:

fmalmeida commented 2 years ago

Found a small hindrance while updating the pipeline in the https://github.com/fmalmeida/bacannot/tree/remodeling branch that needs to be addressed before continuing:

I need to think about this.

fmalmeida commented 2 years ago

Decided to always use BioContainers images whenever possible, and to create a miscellaneous image that hosts the tools for the modules that have more than one conda dependency and, therefore, do not have a BioContainer.
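As a rough sketch of the resulting container config (the process names and the tag of the miscellaneous image are hypothetical; the prokka image is shown only as an example of a BioContainers address):

```groovy
// conf/docker.config -- mix BioContainers with one custom fallback image
process {
    // single-tool module: use the corresponding BioContainers image
    withName: 'PROKKA' {
        container = 'quay.io/biocontainers/prokka:1.14.6--pl526_0'
    }
    // multi-dependency module without a BioContainer: use the misc image
    withName: 'REPORT' {
        container = 'fmalmeida/bacannot:misc'
    }
}
```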

abhi18av commented 2 years ago

Definitely a great step forward! 👍

Might I suggest cross-publishing it on the Docker Hub and quay.io registries, so that there's a fallback in case of running into Docker rate limits.

fmalmeida commented 2 years ago

What do you mean by cross-publish it? Can I upload images to quay.io?

abhi18av commented 2 years ago

Oh, nothing fancy - just that you'll need to tag the same image twice: once for pushing to Docker Hub and then again for pushing it to quay.io.

https://docs.docker.com/engine/reference/commandline/push/

https://docs.quay.io/solution/getting-started.html
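For example, something along these lines (image name and tag are just illustrative):

```sh
# build once, tag for both registries, then push to each
docker build -t fmalmeida/bacannot:misc .
docker tag fmalmeida/bacannot:misc quay.io/fmalmeida/bacannot:misc
docker push fmalmeida/bacannot:misc
docker push quay.io/fmalmeida/bacannot:misc
```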

fmalmeida commented 2 years ago

The hindrance mentioned in this comment, and thought to be solved in this other comment, is actually yet to be solved.

While implementing the changes, we saw that switching to BioContainers would not cover the entirety of the modules, and we would still need two or three different images to host other required tools and programming languages.

After some thought, we realized that having "half" from BioContainers and "half" from custom Docker images would not be the most standardized option ...

So we are back to the dilemma:

  1. Either focus on custom Docker images, say one for each "main dependency": an image for tools that require R, another for Python 3.6, another for Python > 3.6, etc., so that only 3 or 4 Docker images need to be downloaded ...
    • These Docker images could then also be used with Singularity, since they would only have the conda packages installed inside them.
    • We would also provide a conda profile that declares the dependencies for each module (sketched below), still allowing users to choose Docker, Singularity, or Conda.
  2. Or try to completely remodel the pipeline so the modules are smaller and have fewer dependencies, allowing at least 90% of it to be run with BioContainers.
    • The problem with this option is the effort it would require and the time it would take, given how little time we could spend on it.

... 🤔 ...
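To make option 1 concrete, a rough sketch of what the conda profile could look like (module names and package versions are hypothetical):

```groovy
// nextflow.config -- hypothetical conda profile next to docker/singularity
profiles {
    conda {
        process {
            // more than one conda package can be listed per module,
            // mirroring what each custom image would contain
            withName: 'REPORT' { conda = 'r-base=4.1 r-rmarkdown' }
            withName: 'MLST'   { conda = 'mlst=2.19' }
        }
    }
}
```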

abhi18av commented 2 years ago

Hi @fmalmeida ,

Unless I'm mistaken, these are all of the Docker images used in the pipeline, right?

https://github.com/fmalmeida/bacannot/blob/develop/conf/docker.config

> While implementing the changes, we saw that switching to BioContainers would not cover the entirety of the modules, and we would still need two or three different images to host other required tools and programming languages.

Also, if possible, could we discuss this further using some example modules? It's a bit too abstract for me atm 😅

fmalmeida commented 2 years ago

Hi @abhi18av,

It is nothing too fancy. It is just that some modules, such as the one that renders the R Markdown reports or the one for the genome browser, have a few dependencies (scripts or multiple libraries) that would not be available inside the BioContainers images, which are designed to contain only the tool (or conda package) itself. And some modules, such as digIS, use tools that are not available in conda at all.

So, if we went forward with changing the ones that could be changed to BioContainers, some modules would still require non-BioContainers images, like bacannot:server, bacannot:renv and bacannot:jbrowse, and we would end up with a mixture of BioContainers and custom images.

The "problem", is not actually a problem for real, is just a concept and that I am not too much of a fan of doing such mixtures, I like things more standardized 😅

Just to be clear, I am still going to refactor the modules to read the databases from outside the containers, as suggested when opening the issue. But, instead of changing everything to BioContainers, I am thinking of:

  1. Having the pipeline read the databases from outside the Docker images, with a module that prepares such a directory with the required databases (see the sketch after this list).
  2. Then creating smaller custom Docker images based on the main dependency (or conflict) of the conda packages, like bacannot:renv, bacannot:py36_env, bacannot:perl_env, etc., with their related tools inside them.
    • This would create something like 4 or 5 images, holding all the dependencies, without conflicts or database files, which would allow them to be used with Singularity.
    • 4 or 5 is already the number of images required, which is why I am more favourable to this option. But they would be smaller, because the databases will be outside.
  3. Then, try to create a config file for conda: since we can set more than one package per module when using conda, this should be possible in theory.
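A rough sketch of point 1, with hypothetical parameter, process and tool names:

```groovy
// hypothetical module: prepares the database directory once, outside the images
process PREPARE_DATABASES {
    publishDir params.output, mode: 'copy'

    output:
    path "databases"

    script:
    """
    mkdir -p databases
    # download VFDB, CARD, etc. into ./databases here
    """
}

// downstream modules then receive the directory as a normal input,
// so it is mounted into the (now database-free) containers at runtime
process ANNOTATE {
    input:
    path genome
    path databases

    script:
    """
    some_tool --db $databases --genome $genome
    """
}
```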

And yes, these are the docker images used.

The point is just that, instead of pointing 60-70% of the modules to BioContainers, I am leaning towards creating these smaller custom images, which, for me, would be easier to maintain.

😁

abhi18av commented 2 years ago

Hi Felipe,

Thanks for sharing the full context!

Allow me to take a deeper look at this today, and by tomorrow I'll share my thoughts (if you need them 😅). Perhaps we can find an optimal strategy for this one.

fmalmeida commented 2 years ago

Hi Abhinav,

Thanks for the interest.

For this issue specifically, we have discussed it here in the lab and decided to go on as described for now, since this approach is already established for our local pipelines; it thus follows the standards of the lab and requires fewer "changes" for the time being.

But I'd surely like to hear your thoughts; maybe I could use them to find the optimal strategy you mentioned, for future releases or future pipelines.

😄

abhi18av commented 2 years ago

> For this issue specifically, we have discussed it here in the lab and decided to go on as described for now, since this approach is already established for our local pipelines; it thus follows the standards of the lab and requires fewer "changes" for the time being.

Ah, okay. Yeah, maintenance burden is an important factor 👍

Then I'll simply summarize my suggestions here.

The second point could be used in combination with the process selectors to add this dynamic behavior, as sketched below.
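For illustration, a sketch of how process selectors can assign containers dynamically (the labels, process names and image tags below are hypothetical):

```groovy
// nextflow.config -- selector-based container assignment
process {
    // every process carrying the 'renv' label shares the R image
    withLabel: 'renv' {
        container = 'fmalmeida/bacannot:renv'
    }
    // a single process can still be overridden by name
    withName: 'JBROWSE' {
        container = 'fmalmeida/bacannot:jbrowse'
    }
}
```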

Hope this helps (in some way 😉 )!

fmalmeida commented 2 years ago

Many thanks for the suggestions!

About the first one: we were already facing some problems with the image sizes, which also helped trigger this issue 😅

And about the second one: I just loved the idea. I didn't know about those options or how useful they could be.

Thanks again for these nice resources 😄