fmalmeida / bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
https://bacannot.readthedocs.io/en/latest/
GNU General Public License v3.0

re-think modules organization and structure #36

Closed · fmalmeida closed this 2 years ago

fmalmeida commented 2 years ago

It would be nice if the pipeline were also capable of running with Singularity and Conda instead of only Docker, which would be closer to what is found in nf-core modules. To do that, the following should be addressed:


These changes are related to, and would also affect, the following issues, making them possible or easier to implement:
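For context, a minimal sketch of the kind of profile setup this would enable (a sketch only, not the pipeline's actual config; the scopes used are standard Nextflow options):

```groovy
// nextflow.config -- sketch of the three execution modes this issue asks for
profiles {
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled    = true
        singularity.autoMounts = true
    }
    conda {
        // per-module conda directives would live in a dedicated config file
        includeConfig 'conf/conda.config'
    }
}
```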

abhi18av commented 2 years ago

Adding this here, since the question above will decide whether the following information is relevant or not.

https://github.com/fmalmeida/bacannot/blob/0cd2c214f09334eafb79bb7b691210a0d29a2547/main.nf#L111

If we continue to use custom modules, then the allocation of module-level parameters could be simplified by using addParams( params.MODULE_NAME ) in the module import statement, as shown here:

https://github.com/mtb-bioinformatics/mtbseq-nf/blob/a459f5d95432f088a3e4d94cab415f88d22916a1/workflows/batch_analysis.nf#L4

And then all the pipeline parameters can be collected within the config file itself, as shown here:

https://github.com/mtb-bioinformatics/mtbseq-nf/blob/a459f5d95432f088a3e4d94cab415f88d22916a1/conf/global_params.config#L137
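For illustration, a minimal sketch of this pattern; the module and parameter names below are hypothetical, not taken from bacannot:

```groovy
// workflows/batch_analysis.nf -- pass a parameter map at import time
include { ANNOTATION } from '../modules/annotation' addParams( params.ANNOTATION )
```

```groovy
// conf/global_params.config -- module-scoped parameters collected in one place
params {
    ANNOTATION {
        kingdom = 'Bacteria'
        evalue  = '1e-6'
    }
}
```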

fmalmeida commented 2 years ago

Hi @abhi18av,

Thank you for pointing that out. However, I believe this one is not relevant, because I don't actually use module-level parameters; I only use the parameters that are set globally for the workflow.

This was a mistake: I thought I had to declare the parameters I wanted to use, but it turns out I didn't. These declarations have been removed in the new 3.x version of the pipeline.

fmalmeida commented 2 years ago

Note on the issue:

I will try to address this issue in branch https://github.com/fmalmeida/bacannot/tree/remodeling.

Steps:

fmalmeida commented 2 years ago

Found a small hindrance while updating the pipeline in the https://github.com/fmalmeida/bacannot/tree/remodeling branch that needs to be addressed before continuing:

I need to think about this.

fmalmeida commented 2 years ago

Decided to always use BioContainers images whenever possible, and to create a miscellaneous image that hosts the tools for the modules that have more than one conda dependency and, therefore, do not have a BioContainer.
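As a rough sketch of the resulting container config (the process names and the tag of the miscellaneous image are hypothetical; the prokka image is shown only as an example of a BioContainers address):

```groovy
// conf/docker.config -- mix BioContainers with one custom fallback image
process {
    // single-tool module: use the corresponding BioContainers image
    withName: 'PROKKA' {
        container = 'quay.io/biocontainers/prokka:1.14.6--pl526_0'
    }
    // multi-dependency module without a BioContainer: use the misc image
    withName: 'REPORT' {
        container = 'fmalmeida/bacannot:misc'
    }
}
```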

abhi18av commented 2 years ago

Definitely a great step forward! 👍

Might I suggest cross-publishing it on the Docker Hub and quay.io registries, so that there's a fallback in case of running into Docker rate limits.

fmalmeida commented 2 years ago

What do you mean by cross-publish it? Can I upload images to quay.io?

abhi18av commented 2 years ago

Oh, nothing fancy - just that you'll need to tag the same image twice: once for pushing to Docker Hub and then again for pushing it to quay.io.

https://docs.docker.com/engine/reference/commandline/push/

https://docs.quay.io/solution/getting-started.html
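For example, something along these lines (image name and tag are just illustrative):

```sh
# build once, tag for both registries, then push to each
docker build -t fmalmeida/bacannot:misc .
docker tag fmalmeida/bacannot:misc quay.io/fmalmeida/bacannot:misc
docker push fmalmeida/bacannot:misc
docker push quay.io/fmalmeida/bacannot:misc
```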

fmalmeida commented 2 years ago

The hindrance mentioned in this comment, and thought to be solved in this other comment, is actually yet to be solved.

While implementing the changes, we saw that switching to BioContainers would not cover the entirety of the modules, and we would still need two or three different images to host other required tools and programming languages.

After some thought, we realized that having "half" from BioContainers and "half" from custom Docker images would not be the most standardized option ...

So we are back to the dilemma:

  1. Either focus on custom Docker images, say one for each "main dependency": an image for tools that require R, another for Python 3.6, another for Python > 3.6, etc., so that only 3 or 4 Docker images need to be downloaded ...
    • These Docker images could then also be used with Singularity, since they would only have the conda packages installed inside them.
    • We would also provide a conda profile that declares the dependencies for each module (sketched below), still allowing users to choose Docker, Singularity, or Conda.
  2. Or try to completely remodel the pipeline so the modules are smaller and have fewer dependencies, allowing at least 90% of it to be run with BioContainers.
    • The problem with this option is the effort it would require and the time it would take, given how little time we could spend on it.

... 🤔 ...
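To make option 1 concrete, a rough sketch of what the conda profile could look like (module names and package versions are hypothetical):

```groovy
// nextflow.config -- hypothetical conda profile next to docker/singularity
profiles {
    conda {
        process {
            // more than one conda package can be listed per module,
            // mirroring what each custom image would contain
            withName: 'REPORT' { conda = 'r-base=4.1 r-rmarkdown' }
            withName: 'MLST'   { conda = 'mlst=2.19' }
        }
    }
}
```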

abhi18av commented 2 years ago

Hi @fmalmeida ,

Unless I'm mistaken, these are all of the Docker images used in the pipeline, right?

https://github.com/fmalmeida/bacannot/blob/develop/conf/docker.config

> While implementing the changes, we saw that switching to BioContainers would not cover the entirety of the modules, and we would still need two or three different images to host other required tools and programming languages.

Also, if possible, could we discuss this further using some example modules? It's a bit too abstract for me atm 😅

fmalmeida commented 2 years ago

Hi @abhi18av,

It is nothing too fancy. It is just that some modules, such as the one that renders the R Markdown reports or the one for the genome browser, have a few dependencies (scripts or multiple libraries) that would not be available inside the BioContainers images, which are designed to contain only the tool (or conda package) itself. And some modules, such as digIS, use tools that are not available in conda at all.

So, if we went forward with changing the ones that could be changed to BioContainers, some modules would still require non-BioContainers images, like bacannot:server, bacannot:renv and bacannot:jbrowse, and we would end up with a mixture of BioContainers and custom images.

The "problem", is not actually a problem for real, is just a concept and that I am not too much of a fan of doing such mixtures, I like things more standardized 😅

Just to be clear, I am still going to refactor the modules to read the databases from outside the containers, as suggested when opening the issue. But, instead of changing everything to BioContainers, I am thinking of:

  1. Having the pipeline read the databases from outside the Docker images, with a module that prepares such a directory with the required databases (see the sketch after this list).
  2. Then creating smaller custom Docker images based on the main dependency (or conflict) of the conda packages, like bacannot:renv, bacannot:py36_env, bacannot:perl_env, etc., with their related tools inside them.
    • This would create something like 4 or 5 images, holding all the dependencies, without conflicts or database files, which would allow them to be used with Singularity.
    • 4 or 5 is already the number of images required, which is why I am more favourable to this option. But they would be smaller, because the databases will be outside.
  3. Then, try to create a config file for conda: since we can set more than one package per module when using conda, this should be possible in theory.
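A rough sketch of point 1, with hypothetical parameter, process and tool names:

```groovy
// hypothetical module: prepares the database directory once, outside the images
process PREPARE_DATABASES {
    publishDir params.output, mode: 'copy'

    output:
    path "databases"

    script:
    """
    mkdir -p databases
    # download VFDB, CARD, etc. into ./databases here
    """
}

// downstream modules then receive the directory as a normal input,
// so it is mounted into the (now database-free) containers at runtime
process ANNOTATE {
    input:
    path genome
    path databases

    script:
    """
    some_tool --db $databases --genome $genome
    """
}
```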

And yes, these are the docker images used.

The point is just that, instead of pointing 60-70% of the modules to BioContainers, I am leaning towards creating these smaller custom images, which, for me, would be easier to maintain.

😁

abhi18av commented 2 years ago

Hi Felipe,

Thanks for sharing the full context!

Allow me to take a deeper look at this today, and by tomorrow I'll share my thoughts (if you need them 😅). Perhaps we can find an optimal strategy for this one.

fmalmeida commented 2 years ago

Hi Abhinav,

Thanks for the interest.

For this issue specifically, we have discussed it here in the lab and decided to go on as described for now, since this approach is already established for our local pipelines; it thus follows the standards of the lab and requires fewer "changes" for the time being.

But I'd surely like to hear your thoughts; maybe I could use them to find the optimal strategy you mentioned, for future releases or future pipelines.

😄

abhi18av commented 2 years ago

> For this issue specifically, we have discussed it here in the lab and decided to go on as described for now, since this approach is already established for our local pipelines; it thus follows the standards of the lab and requires fewer "changes" for the time being.

Ah, okay. Yeah, maintenance burden is an important factor 👍

Then I'll simply summarize my suggestions here.

The second point could be used in combination with the process selectors to add this dynamic behavior, as sketched below.
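For illustration, a sketch of how process selectors can assign containers dynamically (the labels, process names and image tags below are hypothetical):

```groovy
// nextflow.config -- selector-based container assignment
process {
    // every process carrying the 'renv' label shares the R image
    withLabel: 'renv' {
        container = 'fmalmeida/bacannot:renv'
    }
    // a single process can still be overridden by name
    withName: 'JBROWSE' {
        container = 'fmalmeida/bacannot:jbrowse'
    }
}
```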

Hope this helps (in some way 😉 )!

fmalmeida commented 2 years ago

Many thanks for the suggestions!

About the first one: we were already facing some problems with the image sizes, which also helped trigger this issue 😅

And about the second one: I just loved the idea. I didn't know about those options or how useful they could be.

Thanks again for these nice resources 😄