SystemsGenetics / GEMmaker

A workflow for construction of Gene Expression count Matrices (GEMs). Useful for Differential Gene Expression (DGE) analysis and Gene Co-Expression Network (GCN) construction
https://gemmaker.readthedocs.io/en/latest/
MIT License
33 stars 16 forks source link

Upgrade to DSL 2 syntax #245

Closed bentsherman closed 2 years ago

bentsherman commented 2 years ago

Nextflow's DSL 2 syntax provides a much better way to write pipeline scripts. The DSL 2 allows you to place all of the channel logic into a single "workflow" block, instead of this logic being scattered throughout the script and within the process blocks. Additionally, DSL 2 allows you to use the same channel multiple times instead of making copies via into.

GEMmaker has a lot of this glue logic, which makes it very hard to read, understand, and maintain. As such it should benefit tremendously from DSL 2. I intend to make this migration and then simplify the channel logic as much as I can.

I am submitting this PR now so that anyone who's interested can track my progress and provide input along the way. At this point, I have made the bare minimum changes required to satisfy DSL 2, and the workflow runs correctly... mostly. However, I think the channel logic can be simplified much further, so I will continue to work on it. I'll let you know when it's ready to review for merge.

My running checklist:

DSL 2 also introduces the ability to have separate module files. While I don't intend to incorporate this feature in this PR, GEMmaker will likely benefit from being reorganized into modules as well, so migrating to DSL 2 is the first step towards that dream.

spficklin commented 2 years ago

Hi @bentsherman this is awesome you've started this. I've been working on updating AnnoTater (now EnTAP-nf) to be DSL2 complaint as well. So this will be a wild ride for GEMmaker if we reorganize into modules.

bentsherman commented 2 years ago

Another day in the bag. The pipeline script is now much easier to read. Some remaining issues:

Some future ideas that may or may not make it into this PR:

bentsherman commented 2 years ago

I figured out how to deal with the infinite loop caused by Channel.watchPath(). Before, we could just close the channel explicitly, but now the close() method is deprecated. So instead, next_sample writes a "done" file into the process folder, and then the watcher channel closes upon reading this file via until(). The watcher is guaranteed to see this file only after it has seen all other samples. Very handy!

A nice thing is that I was able to move the code from write_stage_files and start_first_batch into the workflow block since it is just groovy code, so that should remove a little bit of overhead. You could do a similar thing with next_sample but I don't think you could guarantee that only one sample is transitioned at a time, and I couldn't find anything in Groovy for doing atomics or mutexes so I'll just leave it be.

bentsherman commented 2 years ago

I was able to condense all of the cleaning processes into one for files and one for directories. I also tried to make a function for the repetitive cleaning logic, but couldn't get the varargs to work.

@spficklin I think this PR is ready for you to review. I know there's a lot in here beyond simply satisfying DSL 2 but there were a lot of downstream benefits, and I was fixing or reorganizing small things as I went. Let me know what you think when you have time. No rush.

bentsherman commented 2 years ago

@spficklin Will the retrieve_sra_metadata process run if it doesn't have any inputs? I assume it will just run once...

spficklin commented 2 years ago

Hey @bentsherman . This updates are great! I did have to fix a few things though.

First, If the user doesn't specify a file of skipped samples I get this error:

Error executing process > 'retrieve_sra_metadata (1)'

Caused by:
  Path value cannot be null

I fixed it by removing the inputs on the retreive_sra_metadata process. We don't need any input channels we can just use the variables from the configuration.

Second, the resume feature was broken. The workflow would not resume if it was preempted. I had to fix it by making the following changes.

  1. I had to add back in the code that moved the files from the processing folder back into the stage folder. This is necessary because the channel that watches the process folder doesn't seem to fire if the file was already.
  2. The code that created the stage files would double up the contents of any stage files that were already in the stage folder, and that would prevent them from being processed. I had to add some if statements to leave the files alone if they already existed.
  3. The workflow would hang if it had been completed successfully but then was resumed. I had add a DONE file to the process folder to trigger the workflow to move along.

I made one other change, which wasn't a bug fix. I added some if statements around process calls that don't need to get used. For example, if the user didn't give any SRA files then there's no need to have the download process show up in the Nextflow output.

Finally, the naming of some of the channels confused me, so in the process of debugging I renamed some of them. It was just a few. Take a look and if you are okay with my updates I think we can merge this.

bentsherman commented 2 years ago

@spficklin Thanks for the edits. I thought I had the resume working, but maybe I didn't test resuming from a partial run. I made a few formatting edits.

The code you re-added for your point (1) was already in the beginning of the workflow block. Please resolve that per my comment.

spficklin commented 2 years ago

Hey @bentsherman you are right about item #1. I didn't notice the reworked code that did that. I removed it. This all looks good! I'll merge. Thanks for the quick review.