elsasserlab / minute

MINUTE-ChIP data analysis workflow
https://minute.readthedocs.io
MIT License

Add "minute init" subcommand #148

Closed. marcelm closed this 2 years ago

marcelm commented 2 years ago

The first commit refactors the command-line handling so that subcommands are discovered automatically: each module in minute/cli/ becomes a subcommand, and its help text is taken from the module-level docstring. I have copied this approach over from IgDiscover, so it should work fine.
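
For readers unfamiliar with the pattern, here is a minimal sketch of module-based subcommand discovery. It is not the code from this PR or from IgDiscover; the function name build_parser and the convention that each module may define add_arguments(parser) and main(args) are assumptions made for illustration.

```python
import argparse
import importlib
import pkgutil


def build_parser(package_name, prog="minute"):
    """Create a parser with one subcommand per module found in a package.

    Hypothetical convention: a module may define add_arguments(parser)
    and main(args); neither name is taken from the actual PR.
    """
    package = importlib.import_module(package_name)
    parser = argparse.ArgumentParser(prog=prog)
    subparsers = parser.add_subparsers(dest="command", required=True)
    for module_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{module_info.name}")
        # The module-level docstring doubles as the help text
        doc = (module.__doc__ or "").strip()
        help_text = doc.splitlines()[0] if doc else None
        subparser = subparsers.add_parser(
            module_info.name, help=help_text, description=doc
        )
        if hasattr(module, "add_arguments"):
            module.add_arguments(subparser)
        subparser.set_defaults(func=getattr(module, "main", None))
    return parser
```

With such a layout, adding a new file to the package would be enough to make a new subcommand appear in the help output, with no central registry to update.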

The second commit adds an "init" subcommand that at this point is just a replacement for "mkdir pipelinedir" and "cp config.yaml pipelinedir/". I’m not so sure how much this actually improves your workflow because I don’t know which files you just copy over and which you adjust each time you create a new pipeline directory. So perhaps the usefulness is limited, which is why I’m marking this PR as draft.
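
As a rough illustration of what "mkdir plus cp" amounts to in Python, a sketch could look like the following. The function name init_pipeline_dir and its parameters are hypothetical and not taken from the PR.

```python
import shutil
from pathlib import Path


def init_pipeline_dir(directory, config_template):
    """Create a new pipeline directory and copy a config template into it.

    Hypothetical helper: mirrors "mkdir pipelinedir" followed by
    "cp config.yaml pipelinedir/".
    """
    target = Path(directory)
    # Raises FileExistsError if the directory already exists,
    # so an existing pipeline directory is never overwritten
    target.mkdir(parents=True)
    shutil.copy(config_template, target / "config.yaml")
    return target
```

Refusing to reuse an existing directory is a design choice of this sketch; the actual subcommand may behave differently.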

cnluzon commented 2 years ago

I would say at this moment having a minute init functionality like this is not changing my workflow a lot. I like the fact that one does not need to think about where the config.yaml file is, but especially in the case where we run on Uppmax, there is always some extra fiddling, like some sbatch file lying around that also needs to be fixed, and some symlinks to be made to the fastq files and so on.

I have some ad-hoc fixes for the Uppmax stuff, like some aliases and some snakemake-profiles configured, but that feels a bit difficult to provide out of the box to someone else in a different environment, and maybe it's unnecessary.

That being said, I think structuring the tool into subcommands could be worth having (e.g. the obvious download probably fits nicely here), and even if init in itself is a small piece of functionality, it is still neater than going to where /path/to/config.yaml is and copying it over.

marcelm commented 2 years ago

The subcommand also has a --reads option that will create the symlink to the directory with FASTQ files. Would you use that?
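
A sketch of what such an option might do is shown below. The helper name link_reads and the link name "fastq" are assumptions for illustration; the PR may use a different name inside the pipeline directory.

```python
import os
from pathlib import Path


def link_reads(pipeline_dir, reads_dir, link_name="fastq"):
    """Symlink the directory containing FASTQ files into the pipeline directory.

    Hypothetical helper; the link name "fastq" is an assumption.
    """
    link = Path(pipeline_dir) / link_name
    # Resolve to an absolute path so the link stays valid
    # regardless of the current working directory
    os.symlink(Path(reads_dir).resolve(), link, target_is_directory=True)
    return link
```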

I also want to copy libraries.tsv and groups.tsv into the created directory. At the moment, I would implement it so that more or less empty templates for these are created, which would then need to be edited. Additionally, one would be able to provide options --libraries=file.tsv and --groups=anotherfile.tsv and then these files would be copied (or symlinked) instead. Would that in any way improve (or impede) your workflow?

marcelm commented 2 years ago

I have added a minute download command to this PR. (Not as a separate PR because I need the first refactoring commit.)

Example run (beginning):

$ minute download
INFO: Files for accession SRR8547557 exist, skipping
INFO: Files for accession SRR8547558 exist, skipping
INFO: Running fastq-dump --outdir ./tmpa1cg827q --gzip --split-3 --defline-qual + --defline-seq @$sn SRR8547559
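
The skip-if-present behaviour seen in this log could be sketched roughly as follows. The fastq-dump arguments are copied from the log line above; the function names and the assumption that downloaded files are named ACCESSION_R1.fastq.gz and ACCESSION_R2.fastq.gz are hypothetical.

```python
from pathlib import Path


def fastq_dump_command(accession, outdir):
    """Build the fastq-dump call shown in the example log as an argument list."""
    return [
        "fastq-dump", "--outdir", str(outdir), "--gzip", "--split-3",
        "--defline-qual", "+", "--defline-seq", "@$sn", accession,
    ]


def missing_accessions(accessions, fastq_dir):
    """Return the accessions whose FASTQ files are not present yet.

    Assumption: files are named <accession>_R1.fastq.gz and
    <accession>_R2.fastq.gz; the real naming scheme may differ.
    """
    fastq_dir = Path(fastq_dir)
    missing = []
    for accession in accessions:
        expected = [fastq_dir / f"{accession}_R{i}.fastq.gz" for i in (1, 2)]
        if not all(path.exists() for path in expected):
            missing.append(accession)
    return missing
```

Passing the list returned by fastq_dump_command to subprocess.run (without shell=True) would keep "@$sn" from being expanded by a shell.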

cnluzon commented 2 years ago

The --reads parameter I like!

The "stub" libraries.tsv and groups.tsv I also like. It could even be a bit clever about it and fill in one row per FASTQ file pair (the file name minus the _R1.fastq.gz/_R2.fastq.gz suffix) when --reads is provided.

The copying over of the files I don't see too much of an improvement over copying the actual files if one has them already.

There is a specific use case that can be a pain. It is a usual one but it's not always like this. I think this would be a nice optional thing to do with minute init: It is when we have a set of barcodes that is always the same across all the IPs. For instance, if we have three barcodes representing libraries A, B and C and 4 IPs:

IP1_A    1   AAAAAAAA   IP1
IP1_B    1   CCCCCCCC   IP1
IP1_C    1   TTTTTTTT   IP1
IP2_A    1   AAAAAAAA   IP2
IP2_B    1   CCCCCCCC   IP2
...
IP4_C    1   TTTTTTTT   IP4

Assuming FASTQ files are named IP1_R1.fastq.gz and so on, the ideal situation would be that one only specifies barcodes and replicates, as in:

A   1   AAAAAAAA   
B   1   CCCCCCCC
C   1   TTTTTTTT

And then generate it across --reads files. It's not terrible to do manually, but it's also highly repetitive and a common use case.
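
To make the proposal concrete, here is a small sketch of how the short barcode table could be expanded across the FASTQ files found in the --reads directory. The function names are hypothetical, and the four-column row layout simply follows the example above rather than any documented libraries.tsv specification.

```python
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"


def fastq_basenames(reads_dir):
    """Return sorted FASTQ base names, i.e. file names minus _R1.fastq.gz."""
    return sorted(
        path.name[: -len(R1_SUFFIX)]
        for path in Path(reads_dir).glob(f"*{R1_SUFFIX}")
    )


def expand_libraries(barcodes, basenames):
    """Repeat each (name, replicate, barcode) entry for every FASTQ base name.

    Rows follow the layout of the example above:
    sample name, replicate, barcode, FASTQ base name.
    """
    rows = []
    for base in basenames:
        for name, replicate, barcode in barcodes:
            rows.append((f"{base}_{name}", replicate, barcode, base))
    return rows
```

Given the three barcodes A, B and C and files IP1 to IP4, this would produce twelve rows, starting with IP1_A and ending with IP4_C.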

marcelm commented 2 years ago

> The --reads parameter I like!

Good, then it’ll stay.

> The "stub" libraries.tsv and groups.tsv I also like. It could even be a bit clever about it and fill in one row per FASTQ file pair (the file name minus the _R1.fastq.gz/_R2.fastq.gz suffix) when --reads is provided.

It sounds good, but looking into how to actually do the clever bit, I wonder whether it would work so well. I see two problems:

> The copying over of the files I don't see too much of an improvement over copying the actual files if one has them already.

Ok, makes sense.

> There is a specific use case that can be a pain. It is a usual one but it's not always like this. I think this would be a nice optional thing to do with minute init: It is when we have a set of barcodes that is always the same across all the IPs. For instance, if we have three barcodes representing libraries A, B and C and 4 IPs:
>
> IP1_A    1   AAAAAAAA   IP1
> IP1_B    1   CCCCCCCC   IP1
> IP1_C    1   TTTTTTTT   IP1
> IP2_A    1   AAAAAAAA   IP2
> IP2_B    1   CCCCCCCC   IP2
> ...
> IP4_C    1   TTTTTTTT   IP4
>
> Assuming FASTQ files are named IP1_R1.fastq.gz and so on, the ideal situation would be that one only specifies barcodes and replicates, as in:
>
> A   1   AAAAAAAA
> B   1   CCCCCCCC
> C   1   TTTTTTTT
>
> And then generate it across --reads files. It's not terrible to do manually, but it's also highly repetitive and a common use case.

I was actually wondering about this because even the testing libraries.tsv has a similar redundant pattern. Shall we perhaps discuss this separately? I think if you agree that this PR is already an improvement, then we should merge it and open follow-up PRs or issues for further "cleverness" improvements.

cnluzon commented 2 years ago

> I think if you agree that this PR is already an improvement, then we should merge it and open follow-up PRs or issues for further "cleverness" improvements.

Agreed :) The other improvements are not so urgent.