Closed marcelm closed 2 years ago
I would say at this moment having a minute init functionality like this is not changing my workflow a lot. I like the fact that one does not need to think about where the config.yaml file is, but especially in the case where we run on Uppmax, there is always some extra fiddling, like some sbatch file lying around that also needs to be fixed, and some symlinks to be made to the FASTQ files and so on.
I have some ad-hoc fixes for the Uppmax stuff, like some aliases and some snakemake-profiles configured, but that feels a bit difficult to provide out of the box to someone else in a different environment, and maybe it's unnecessary.
This being said, I think structuring things into subcommands could be worth having (e.g. the obvious download probably fits nicely here), and even if the init in itself is a small functionality, it still is neater than going to where /path/to/config.yaml is and copying it over.
The subcommand also has a --reads option that will create the symlink to the directory with FASTQ files. Would you use that?
I also want to copy libraries.tsv and groups.tsv into the created directory. At the moment, I would implement it so that more or less empty templates for these are created, which would then need to be edited. Additionally, one would be able to provide options --libraries=file.tsv and --groups=anotherfile.tsv and then these files would be copied (or symlinked) instead. Would that in any way improve (or impede) your workflow?
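To make the proposal concrete, here is a minimal sketch of what such an init step could look like. It is not the code in this PR: the function name, the "fastq" link name and the stub header line are assumptions made for illustration only.

```python
import shutil
from pathlib import Path


def init_pipeline_dir(target, config_template, reads=None):
    """Create a pipeline directory with a config file and stub TSV files.

    Hypothetical helper; names and stub contents are assumptions.
    """
    target = Path(target)
    target.mkdir()  # raises FileExistsError if the directory already exists
    shutil.copy(config_template, target / "config.yaml")

    # More or less empty templates that the user edits afterwards.
    # The commented header below is an assumed column layout.
    (target / "libraries.tsv").write_text("# name\treplicate\tbarcode\tfastq_base\n")
    (target / "groups.tsv").write_text("")

    if reads is not None:
        # Link to the FASTQ directory instead of copying the files.
        link = target / "fastq"  # assumed link name
        link.symlink_to(Path(reads).resolve(), target_is_directory=True)
    return target
```

Calling init_pipeline_dir("pipelinedir", "/path/to/config.yaml", reads="some/reads/dir") would then replace the manual mkdir, cp and ln -s steps.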
I have added a minute download command to this PR. (Not as a separate PR because I need the first refactoring commit.)
Example run (beginning):
$ minute download
INFO: Files for accession SRR8547557 exist, skipping
INFO: Files for accession SRR8547558 exist, skipping
INFO: Running fastq-dump --outdir ./tmpa1cg827q --gzip --split-3 --defline-qual + --defline-seq @$sn SRR8547559
The --reads parameter I like!
The "stub" libraries.tsv and groups.tsv I also like. It could even be a bit clever about it and fill in one row per FASTQ file, named after the file minus the _R1.fastq.gz/_R2.fastq.gz suffix, when --reads is provided.
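A rough sketch of the "clever" part, assuming all reads follow the _R1.fastq.gz/_R2.fastq.gz naming. The function name is hypothetical and this is not part of the PR:

```python
from pathlib import Path

# Suffixes to strip from paired-end FASTQ file names
SUFFIXES = ("_R1.fastq.gz", "_R2.fastq.gz")


def fastq_base_names(reads_dir):
    """Return the sorted, de-duplicated base names of FASTQ files in reads_dir."""
    names = set()
    for path in Path(reads_dir).iterdir():
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                # R1 and R2 of the same sample collapse into one name
                names.add(path.name[: -len(suffix)])
    return sorted(names)
```

Each returned name could then become one stub row in libraries.tsv. Files that do not match either suffix are silently ignored.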
The copying over of the files I don't see too much of an improvement over copying the actual files if one has them already.
There is a specific use case that can be a pain. It is a usual one but it's not always like this. I think this would be a nice optional thing to do with minute init: It is when we have a set of barcodes that is always the same across all the IPs. For instance, if we have three barcodes representing libraries A, B and C and 4 IPs:
IP1_A 1 AAAAAAAA IP1
IP1_B 1 CCCCCCCC IP1
IP1_C 1 TTTTTTTT IP1
IP2_A 1 AAAAAAAA IP2
IP2_B 1 CCCCCCCC IP2
...
IP4_C 1 TTTTTTTT IP4
Assuming FASTQ files are named IP1_R1.fastq.gz and so on, the ideal situation would be that one only specifies barcodes and replicates, as in:
A 1 AAAAAAAA
B 1 CCCCCCCC
C 1 TTTTTTTT
And then generate it across --reads files. It's not terrible to do manually, but it's also highly repetitive and a common use case.
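In code, the expansion I have in mind would be something like the following sketch. The function name is made up, and the four output columns simply mirror the example table above:

```python
def expand_libraries(templates, fastq_bases):
    """Combine every (name, replicate, barcode) template with every FASTQ base name.

    Returns rows of (library name, replicate, barcode, FASTQ base name).
    """
    rows = []
    for base in fastq_bases:
        for name, replicate, barcode in templates:
            # e.g. base "IP1" and template name "A" give library "IP1_A"
            rows.append((f"{base}_{name}", replicate, barcode, base))
    return rows


templates = [("A", "1", "AAAAAAAA"), ("B", "1", "CCCCCCCC"), ("C", "1", "TTTTTTTT")]
for row in expand_libraries(templates, ["IP1", "IP2", "IP3", "IP4"]):
    print("\t".join(row))
```

With three barcodes and four IPs this prints the twelve rows of the long table, starting with IP1_A and ending with IP4_C.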
The --reads parameter I like!
Good, then it’ll stay.
The "stub" libraries.tsv and groups.tsv I also like. It could even be a bit clever about it and fill in one row per FASTQ file minus the _R1.fastq.gz/_R2.fastq.gz, when --reads is provided.
It sounds good, but looking into how to actually do the clever bit, I wonder whether it would work so well. I see two problems:
The copying over of the files I don't see too much of an improvement over copying the actual files if one has them already.
Ok, makes sense.
There is a specific use case that can be a pain. It is a usual one but it's not always like this. I think this would be a nice optional thing to do with minute init: It is when we have a set of barcodes that is always the same across all the IPs. For instance, if we have three barcodes representing libraries A, B and C and 4 IPs:
IP1_A 1 AAAAAAAA IP1
IP1_B 1 CCCCCCCC IP1
IP1_C 1 TTTTTTTT IP1
IP2_A 1 AAAAAAAA IP2
IP2_B 1 CCCCCCCC IP2
...
IP4_C 1 TTTTTTTT IP4
Assuming FASTQ files are named IP1_R1.fastq.gz and so on, the ideal situation would be that one only specifies barcodes and replicates, as in:
A 1 AAAAAAAA
B 1 CCCCCCCC
C 1 TTTTTTTT
And then generate it across --reads files. It's not terrible to do manually, but it's also highly repetitive and a common use case.
I was actually wondering about this because even the testing libraries.tsv has a similar redundant pattern. Shall we perhaps discuss this separately? I think if you agree that this PR is already an improvement, then we should merge it and open follow-up PRs or issues for further "cleverness" improvements.
I think if you agree that this PR is already an improvement, then we should merge it and open follow-up PRs or issues for further "cleverness" improvements.
Agreed :) The other improvements are not so urgent.
The first commit refactors the command-line handling so that subcommands are automatically discovered: Each module in minute/cli/ automatically becomes a subcommand. The help text is taken from the module-level docstring. I have copied this over from IgDiscover, so this should work fine.
The second commit adds an "init" subcommand that at this point is just a replacement for "mkdir pipelinedir" and "cp config.yaml pipelinedir/". I’m not so sure how much this actually improves your workflow because I don’t know which files you just copy over and which you adjust each time you create a new pipeline directory. So perhaps the usefulness is limited, which is why I’m marking this PR as draft.
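For readers unfamiliar with the pattern used in the first commit, subcommand discovery of this kind can be sketched roughly as follows. This is a simplified illustration, not the code copied from IgDiscover: the function names are hypothetical, and the real modules presumably also register their own arguments and an entry point.

```python
import argparse
import importlib
import pkgutil


def discover_subcommands(package):
    """Import every plain module in `package` and map its name to the module."""
    modules = {}
    for info in pkgutil.iter_modules(package.__path__):
        if info.ispkg:
            continue  # only plain modules become subcommands in this sketch
        full_name = f"{package.__name__}.{info.name}"
        modules[info.name] = importlib.import_module(full_name)
    return modules


def build_parser(package, prog):
    """Create an argparse parser with one subcommand per discovered module."""
    parser = argparse.ArgumentParser(prog=prog)
    subparsers = parser.add_subparsers(dest="command")
    for name, module in discover_subcommands(package).items():
        doc = (module.__doc__ or "").strip()
        # First docstring line is the short help, the full docstring the description
        help_text = doc.splitlines()[0] if doc else None
        subparsers.add_parser(name, help=help_text, description=doc or None)
    return parser
```

With this approach, adding a new subcommand only requires dropping a new module with a docstring into the package; no central list of commands has to be maintained.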