PMCC-BioinformaticsCore / janis-core

Core python modules for Janis Pipeline workflow assistant
GNU General Public License v3.0

Threading groups of inputs #29

Open drtconway opened 4 years ago

Hi Janis,

There's a reasonably common bit of boilerplate that comes up when composing tools: declaring inputs for all the reference-like things (the same inputs already declared by one or more of the tools being invoked), then threading them through to those tools.

For example:

...
        self.input(
            "snps_dbsnp",
            VcfTabix,
            doc=InputDocumentation(
                "From the GATK resource bundle, passed to BaseRecalibrator as ``known_sites``",
                quality=InputQualityType.static,
                example="HG38: https://console.cloud.google.com/storage/browser/genomics-public-data/references/hg38/v0/\n\n"
                "(WARNING: The file available from the genomics-public-data resource on Google Cloud Storage is NOT compressed and indexed. This will need to be completed prior to starting the pipeline.)\n\n"
                "File: gs://genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz",
            ),
        )
        self.input(
            "snps_1000gp",
            VcfTabix,
            doc=InputDocumentation(
                "From the GATK resource bundle, passed to BaseRecalibrator as ``known_sites``",
                quality=InputQualityType.static,
                example="HG38: https://console.cloud.google.com/storage/browser/genomics-public-data/references/hg38/v0/\n\n"
                "File: gs://genomics-public-data/references/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz",
            ),
        )

...

        self.step(
            "vc_gatk",
            GatkSomaticVariantCaller_4_1_3(
                normal_bam=self.normal_bam,
                tumor_bam=self.tumor_bam,
                normal_name=self.normal_name,
                tumor_name=self.tumor_name,
                intervals=self.gatk_intervals,
                reference=self.reference,
                snps_dbsnp=self.snps_dbsnp,
                snps_1000gp=self.snps_1000gp,
                known_indels=self.known_indels,
                mills_indels=self.mills_indels,
            ),
            scatter="intervals",
        )

This leads to duplication and leaves room for error, since every shared input has to be declared and wired up by hand in each composing workflow.

One possibility would be to use a static method to add groups of inputs. So for example you might have:

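class GatkSomaticVariantCaller_4_1_3(...):
    ...
    @staticmethod
    def reference_inputs(thing):
        # Declare the shared reference inputs on whatever workflow is passed in.
        thing.input("snps_dbsnp", ...)
        thing.input("snps_1000gp", ...)
        # ...etc...

    def constructor(self):
        # A static method must be qualified by the class (or self) when called.
        GatkSomaticVariantCaller_4_1_3.reference_inputs(self)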
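
To make the idea concrete, here is a minimal sketch that runs without Janis installed. `FakeWorkflow`, `REFERENCE_INPUTS` and `add_reference_inputs` are hypothetical names invented for illustration, and the datatype is a plain string standing in for `VcfTabix`. The only assumption about the real workflow is that it has an `input(name, type, doc=...)` method, as in the snippet above.

```python
class FakeWorkflow:
    """Hypothetical stand-in for a Janis workflow; it only records declared inputs."""

    def __init__(self):
        self.declared = {}

    def input(self, name, datatype, doc=None):
        self.declared[name] = (datatype, doc)


KNOWN_SITES_DOC = (
    "From the GATK resource bundle, passed to BaseRecalibrator as ``known_sites``"
)

# One shared table describing the group of reference-like inputs.
REFERENCE_INPUTS = [
    ("snps_dbsnp", "VcfTabix", KNOWN_SITES_DOC),
    ("snps_1000gp", "VcfTabix", KNOWN_SITES_DOC),
]


def add_reference_inputs(workflow, specs=REFERENCE_INPUTS):
    """Declare every input in `specs` on `workflow` and return the names declared."""
    for name, datatype, doc in specs:
        workflow.input(name, datatype, doc=doc)
    return [name for name, _, _ in specs]


wf = FakeWorkflow()
names = add_reference_inputs(wf)
```

Keeping the group in one table means a composing workflow and the tool it wraps cannot drift apart on names or documentation.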

This doesn't help much with passing the inputs along to the step, though. Half an idea about how to reduce that is to use Python's keyword argument unpacking. It seems like you should be able to do something like:

    self.step("vc_gatk", GatkSomaticVariantCaller_4_1_3(..., **refs))

I don't quite have it figured out, but perhaps a static method on GatkSomaticVariantCaller could return the dictionary.
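
A rough sketch of how that dictionary could be built, again without Janis itself. `REFERENCE_NAMES`, `reference_kwargs` and `fake_caller` are hypothetical names, and the `SimpleNamespace` stands in for a workflow whose declared inputs are reachable as attributes (as `self.snps_dbsnp` is in the example above).

```python
from types import SimpleNamespace

# Names of the reference-like inputs shared by the workflow and the tool.
REFERENCE_NAMES = (
    "reference",
    "snps_dbsnp",
    "snps_1000gp",
    "known_indels",
    "mills_indels",
)


def reference_kwargs(workflow, names=REFERENCE_NAMES):
    """Collect the workflow's reference inputs into a keyword dictionary."""
    return {name: getattr(workflow, name) for name in names}


def fake_caller(**connections):
    """Hypothetical stand-in for a tool; it just returns the connections it was given."""
    return connections


# Stand-in workflow: each attribute is a placeholder for an input node.
wf = SimpleNamespace(
    normal_bam="<input normal_bam>",
    tumor_bam="<input tumor_bam>",
    **{name: f"<input {name}>" for name in REFERENCE_NAMES},
)

refs = reference_kwargs(wf)
step = fake_caller(normal_bam=wf.normal_bam, tumor_bam=wf.tumor_bam, **refs)
```

Only the inputs specific to the step are spelled out by hand; the shared group arrives through `**refs`, so adding a new reference input means touching `REFERENCE_NAMES` alone.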

To quote Terry Pratchett, speaking of Ly Tin Wheedle: "at that point the bar closed."