PMCC-BioinformaticsCore / janis

[Alpha] Janis: an open source tool to machine generate type-safe CWL and WDL workflows
https://janis.readthedocs.io/
GNU General Public License v3.0

documentation: subworkflows #13

Open matthdsm opened 4 years ago

matthdsm commented 4 years ago

Hi,

Would it be possible to add a quick comment on how to use subworkflows? Do I just add them as in a "master" workflow?

Thanks M

illusional commented 4 years ago

You can add them to the registry in exactly the same way as a command tool (by declaring it and importing it in the provider’s init). There’s an example here: https://github.com/PMCC-BioinformaticsCore/janis-bioinformatics/blob/master/janis_bioinformatics/tools/common/bwaaligner.py

You can use them interchangeably with a command tool.

Here’s an example where subworkflows are used, and where another subworkflow is declared in the method “self.process_subpipeline”:

https://github.com/PMCC-BioinformaticsCore/janis-pipelines/blob/6ff3929e56aafe5616cb9fc2310b6d8198c97690/janis_pipelines/wgs_somatic/wgssomatic.py#L65

Same if you use a workflow builder:

subwf = WorkflowBuilder(...)
# build subwf here

wf = WorkflowBuilder(...)
wf.step("subWfStepId", subwf(**inputMap))
wf.output("outFromSubWf", source=wf.subWfStepId.nameOfOutput)

I’ll leave this open as I still need to document it.

matthdsm commented 4 years ago

Great! Thanks for the quick reply!

Cheers M

illusional commented 4 years ago

No worries! Feel free to keep raising issues here, very happy to answer them!

It’s actually amazing that planes have WIFI.

matthdsm commented 4 years ago

Talk about "over the air" updates 😉

matthdsm commented 4 years ago

Unrelated question: say I have an array of FastqGz from bcl2fastq and I need to create a sample map that's consumable as a list of files (fofn in GATK terms).

Practically, I need something along the lines of

bcl2fastq -> Array(FastqGz)
    -> "Unknown method"
        -> FastqGzPair + sampleName: String()
            -> Gatk4FastqToSamLatest.fastqR1, Gatk4FastqToSamLatest.fastqR2

I'm thinking about creating a python tool that parses the list of FastqGz to an object formatted as

{
    samplename: {
        "R1": samplename_R1.fastq.gz,
        "R2": samplename_R2.fastq.gz
    },
    ...
}

but I'm unsure on how to correctly implement this as something that'll make sense in janis.

Any ideas? Advice?

Thanks already. Cheers M

illusional commented 4 years ago

Yes you could build a PythonTool that returned an object:

{
    "sampleName": YourSampleName,
    "R1": samplename_R1.fastq.gz,
    "R2": samplename_R2.fastq.gz
}

Which could map to the outputs:

Ultimately, it would be useful in Janis to refer to the first index of an output (eg: w.bclStep.fastqs[0]), but we’re a little bit off that in #8

matthdsm commented 4 years ago

Great, thanks! So I suppose something like this should work?

class GenerateSampleMap(janis.PythonTool):
    def id(self):
        return "GenerateSampleMap"

    def version(self):
        return "v0.0.1"

    @staticmethod
    def code_block(files_list: List[str]):
        samplemap = {}
        for filename in files_list:
            samplename = filename.split("_S")[0]
            if samplename not in samplemap:
                samplemap[samplename] = {}
            if "R1" in filename:
                samplemap[samplename]["R1"] = filename
            elif "R2" in filename:
                samplemap[samplename]["R2"] = filename

        return [{"samplename": k, **v} for k, v in samplemap.items()]

    def outputs(self) -> List[TOutput]:
        return [
            TOutput("samplename", String()),
            TOutput("R1", FastqGz()),
            TOutput("R2", FastqGz()),
        ]

illusional commented 4 years ago

Ah I see I see. We don’t support these custom structures. I’d recommend making each return type an array:


    def outputs(self) -> List[TOutput]:
        return [
            TOutput("samplename", Array(String())),
            TOutput("R1", Array(FastqGz())),
            TOutput("R2", Array(FastqGz())),
        ]

(And changing your python code to suit)

Then when you use the result from this, you can dot scatter on all three fields: https://github.com/PMCC-BioinformaticsCore/janis-workshops/blob/master/workshop2/6-scatter.md
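As a rough illustration of "changing your python code to suit", here is a minimal plain-Python sketch of the grouping logic returning three index-aligned lists instead of a list of custom objects. It deliberately does not import Janis: the function name `build_sample_arrays` is hypothetical, the stricter `"_R1_"` / `"_R2_"` match is an assumption about the file naming, and whether a PythonTool's `code_block` should return exactly this dictionary shape should be checked against the Janis documentation.

```python
from collections import defaultdict
from typing import Dict, List


def build_sample_arrays(files_list: List[str]) -> Dict[str, List[str]]:
    """Group FASTQ file names by sample and return three parallel lists.

    Hypothetical helper: the keys mirror the output names used in this
    thread ("samplename", "R1", "R2").
    """
    # sample name -> {"R1": file, "R2": file}
    samplemap: Dict[str, Dict[str, str]] = defaultdict(dict)
    for filename in files_list:
        # Same convention as earlier in the thread: the sample name is
        # everything before the first "_S".
        samplename = filename.split("_S")[0]
        # Assumption: the read tag is delimited by underscores, so a sample
        # name that happens to contain "R1" is not misclassified.
        if "_R1_" in filename:
            samplemap[samplename]["R1"] = filename
        elif "_R2_" in filename:
            samplemap[samplename]["R2"] = filename

    # Keep only complete pairs so the three lists stay index-aligned.
    names = sorted(
        name
        for name, reads in samplemap.items()
        if "R1" in reads and "R2" in reads
    )
    return {
        "samplename": names,
        "R1": [samplemap[name]["R1"] for name in names],
        "R2": [samplemap[name]["R2"] for name in names],
    }


result = build_sample_arrays(
    [
        "D1710903_S64_R1_001.fastq.gz",
        "D1820847_S46_R2_001.fastq.gz",
        "D1710903_S64_R2_001.fastq.gz",
        "D1820847_S46_R1_001.fastq.gz",
    ]
)

# A dot scatter consumes the i-th element of every array together,
# which is the same pairing zip() produces here.
for name, r1, r2 in zip(result["samplename"], result["R1"], result["R2"]):
    print(name, r1, r2)
```

Because every list is built from the same sorted list of sample names, element i of "samplename", "R1" and "R2" always belongs to the same sample, which is the property a dot scatter relies on.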

matthdsm commented 4 years ago

Awesome, thanks for the help! The code is now:

class GenerateSampleMap(janis.PythonTool):
    def id(self):
        return "GenerateSampleMap"

    def version(self):
        return "v0.0.1"

    @staticmethod
    def code_block(files_list: List[str]):
        samplemap = {}
        for filename in files_list:
            samplename = filename.split("_S")[0]
            if samplename not in samplemap:
                samplemap[samplename] = {}
            if "R1" in filename:
                samplemap[samplename]["R1"] = filename
            elif "R2" in filename:
                samplemap[samplename]["R2"] = filename

        return [[v[key] for key in sorted(v.keys())] for k, v in samplemap.items()]

    def outputs(self) -> List[TOutput]:
        return [
            TOutput("R1", FastqGz()),
            TOutput("R2", FastqGz()),
        ]

which outputs roughly as

[
    ['D1710903_S64_R1_001.fastq.gz', 'D1710903_S64_R2_001.fastq.gz'],
    ['D1820847_S46_R1_001.fastq.gz', 'D1820847_S46_R2_001.fastq.gz'],
    ['D1900814_S78_R1_001.fastq.gz', 'D1900814_S78_R2_001.fastq.gz'],
    ['D1904578_S33_R1_001.fastq.gz', 'D1904578_S33_R2_001.fastq.gz'],
    ['D1905752_S79_R1_001.fastq.gz', 'D1905752_S79_R2_001.fastq.gz'],
    ['D1908147_S47_R1_001.fastq.gz', 'D1908147_S47_R2_001.fastq.gz'],
    ['D1821957_S71_R1_001.fastq.gz', 'D1821957_S71_R2_001.fastq.gz'],
    ['D1905632_S84_R1_001.fastq.gz', 'D1905632_S84_R2_001.fastq.gz'],
    ['D1908155_S48_R1_001.fastq.gz', 'D1908155_S48_R2_001.fastq.gz'],
    ['D1812139_S1_R1_001.fastq.gz', 'D1812139_S1_R2_001.fastq.gz'],
    ['D1901986_S98_R1_001.fastq.gz', 'D1901986_S98_R2_001.fastq.gz'],
    ['D1907884_S45_R1_001.fastq.gz', 'D1907884_S45_R2_001.fastq.gz'],
    ['D1822234_S77_R1_001.fastq.gz', 'D1822234_S77_R2_001.fastq.gz'],
    ['D1905676_S2_R1_001.fastq.gz', 'D1905676_S2_R2_001.fastq.gz'],
    ['D1908600_S3_R1_001.fastq.gz', 'D1908600_S3_R2_001.fastq.gz']
]

and is ideal for a dotproduct as you said!

Thanks! M