BD2KGenomics / toil-scripts

Toil workflows for common genomic pipelines
Apache License 2.0

Move generic job functions for tools to common lib (RNA-seq) #288

Closed jvivian closed 8 years ago

hannes-ucsc commented 8 years ago

What's this about?

jvivian commented 8 years ago

@arkal and I were discussing the modularity of toil-scripts. The way he's setting up ProTect, generic job functions for tools are imported, and the pipeline script is mostly just a wrapper that wires those tools together. This avoids having to import from another pipeline and centralizes our base of tool job functions, i.e., aligners.py contains job functions like STAR and BWA, quantifiers.py contains job functions like RSEM and Kallisto, etc.
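
As a rough sketch of that wiring (the function names are hypothetical stand-ins, not existing code):

def run_star(job, r1_id, r2_id, star_index_id):
    """Stand-in for a generic STAR job function that would live in aligners.py;
    the real function would align the reads and return the BAM's FileStoreID."""
    return 'aligned-bam-filestore-id'  # placeholder

def run_rsem(job, bam_id, rsem_ref_id):
    """Stand-in for a generic RSEM job function that would live in quantifiers.py."""
    return 'rsem-output-filestore-id'  # placeholder

def rnaseq_pipeline(job, r1_id, r2_id, star_index_id, rsem_ref_id):
    """The pipeline script is just wiring: chain the tool job functions and pass
    promises (rv()) from one to the next."""
    align = job.addChildJobFn(run_star, r1_id, r2_id, star_index_id)
    job.addFollowOnJobFn(run_rsem, align.rv(), rsem_ref_id)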

hannes-ucsc commented 8 years ago

And where exactly would those functions live?

jvivian commented 8 years ago

Common lib: toil_scripts/lib/quantifiers.py, toil_scripts/lib/aligners.py, etc

fnothaft commented 8 years ago

I'm not entirely sure I grok this proposal. General thoughts below:

and the way he's setting up ProTect is that generic job functions for tools should be imported and that the pipeline script is mostly just a wrapper wiring these tools together.

I strongly agree with this! I've strived to follow this approach in the adam_gatk_pipeline. Is this similar to what @arkal is planning to do?

However, an important caveat:

This avoids having to import from another pipeline and centralizes our tool base of job functions.

Oftentimes, you want to import an encapsulated series of jobs from another pipeline. E.g., the gatk_preprocess stages are shared between adam_gatk_pipeline and exome_variant_pipeline.

This may be a terminology nit. What do you (@jvivian) and @arkal define "a pipeline" to be?

My operating definition is that a pipeline = a workflow = a stitched together DAG of jobs. Given that definition, I don't think of STAR or BWA or etc as jobs because they're composed of multiple different jobs (e.g., download_sample, download_indices, align_reads, etc).

I.E. aligners.py contains job functions like STAR and BWA, quantifiers.py contains job functions like RSEM and Kallisto etc.

I think the answer to my next question hinges on the prior terminology question (re: what is a "job" vs. "a pipeline"?), but I'll ask it anyway. Do you plan to have the aligners in aligners.py implement a common interface? This would be something higher level than Job —> e.g., AlignmentJob.

If the answer is yes, what do these interfaces look like? This problem is fairly straightforward for alignment—all aligners more or less look the same: an aligner takes unaligned reads and an index and produces aligned reads—but looking at quantification, RSEM and Kallisto are drastically different (alignment-based vs. alignment-free, totally different parameters). How do you handle all the tools having different sets of parameters?

If the answer is no, then why are you lumping different tools together into a single module? If it is just for organizational purposes, I would rather not have a single module per tool type. A single package per tool type does seem reasonable though. E.g., toil.tools.aligners is a package that contains the toil.tools.aligners.bwa, toil.tools.aligners.star, etc. modules.
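
To illustrate what I mean by a common interface, a very rough sketch (nothing here exists; the class, fields, and defaults are purely illustrative):

from toil.job import Job

class AlignmentJob(Job):
    """Illustrative only: a common interface for aligners, one level above Job.
    Every aligner takes unaligned reads plus an index and produces aligned reads."""

    def __init__(self, reads_id, index_id, cores=1, memory='4G', disk='20G'):
        super(AlignmentJob, self).__init__(cores=cores, memory=memory, disk=disk)
        self.reads_id = reads_id  # FileStoreID of the unaligned reads
        self.index_id = index_id  # FileStoreID of the aligner's index

    def run(self, fileStore):
        # Subclasses (a hypothetical BwaJob, StarJob, ...) would do the actual
        # alignment and return the aligned BAM's FileStoreID.
        raise NotImplementedError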

Common lib: toil_scripts/lib/quantifiers.py, toil_scripts/lib/aligners.py, etc

I don't think that the common lib is a logical home for these tools. When I think "common/core lib", I tend to think of things that are algorithm/problem agnostic.

The way I would think of this for toil is that while we want to use toil for genomics, at the end of the day nothing about toil is genomics specific. We can use toil for astrophysics pipelines or NLP pipelines just as well as we can use it for genomics, and so on. Thus, functionality that is genomics specific should not go into the core lib —> https://github.com/BD2KGenomics/toil/issues/922.

I do think toil_scripts is genomics specific, but I would prefer a different package layout. I would suggest toil_scripts.tools.x.y where x is a tool type ("aligner", "quantifier", etc) and y is a specific tool ("RSEM", "Kallisto").
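
For concreteness, a hypothetical sketch of that layout (file and function names are illustrative only):

toil_scripts/
    tools/
        aligners/
            bwa.py         # e.g. a run_bwa job function
            star.py        # e.g. a run_star job function
        quantifiers/
            kallisto.py    # e.g. a run_kallisto job function
            rsem.py        # e.g. a run_rsem job function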

jvivian commented 8 years ago

What do you (@jvivian) and @arkal define "a pipeline" to be? My operating definition is that a pipeline = a workflow = a stitched together DAG of jobs. Given that definition, I don't think of STAR or BWA or etc as jobs because they're composed of multiple different jobs (e.g., download_sample, download_indices, align_reads, etc).

We have the same definition. STAR and BWA aren't part of a pipeline until someone does the wiring. They can be defined as standalone functions whose inputs are pipeline-independent and whose return values (e.g., STAR's output) are also pipeline-generic.

Do you plan to have the aligners in aligners.py implement a common interface?

I was striving for arguments to the job functions that are pipeline independent. Generally, I've been making the arguments FileStoreIDs of the inputs and nothing else — no vague/enigmatic "config" or "inputs" objects/dicts. E.g.:

import os  # docker_call below is the toil_scripts helper for running tools in Docker

def index_bam(job, bam_id):
    """
    Runs samtools index to create (.bai) files

    :param JobFunctionWrappingJob job: passed automatically by Toil
    :param str bam_id: FileStoreID of the bam file
    :return: BAM index FileStoreID
    :rtype: str
    """
    work_dir = job.fileStore.getLocalTempDir()
    job.fileStore.readGlobalFile(bam_id, os.path.join(work_dir, 'sample.bam'))
    # Call: index the bam
    parameters = ['index', '/data/sample.bam']
    docker_call(work_dir=work_dir, parameters=parameters,
                tool='quay.io/ucsc_cgl/samtools:0.1.19--dd5ac549b95eb3e5d166a5e310417ef13651994e')
    return job.fileStore.writeGlobalFile(os.path.join(work_dir, 'sample.bam.bai'))
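
And a hypothetical example of how a pipeline would wire this in, with per-job resources set at wiring time and the returned FileStoreID consumed as a promise:

def some_pipeline_step(job, bam_id):
    """Hypothetical wiring inside a pipeline script: attach the generic index_bam
    job as a child and hand its FileStoreID back as a promise."""
    return job.addChildJobFn(index_bam, bam_id, disk='20G').rv()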

How do you handle all the tools having different sets of parameters?

Arguments are just file inputs and other necessary parameters (like cores for multithreading). The one thing I haven't really ironed out is the best way to make a tool's parameters configurable, but that's not an issue created by this proposal. Since we've moved to YAML configs, we could try storing the parameters in the config, but it's a bit tricky because the parameters are tweaked in every function to reflect input file names, a dynamic number of cores, etc.

An example for MuTect:

def run_mutect(job, normal_bam, normal_bai, tumor_bam, tumor_bai, ref, ref_dict, fai, cosmic, dbsnp):
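
On the config point above, one possible (untested) direction is to keep file paths and core counts inside the job function and pull only user-tweakable flags from the YAML config. The helper and config key below are hypothetical:

import yaml

def tool_flags(config_text, tool, default):
    """Hypothetical helper: read extra, user-tweakable flags for a tool from the
    YAML config, falling back to a default if none are given."""
    config = yaml.safe_load(config_text) or {}
    return config.get(tool, {}).get('extra_parameters', default)

# The job function still owns paths and core counts; only '-b 100'-style flags
# come from the config.
example_config = "kallisto:\n  extra_parameters: ['-b', '100']\n"
parameters = ['quant', '-i', '/data/kallisto.idx', '-t', '4', '-o', '/data/']
parameters += tool_flags(example_config, 'kallisto', default=['-b', '100'])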

A single package per tool type does seem reasonable though. E.g., toil.tools.aligners is a package that contains the toil.tools.aligners.bwa, toil.tools.aligners.star, etc. modules.

That's a great idea.

I would suggest toil_scripts.tools.x.y where x is a tool type ("aligner", "quantifier", etc) and y is a specific tool ("RSEM", "Kallisto").

+1 — @hannes-ucsc @arkal @jpfeil, opinions on this layout?

fnothaft commented 8 years ago

What do you (@jvivian) and @arkal define "a pipeline" to be?

My operating definition is that a pipeline = a workflow = a stitched together DAG of jobs. Given that definition, I don't think of STAR or BWA or etc as jobs because they're composed of multiple different jobs (e.g., download_sample, download_indices, align_reads, etc).

We have the same definition. STAR and BWA aren't part of a pipeline until someone does the wiring. They can be defined as stand alone functions with inputs that are pipeline-independent, with a return statement for the output of STAR that is also pipeline generic.

Ah, I'm not sure if we're on the same page here. In my world, since BWA is a DAG of jobs, BWA is a pipeline.

I was striving for arguments to the job functions that are pipeline independent. Generally, I've been making the arguments be FileStoreIDs of the inputs and nothing else. No vague / enigmatic "config" or "inputs" objects/dicts. E.G.

That might not be sufficiently general for tools that need a globally mounted filesystem, instead of a global file store. For instance, ADAM needs S3 or HDFS paths, right now. We could probably make this work though. Not sure I've got a solution, but I'll think about it.

jvivian commented 8 years ago

Ah, I'm not sure if we're on the same page here. In my world, since BWA is a DAG of jobs, BWA is a pipeline.

Look at the BWA pipeline — BWA(kit) is defined in a single job function.

That might not be sufficiently general for tools that need a globally mounted filesystem, instead of a global file store. For instance, ADAM needs S3 or HDFS paths, right now. We could probably make this work though. Not sure I've got a solution, but I'll think about it.

Enforcing those path types could be part of the pipeline's responsibility. Once you've generalized and exported the tool job functions, the "pipeline" becomes little more than a wrapper to handle job wiring, promise resolution, CLI, and require() / user input handling.
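
A minimal sketch of what that wrapper could look like (the root job and the sample check are placeholders, not real pipeline code):

import argparse
from toil.job import Job

def root(job, sample_url):
    """Placeholder root job; a real pipeline would wire the generic tool job
    functions together here and resolve their promises."""
    return sample_url

def main():
    parser = argparse.ArgumentParser(description='Example wrapper pipeline')
    parser.add_argument('--sample', required=True, help='URL of the input sample')
    Job.Runner.addToilOptions(parser)
    args = parser.parse_args()
    # require()-style user input handling
    if not args.sample.lower().endswith(('.fastq', '.fastq.gz')):
        raise ValueError('--sample must point to a fastq file')
    Job.Runner.startToil(Job.wrapJobFn(root, args.sample), args)

if __name__ == '__main__':
    main()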

fnothaft commented 8 years ago

Ah, I'm not sure if we're on the same page here. In my world, since BWA is a DAG of jobs, BWA is a pipeline.

Look at the BWA pipeline — BWA(kit) is defined in a single job function.

Ah, got it now! So to parrot things back then, what you're proposing is that the exact functional stage that transforms some filestore IDs into new analyzed filestore IDs is moved into a "tools" package.

How would this work for tools that make multiple analysis passes? E.g., IndelRealignment or BQSR in the GATK. Would we enforce that those would be written as a single job?

jvivian commented 8 years ago

Ah, got it now! So to parrot things back then, what you're proposing is that the exact functional stage that transforms some filestore IDs into new analyzed filestore IDs is moved into a "tools" package.

[image]

How would this work for tools that make multiple analysis passes? E.g., IndelRealignment or BQSR in the GATK. Would we enforce that those would be written as a single job?

I've thought about this a bit, and my personal take is that if it's unlikely, or outside our concern, that someone would want to run only part of that 4-step GATK pre-processing, then it should be supplied to the user as "gatk_preprocessing", with RTC dynamically calling IR, which dynamically calls BQSR, which dynamically calls PrintReads, which returns the processed BAM.
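
A rough sketch of that dynamic chaining (the step bodies are placeholders; only the chaining pattern matters here):

def run_rtc(job, bam_id):
    """Placeholder RealignerTargetCreator step."""
    intervals_id = bam_id  # a real step would run the GATK and store its output
    return job.addFollowOnJobFn(run_ir, bam_id, intervals_id).rv()

def run_ir(job, bam_id, intervals_id):
    """Placeholder IndelRealigner step; chains on to BQSR the same way."""
    realigned_id = bam_id
    return job.addFollowOnJobFn(run_bqsr, realigned_id).rv()

def run_bqsr(job, bam_id):
    """Placeholder BQSR + PrintReads step; returns the processed BAM's FileStoreID."""
    return bam_id

def gatk_preprocessing(job, bam_id):
    """The single entry point a pipeline would wire in; the chain above resolves
    to the processed BAM's FileStoreID."""
    return job.addChildJobFn(run_rtc, bam_id).rv()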

fnothaft commented 8 years ago

How would this work for tools that make multiple analysis passes? E.g., IndelRealignment or BQSR in the GATK. Would we enforce that those would be written as a single job?

I've thought about this a bit, and my personal take is that if it's unlikely, or outside our concern, that someone would want to run only part of that 4-step GATK pre-processing, then it should be supplied to the user as "gatk_preprocessing", with RTC dynamically calling IR, which dynamically calls BQSR, which dynamically calls PrintReads, which returns the processed BAM.

In my view of a "just world", no one ever runs INDEL realignment ever again (conditioned on them using a reassembly-based variant caller —> e.g., HaplotypeCaller; you somatic people using IR + MuTect are still OK in my book). But yes, if what you're saying is that we don't package up the GATK as a "tool" (and we just package it as a pipeline), that is OK by me.

However, my point still stands. I'm not 100% sure about Kallisto, but other alignment-free quantifiers (thinking of Sailfish, for a sec) are two-pass (Index then Quantify). How do we accommodate those tools? Do we? Do we force them into a single Job?

jvivian commented 8 years ago

However, my point still stands. I'm not 100% sure about Kallisto, but other alignment-free quantifiers (thinking of Sailfish, for a sec) are two-pass (Index then Quantify). How do we accommodate those tools? Do we? Do we force them into a single Job?

IMO, if the indexing step has different input requirements than the tool (as in Kallisto's case), they are separate job functions. This is just an example, because people are going to have opinions on the return format (here I'm returning a FileStoreID of a tarball of Kallisto's outputs) and on whether arguments should always be FileStoreIDs rather than something like a URL. Sometimes it's convenient to roll steps into one function — like a simple faidx when the .fai file isn't present, which doesn't require any other inputs.

import os

def run_kallisto(job, cores, r1_id, r2_id, kallisto_index):
    """
    RNA quantification via Kallisto

    :param JobFunctionWrappingJob job: passed automatically by Toil
    :param int cores: Number of cores to run Kallisto with
    :param str r1_id: FileStoreID of fastq (pair 1)
    :param str r2_id: FileStoreID of fastq (pair 2 if applicable, otherwise pass None for single-end)
    :param str kallisto_index: FileStoreID for Kallisto index file
    :return: FileStoreID from Kallisto output
    :rtype: str
    """
    work_dir = job.fileStore.getLocalTempDir()
    # Retrieve the index from the file store
    job.fileStore.readGlobalFile(kallisto_index, os.path.join(work_dir, 'kallisto.idx'))
    parameters = ['quant',
                  '-i', '/data/kallisto.idx',
                  '-t', str(cores),
                  '-o', '/data/',
                  '-b', '100']
    if r1_id and r2_id:
        job.fileStore.readGlobalFile(r1_id, os.path.join(work_dir, 'R1_cutadapt.fastq'))
        job.fileStore.readGlobalFile(r2_id, os.path.join(work_dir, 'R2_cutadapt.fastq'))
        parameters.extend(['/data/R1_cutadapt.fastq', '/data/R2_cutadapt.fastq'])
    else:
        job.fileStore.readGlobalFile(r1_id, os.path.join(work_dir, 'R1_cutadapt.fastq'))
        parameters.extend(['--single', '-l', '200', '-s', '15', '/data/R1_cutadapt.fastq'])
    # Call: Kallisto
    docker_call(tool='quay.io/ucsc_cgl/kallisto:0.42.4--35ac87df5b21a8e8e8d159f26864ac1e1db8cf86',
                work_dir=work_dir, parameters=parameters)
    # Tar output files together and store in fileStore
    output_files = [os.path.join(work_dir, x) for x in ['run_info.json', 'abundance.tsv', 'abundance.h5']]
    tarball_files(tar_name='kallisto.tar.gz', file_paths=output_files, output_dir=work_dir)
    return job.fileStore.writeGlobalFile(os.path.join(work_dir, 'kallisto.tar.gz'))

fnothaft commented 8 years ago

OK, this sounds prudent to me. Essentially, everything in toil_scripts.tools will just be the rote job functions that invoke a single stage of a job, and then there will be a variety of pipelines (which we might want to move into toil_scripts.pipelines or something) that will wire together the toil_scripts.tools job functions.

I still stand by not moving these into toil-lib.

arkal commented 8 years ago

I'm arkal and I (think) I approve of this message (or the TL;DR version John gave me).

To see a glimpse of what I'm doing, you can check out https://github.com/BD2KGenomics/protect/tree/issues/32-add-muse-support/src/protect

hannes-ucsc commented 8 years ago

@arkal you might want to link to a specific line

arkal commented 8 years ago

@hannes-ucsc The link was to show the structure of protect.x.y, similar to what @fnothaft suggested.