DKFZ-ODCF / COWorkflowsBasePlugin

A basic Roddy plugin for computational oncology workflows
MIT License

Description

The COWorkflowsBasePlugin (Computational Oncology Workflows Base Plugin for Roddy) provides general classes and a framework used by several other Roddy plugins. It includes both JVM-based code (Java, Groovy) and command-line tools used in cluster jobs.

Run flags / switches

Hints

Alignment folder

The alignment folder is referenced several times. For the plugin to work, you currently need a folder for your dataset, e.g.:

/tmp/[dataset_id]

Inside this, you will need to create the alignment subfolder:

/tmp/[dataset_id]/alignment

Inside the alignment subfolder, you may have to place or link your merged BAMs (depending on the workflow), e.g.:

/tmp/[dataset_id]/alignment/[sample_id]_[dataset_id]_merged.rmdup.bam
/tmp/[dataset_id]/alignment/[sample_id]_[dataset_id]_merged.rmdup.bam.bai

It should be possible to just link the files in there.
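
The layout above can be prepared with a few lines of scripting. The following is a minimal sketch, not part of the plugin: the function name `link_merged_bam` and its parameters are hypothetical, and it assumes the BAM index lies next to the BAM as `<bam>.bai`.

```python
import os
from pathlib import Path


def link_merged_bam(base_dir, dataset_id, sample_id, bam_path, alignment_dir="alignment"):
    """Symlink a merged BAM and its index into <base_dir>/<dataset_id>/<alignment_dir>.

    Hypothetical helper for illustration; the file naming follows the
    [sample_id]_[dataset_id]_merged.rmdup.bam pattern described above.
    """
    target_dir = Path(base_dir) / dataset_id / alignment_dir
    target_dir.mkdir(parents=True, exist_ok=True)

    bam_name = f"{sample_id}_{dataset_id}_merged.rmdup.bam"
    links = []
    for suffix in ("", ".bai"):
        # Assumption: the index file sits next to the BAM as <bam>.bai
        source = Path(str(bam_path) + suffix)
        link = target_dir / (bam_name + suffix)
        if not (link.is_symlink() or link.exists()):
            os.symlink(source.resolve(), link)
        links.append(link)
    return links
```

Calling `link_merged_bam("/tmp", "pid1", "tumor", "/data/some.bam")` would create `/tmp/pid1/alignment/tumor_pid1_merged.rmdup.bam` and the matching `.bai` link.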

Whenever we speak of the alignment folder, we mean the structure described above. You can change the name of the alignment folder by overriding it in your XML:

<cvalue name='alignmentOutputDirectory' value='alignment' type="path"/>

For classes extending WorkflowUsingMergedBams:

| Switch | Default | Description |
|---|---|---|
| isNoControlWorkflow | false | Set to true to allow this workflow to work without a control BAM file. |
| workflowSupportsMultiTumorSamples | false | Allow the workflow to run with several tumor BAM files. This is done with a for loop (see the code documentation in WorkflowUsingMergedBams). |

For sample extraction from filenames:

To extract samples from filenames, multiple methods exist or are planned. You can control the workflow's behaviour with the variable "selectSampleExtractionMethod".

Valid values with their control variables are:

| Switch | Value | Description |
|---|---|---|
| selectSampleExtractionMethod | version_1 | (Default) The old method for extracting samples from filenames. |
| selectSampleExtractionMethod | version_2 | The new method. |

"version_1"

This method is very (too) simple: it splits the filename on underscores '_' and uses the first element as the sample name. Further control is possible with:

| Switch | Default | Description |
|---|---|---|
| enforceAtomicSampleName | false | Defines whether the method shall append '_' to the search pattern. The method then searches e.g. for 'control_' or 'tumorsomething_'. |

Please take a close look at the file SampleFromFilenameExtractorVersionOneTest to see a table of filenames and expected samples.

Note that, in contrast to version_2, this method does not take the configured samples in possibleControlSampleNamePrefixes and possibleTumorSampleNamePrefixes into account and will return any file prefix separated by "_". So you should not have underscores in your sample names.
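
The core of this behaviour can be sketched in a few lines. This is an illustrative approximation, not the plugin's code: the function name `extract_sample_version_1` is hypothetical, and the `enforceAtomicSampleName` switch is not modelled.

```python
import os


def extract_sample_version_1(filename):
    """Return everything before the first underscore of the file's basename.

    Illustrative sketch only; it ignores enforceAtomicSampleName and the
    configured sample name prefixes.
    """
    basename = os.path.basename(filename)
    return basename.split("_")[0]
```

With this sketch, `control_pid_merged.rmdup.bam` yields `control`, while `tumor_02_pid_merged.rmdup.bam` yields only `tumor`, which illustrates why sample names must not contain underscores with this method.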

"version_2"

The method is quite complex and can detect a variety of samples. The basic settings will use the samples set in possibleControlSampleNamePrefixes and possibleTumorSampleNamePrefixes as prefixes for the sample search. E.g. "con" will extract "control" from "control_some_merged.bam" and "control02" from "control02_some_merged.bam". As in version_1, "_" is used as a delimiter for the extraction. Note that, in contrast to version_1, samples may contain "_" delimiters in their name! A sample prefix like "control_sample" will work.

Before the sample is extracted, both prefix lists are joined and sorted in reverse order. Let's say you have:

    possibleControlSampleNamePrefixes=( control control02 control_sample )
    possibleTumorSampleNamePrefixes=( tumor xeno tumor_02 )

you will get the following list for the extraction:

    xeno
    tumor_02
    tumor
    control_sample
    control02
    control

We do this to search for the most specific sample prefix first. Otherwise, in the case above, control would be preferred over the more specific control_sample or control02.
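
The ordering can be reproduced with a plain reverse sort. The sketch below is an illustration in Python (the plugin itself is JVM-based, and the function name `ordered_sample_prefixes` is hypothetical); it assumes a simple character-code ordering of the strings.

```python
def ordered_sample_prefixes(control_prefixes, tumor_prefixes):
    """Join both prefix lists and sort them in reverse order.

    A longer prefix sorts after any shorter prefix it starts with, so after
    reversing, the more specific prefix is tried first.
    """
    return sorted(list(control_prefixes) + list(tumor_prefixes), reverse=True)


prefixes = ordered_sample_prefixes(
    ["control", "control02", "control_sample"],
    ["tumor", "xeno", "tumor_02"],
)
# prefixes == ["xeno", "tumor_02", "tumor", "control_sample", "control02", "control"]
```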

You can modify the search behaviour with several switches:

| Switch | Default | Description |
|---|---|---|
| matchExactSampleNames | false | If set, samples are extracted exactly as they are set in the config. This is compatible with allowSampleTerminationWithIndex. |
| allowSampleTerminationWithIndex | true | Allow recognition of trailing integer numbers for sample names, where the index may be separated from the prefix by an underscore, e.g. both "tumor02" and "tumor_02" would be matched with "possibleTumorSampleNamePrefixes=tumor". |
| useLowerCaseFilenamesForSampleExtraction | true | Tells the method to work on lowercase filenames: filenames are converted to lower case before matching. |

Please take a close look at the file SampleFromFilenameExtractorVersionTwoTest. There is a large test case "Version_2: Extract sample name from BAM basename", which features a table with inputs, switches and expected output.

    matchExactSampleName=false
    allowSampleTerminationWithIndex=true
    useLowerCaseFilenameForSampleExtraction=true

Note that these are the default settings for the version_2 algorithm.

If you want only exact matching against the names in possible(Tumor|Control)SampleNamePrefixes, you can use:

    matchExactSampleName=true
    allowSampleTerminationWithIndex=false
    useLowerCaseFilenameForSampleExtraction=false
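
To make the interplay of the three switches more concrete, here is a simplified sketch of how such a matcher could work. It is not the plugin's implementation: the function name, the parameter names and the regular expression are assumptions chosen for illustration, and the test table in SampleFromFilenameExtractorVersionTwoTest remains the authoritative description of the behaviour.

```python
import re


def extract_sample_version_2(filename, prefixes, match_exact=False,
                             allow_index=True, lower_case=True):
    """Return the sample found at the start of `filename`, or None.

    Hypothetical sketch of the switches described above:
    match_exact ~ matchExactSampleName(s)
    allow_index ~ allowSampleTerminationWithIndex
    lower_case  ~ useLowerCaseFilename(s)ForSampleExtraction
    """
    name = filename.lower() if lower_case else filename
    # Optional trailing integer index, with or without a separating underscore
    index = r"(?:_?\d+)?" if allow_index else ""
    # Try the most specific prefixes first (reverse sorted order)
    for prefix in sorted(prefixes, reverse=True):
        # In non-exact mode the prefix may be continued up to the next underscore
        stem = re.escape(prefix) + ("" if match_exact else r"[^_]*")
        match = re.match(f"({stem}{index})_", name)
        if match:
            return match.group(1)
    return None
```

With the default switches, `extract_sample_version_2("control_some_merged.bam", ["con"])` returns `control`. With exact matching and the other two switches turned off, `tumor03_pid_merged.bam` is not matched by the prefix `tumor`.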

Also note that there is a variable called searchMergedBamWithSeparator, which defaults to "true".

        <cvalue name='searchMergedBamWithSeparator' value='true' type="boolean"/>

It determines whether the sample name is separated from the patient identifier by an underscore "_". Leave this value set to "true" even when using matchExactSampleNames, because otherwise the BAM file search could still find more than one BAM file when several files share the same prefix (e.g. "tumor" was extracted, but it matches both "tumor" and "tumor03" during the BAM file search).
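
The ambiguity can be illustrated with a small sketch. The function `find_merged_bams` is hypothetical and only mimics a prefix-based file search; it is not the plugin's search code.

```python
def find_merged_bams(filenames, sample, with_separator=True):
    """Return the BAM files whose name starts with the sample name.

    with_separator mimics searchMergedBamWithSeparator: the sample name
    must be followed directly by an underscore.
    """
    prefix = sample + "_" if with_separator else sample
    return [f for f in filenames if f.startswith(prefix) and f.endswith(".bam")]


files = [
    "tumor_pid_merged.rmdup.bam",
    "tumor03_pid_merged.rmdup.bam",
    "tumor_pid_merged.rmdup.bam.bai",
]
unique = find_merged_bams(files, "tumor", with_separator=True)
ambiguous = find_merged_bams(files, "tumor", with_separator=False)
```

Here `unique` contains only `tumor_pid_merged.rmdup.bam`, whereas `ambiguous` also contains the `tumor03` file.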

"regex" (planned)

Not implemented, but planned.

For sample extraction from the alignment directory

| Switch | Default | Description |
|---|---|---|
| extractSamplesFromOutputFiles | false | If this is true and samples are neither passed by metadata table, configuration or sample list, samples are extracted from files in the alignment folder. |
| extractSampleNameOnlyFromBamFiles | false | By default, the method will search for samples in all files in the alignment directory. With this switch, you can restrict it to BAM files. |
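
As a rough illustration of what the second switch changes, the following sketch collects sample names from a directory. It is an assumption-laden example, not the plugin's code: the function name is hypothetical, and it uses the simple version_1 style split on the first underscore to derive the sample name.

```python
from pathlib import Path


def samples_from_alignment_dir(alignment_dir, only_bam_files=False):
    """Collect sample names from the files in the alignment directory.

    only_bam_files mimics extractSampleNameOnlyFromBamFiles. The sample
    name is taken as the text before the first underscore (assumption).
    """
    samples = set()
    for path in Path(alignment_dir).iterdir():
        if not path.is_file():
            continue
        if only_bam_files and path.suffix != ".bam":
            continue
        samples.add(path.name.split("_")[0])
    return sorted(samples)
```

Without the restriction, unrelated files in the alignment folder (for example a report named `qc_report.txt`) would contribute a spurious sample name such as `qc`.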

Changelist

Changelist of COWorkflowsPlugin