Somatic update - Githubissues

eudesbarbosa commented 2 years ago

Tasks

Tasks extracted from PR #94 :

Based on commit 0a30513f360b82828efeed6835f362df999f776a

[ ] Unify output filenames for Control Freec and CopyWriter.
[ ] Consolidate GRCh37 and GRCh38.

Based on commit e3be244584549cb2f623e4f4e0112b1643376fc0

[x] Output both Full and Filtered VCF files for mutect2.
[x] Rename 'common' to 'biallelic' variants - follow GATK convention.

Based on commit 1a451614bf10c840f2e5566463078c7cb40f16c0

[ ] Remove parallel somatic variant annotation.
[x] Catch multiple filter conditions in eb_filter.
[ ] Log conda information used for jannovar wrapper.

Based on commit c2594449df3f9ecff92a04e0b6f975d061cf38ce

[x] Expand list of result expansion files in snappy_pipeline/workflows/somatic_variant_calling/__init__.py - mutect2 related.

Based on commit 18b83627a204e8ec077dd93d4d60a16e4158c91f

[x] Fix log file in wrapper snappy_wrappers/wrappers/mutect2/contamination/wrapper.py
[x] Fix log file in wrapper snappy_wrappers/wrappers/mutect2/filter/wrapper.py

~~Based on commit c46917387b3552c08113b6fab9e88e88c18c6ebf~~

~~Update cbioportal_export default config.~~
~~Fix wrapper snappy_wrappers/wrappers/cbioportal/case_lists/wrapper.py.~~
~~Fix wrapper snappy_wrappers/wrappers/cbioportal/meta_files/wrapper.py.~~
~~Fix wrapper snappy_wrappers/wrappers/vcf2maf/vcf2maf/wrapper.py.~~

Based on commit d407f2075ffe387988ae2e72d6b27b23e3746606

[x] Disable collect target coverage when no target coverage files requested.

Based on commit f062fc5fca06e846eb6189b87476d21c2928ae36

Miscellaneous - hard to pin, probably just debugging.

eudesbarbosa commented 2 years ago

For 'Unify output filenames for Control Freec and CopyWriter': one should also consider unifying cnvetti_on_target_postprocess output. Currently using _ to separate the file extension:

output/bwa.cnvetti_on_target_postprocess.P00{i}-T{t}-DNA1-WGS1/out/bwa.cnvetti_on_target_postprocess.P00{i}-T{t}-DNA1-WGS1_{ext}

Line 290 - method CnvettiStepPartBase._get_output_files_postprocess()

eudesbarbosa commented 2 years ago

I don't understand the changes to snappy_wrappers/wrappers/control_freec/transform/snappy-convert-control_freec.R: -> They seem to be the reverse of the changes to the workflow.

--                                   ratios_fn=paste("freec.", sample_name, ".ratio.txt", sep=""),
--                                   log2_fn=paste("freec.", sample_name, ".log2.txt", sep=""),
--                                   call_fn=paste("freec.", sample_name, ".call.txt", sep=""),
--                                   segments_fn=paste("freec.", sample_name, ".segments.txt", sep=""),
+-                                   ratios_fn=paste("freec.", sample_name, "_ratio.txt", sep=""),
+-                                   log2_fn=paste("freec.", sample_name, "_gene_log2.txt", sep=""),
+-                                   call_fn=paste("freec.", sample_name, "_gene_call.txt", sep=""),
+-                                   segments_fn=paste("freec.", sample_name, "_segments.txt", sep=""),

For the function postProcess in snappy_wrappers/wrappers/copywriter/call/snappy-copywriter-call.R: -> Can't the outputs be an argument, similar to control_freec_write_files, so they don't need to be agnostic towards the rest of the code? Example:

--                         paste( mapper, ".copywriter.", fullID, "_segments.txt", sep="" ) ),
+-                         paste( mapper, ".copywriter.", fullID, ".segments.txt", sep="" ) ),

eudesbarbosa commented 2 years ago

Another points about control_freec_write_files, it seems to have default values but the code will stop if files are not accessible. Are defaults really necessary? Can't they just be defined/taken from the wrapper?

ericblanc20 commented 2 years ago

I am afraid I have created a monster here. I should have stuck to a naming convention for all CNV steps:

copywriter in somatic_targeted_seq_cnv_calling,
CNVkit in somatic_targeted_seq_cnv_calling & somatic_wgs_cnv_calling,
Control-FREEC in somatic_wgs_cnv_calling
cnvetti (germline)

You are welcome to unify the naming convention.

On a side note, I found that output filenames are configured differently in different places,even within the same step & tool (for example names are constructed with underscore and with dot).

I don't know which is the right way to go. Ideally, I would like to:

merge the somatic_targeted_seq_cnv_calling with the somatic_wgs_cnv_calling (there is no logical need to keep them apart, as the WES & WGS somatic snvs & indel calling are grouped in a single step)
unify using the dot as separator in filenames, the same way that the tools are separated.

This is probably a lot of unnecessary work, but perhaps we should consider moving to dot separation across the somatic CNV steps.

ericblanc20 commented 2 years ago

Regarding the control_freec_write_files, I assume you are referring to the Bioconductor packages org.Hs.eg.db, TxDb.Hsapiens.UCSC.hg19.knownGene & BSgenome.Hsapiens.1000genomes.hs37d5.

This is a real problem, because these packages are huge (genome sequence, genome feature annotations, gene ids & functional annotations), and they are only valid for one genome release. If we want to use GRCh38, or mouse data, then we need to use other version of these packages.

The problem comes as the packages are downloaded using the wrapper's conda environment. So the only solution (at the moment) is to put in the conda environment annotation packages for several genome releases and species.

We discussed with Clemens a way around this problem. We would have an initial sub-step which creates a sub-directory in $PWD/work in which required packages are installed. This would then be used as supplementary library path for the R scripts, provided by the wrapper. I have a pretty good idea how we could do that, but I need a few days to work on it.

eudesbarbosa commented 2 years ago

Name pattern: Probably a good idea to stick to one across all workflows. I think I went over all the ones present in the original PR, but it might be something to gradually fix.

Workflow distinction: I suggest we keep the separation. Too much work for very little value.

control_freec_write_files: Way beyond my knowledge of the topic. I was just referring to the fact that there seems to be some default values for the output arguments - in this case you needed to modify them, you might consider have no defaults and avoid that:

--                                   ratios_fn=paste("freec.", sample_name, ".ratio.txt", sep=""),
+-                                   ratios_fn=paste("freec.", sample_name, "_ratio.txt", sep=""),

ericblanc20 commented 2 years ago

Sorry I misunderstood your comment. You are absolutely right: ratios_fn should be obtained from the wrapper.

bihealth / snappy-pipeline

Somatic update #156