bihealth / snappy-pipeline

SNAPPY Nucleic Acid Processing in Python
MIT License
8 stars 4 forks source link

Somatic update #156

Open eudesbarbosa opened 2 years ago

eudesbarbosa commented 2 years ago

Tasks

Tasks extracted from PR #94 :

Based on commit 0a30513f360b82828efeed6835f362df999f776a

Based on commit e3be244584549cb2f623e4f4e0112b1643376fc0

Based on commit 1a451614bf10c840f2e5566463078c7cb40f16c0

Based on commit c2594449df3f9ecff92a04e0b6f975d061cf38ce

Based on commit 18b83627a204e8ec077dd93d4d60a16e4158c91f

Based on commit c46917387b3552c08113b6fab9e88e88c18c6ebf

Based on commit d407f2075ffe387988ae2e72d6b27b23e3746606

Based on commit f062fc5fca06e846eb6189b87476d21c2928ae36

eudesbarbosa commented 2 years ago

For 'Unify output filenames for Control Freec and CopyWriter': one should also consider unifying cnvetti_on_target_postprocess output. Currently using _ to separate the file extension:

output/bwa.cnvetti_on_target_postprocess.P00{i}-T{t}-DNA1-WGS1/out/bwa.cnvetti_on_target_postprocess.P00{i}-T{t}-DNA1-WGS1_{ext}

Line 290 - method CnvettiStepPartBase._get_output_files_postprocess()

eudesbarbosa commented 2 years ago

I don't understand the changes to snappy_wrappers/wrappers/control_freec/transform/snappy-convert-control_freec.R: -> They seem to be the reverse of the changes to the workflow.

--                                   ratios_fn=paste("freec.", sample_name, ".ratio.txt", sep=""),
--                                   log2_fn=paste("freec.", sample_name, ".log2.txt", sep=""),
--                                   call_fn=paste("freec.", sample_name, ".call.txt", sep=""),
--                                   segments_fn=paste("freec.", sample_name, ".segments.txt", sep=""),
+-                                   ratios_fn=paste("freec.", sample_name, "_ratio.txt", sep=""),
+-                                   log2_fn=paste("freec.", sample_name, "_gene_log2.txt", sep=""),
+-                                   call_fn=paste("freec.", sample_name, "_gene_call.txt", sep=""),
+-                                   segments_fn=paste("freec.", sample_name, "_segments.txt", sep=""),

For the function postProcess in snappy_wrappers/wrappers/copywriter/call/snappy-copywriter-call.R: -> Can't the outputs be an argument, similar to control_freec_write_files, so they don't need to be agnostic towards the rest of the code? Example:

--                         paste( mapper, ".copywriter.", fullID, "_segments.txt", sep="" ) ),
+-                         paste( mapper, ".copywriter.", fullID, ".segments.txt", sep="" ) ), 
eudesbarbosa commented 2 years ago

Another points about control_freec_write_files, it seems to have default values but the code will stop if files are not accessible. Are defaults really necessary? Can't they just be defined/taken from the wrapper?

ericblanc20 commented 2 years ago

I am afraid I have created a monster here. I should have stuck to a naming convention for all CNV steps:

You are welcome to unify the naming convention.

On a side note, I found that output filenames are configured differently in different places,even within the same step & tool (for example names are constructed with underscore and with dot).

I don't know which is the right way to go. Ideally, I would like to:

This is probably a lot of unnecessary work, but perhaps we should consider moving to dot separation across the somatic CNV steps.

ericblanc20 commented 2 years ago

Regarding the control_freec_write_files, I assume you are referring to the Bioconductor packages org.Hs.eg.db, TxDb.Hsapiens.UCSC.hg19.knownGene & BSgenome.Hsapiens.1000genomes.hs37d5.

This is a real problem, because these packages are huge (genome sequence, genome feature annotations, gene ids & functional annotations), and they are only valid for one genome release. If we want to use GRCh38, or mouse data, then we need to use other version of these packages.

The problem comes as the packages are downloaded using the wrapper's conda environment. So the only solution (at the moment) is to put in the conda environment annotation packages for several genome releases and species.

We discussed with Clemens a way around this problem. We would have an initial sub-step which creates a sub-directory in $PWD/work in which required packages are installed. This would then be used as supplementary library path for the R scripts, provided by the wrapper. I have a pretty good idea how we could do that, but I need a few days to work on it.

eudesbarbosa commented 2 years ago

Name pattern: Probably a good idea to stick to one across all workflows. I think I went over all the ones present in the original PR, but it might be something to gradually fix.

Workflow distinction: I suggest we keep the separation. Too much work for very little value.

control_freec_write_files: Way beyond my knowledge of the topic. I was just referring to the fact that there seems to be some default values for the output arguments - in this case you needed to modify them, you might consider have no defaults and avoid that:

--                                   ratios_fn=paste("freec.", sample_name, ".ratio.txt", sep=""),
+-                                   ratios_fn=paste("freec.", sample_name, "_ratio.txt", sep=""),
ericblanc20 commented 2 years ago

Sorry I misunderstood your comment. You are absolutely right: ratios_fn should be obtained from the wrapper.