LCR-BCCRC / lcr-modules

Collection of standard analytical pipelines for genomic and transcriptomic data
https://lcr-modules.rtfd.io
MIT License

generate_sets #174

Open rdmorin opened 3 years ago

rdmorin commented 3 years ago

We currently have a function in oncopipe (generate_pairs) that handles the complex task of matching tumours with their matched normal samples. I think we need a similar function in oncopipe that automatically constructs more complex sample sets with other groupings (not always 1:1). One use case I've encountered: for patients with more than one tumour, we want to run some analyses/tools on ALL the tumour bams (or tumour_mafs etc.), so we need to know the sample ID of every sample that exists for that patient and have them grouped properly. Something along the lines of:

op.generate_sets(SAMPLES, sample_types=('tumour_genome', 'normal_genome'), grouping='patient')

This could return a data frame with a column for each tumour genome that exists (named by the corresponding time point from a time_point column) or perhaps the tumour_genome column contains a list of genomes where more than one exists.
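To make the first of those two shapes concrete, here is a minimal pandas sketch of the "one column per time point" layout. It is only an illustration: the toy table and its column names (`sample_id`, `patient_id`, `sample_type`, `time_point`) are assumptions, not the actual oncopipe samples table schema.

```python
import pandas as pd

# Hypothetical samples table; the column names are assumptions for illustration.
SAMPLES = pd.DataFrame({
    "sample_id": ["P1_T1", "P1_T2", "P1_N", "P2_T1", "P2_N"],
    "patient_id": ["P1", "P1", "P1", "P2", "P2"],
    "sample_type": ["tumour_genome", "tumour_genome", "normal_genome",
                    "tumour_genome", "normal_genome"],
    "time_point": ["A", "B", None, "A", None],
})

# One column per tumour time point, one row per patient.
tumours = SAMPLES[SAMPLES["sample_type"] == "tumour_genome"]
wide = tumours.pivot(index="patient_id", columns="time_point", values="sample_id")
wide = wide.add_prefix("tumour_genome_")
wide.columns.name = None

# Attach the matched normal for each patient (assumes at most one per patient).
normals = (
    SAMPLES[SAMPLES["sample_type"] == "normal_genome"]
    .set_index("patient_id")["sample_id"]
    .rename("normal_genome")
)
by_time_point = wide.join(normals).reset_index()
print(by_time_point)
```

Patients without a tumour at a given time point end up with a missing value in that column, which is one argument for the list-valued alternative when the number of time points varies a lot between patients.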

Another use case: op.generate_sets(SAMPLES, sample_types=('tumour_genome', 'tumour_mrna'), grouping='tumour_sample')

This would return a data frame that has a column for the tumour genome and another for the RNA-seq sample (or samples) for that patient. In this case, I've indicated that grouping would be at the level of the sample instead of the patient, so if there are multiple time points, these would still be in separate rows.
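A rough sketch of how such a function might behave with list-valued columns is below. Nothing here exists in oncopipe yet: `generate_sets` is the proposed name from this issue, and the toy table's columns (`patient`, `tumour_sample`, `sample_type`, `sample_id`) are assumed purely so that the two calls above can be run as written.

```python
import pandas as pd

def generate_sets(samples, sample_types, grouping):
    """Hypothetical sketch: one row per group, one list-valued column per sample type."""
    subset = samples[samples["sample_type"].isin(sample_types)]
    records = []
    for group, rows in subset.groupby(grouping, sort=True):
        record = {grouping: group}
        for sample_type in sample_types:
            # Empty list when the group has no sample of this type
            matches = rows.loc[rows["sample_type"] == sample_type, "sample_id"]
            record[sample_type] = matches.tolist()
        records.append(record)
    return pd.DataFrame(records, columns=[grouping, *sample_types])

# Toy samples table; the schema is an assumption for illustration.
SAMPLES = pd.DataFrame({
    "sample_id": ["P1_T1_dna", "P1_T1_rna", "P1_T2_dna", "P1_N_dna", "P2_T1_rna"],
    "patient": ["P1", "P1", "P1", "P1", "P2"],
    "tumour_sample": ["P1_T1", "P1_T1", "P1_T2", "P1_N", "P2_T1"],
    "sample_type": ["tumour_genome", "tumour_mrna", "tumour_genome",
                    "normal_genome", "tumour_mrna"],
})

by_patient = generate_sets(
    SAMPLES, sample_types=("tumour_genome", "normal_genome"), grouping="patient"
)
by_sample = generate_sets(
    SAMPLES, sample_types=("tumour_genome", "tumour_mrna"), grouping="tumour_sample"
)
print(by_patient)
print(by_sample)
```

With `grouping='tumour_sample'`, each time point stays in its own row, and a sample with RNA-seq but no genome simply gets an empty `tumour_genome` list. Whether such rows should be kept or dropped is a design decision this sketch leaves open.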

lkhilton commented 3 years ago

@oncogenomics I've made a handy reprex for you to mess around with to get the gist of what we're trying to accomplish. You should have permissions to be able to run this directly:

cd /projects/rmorin_scratch/Laura_temp/oncopipe_sandbox
snakemake -np -s test_generate_pairs.smk all

Essentially we have results files that have been generated by retrieving wildcards from a "runs" table, which is itself generated from the input samples table (in my example the maf files fit this description). We might also have input files that aren't paired. We want to be able to easily write rules that take outputs from different pipelines and join them on patient_id and/or surgical number.

The issue with the way I've done it is that not every RNAseq sample has a genome, so some of the rules generated on the dry run have an input bam but no input maf files. Ideally the generate_sets function would handle this and only set mRNA bam files that ALSO have one or more genome maf files as targets.
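One way to get that behaviour is an inner join between the RNA-seq samples and the per-patient genome sets, so that only patients present in both tables become targets. The sketch below assumes two small hypothetical tables keyed on `patient_id`; the table names and the other column names are made up for illustration and are not taken from the sandbox Snakefile.

```python
import pandas as pd

# Hypothetical inputs: one row per RNA-seq bam, one row per genome maf.
rna_bams = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "mrna_sample_id": ["P1_rna", "P2_rna", "P3_rna"],
})
genome_mafs = pd.DataFrame({
    "patient_id": ["P1", "P1", "P3"],
    "genome_sample_id": ["P1_T1", "P1_T2", "P3_T1"],
})

# Collapse the genome samples to one list per patient.
maf_sets = (
    genome_mafs.groupby("patient_id")["genome_sample_id"]
    .agg(list)
    .rename("genome_sample_ids")
    .reset_index()
)

# The inner join drops RNA-seq samples whose patient has no genome maf (P2 here).
targets = rna_bams.merge(maf_sets, on="patient_id", how="inner")
print(targets)
```

Target file lists could then be built from `targets` alone, so no rule would be requested with a bam but no maf inputs.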