ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Recommendations to process large cohort of normal/tumoral RNASEQ pair #115

Closed ZheFrench closed 9 months ago

ZheFrench commented 9 months ago

I have around 130 normal/tumor RNA-Seq pairs. I was wondering what the best settings would be to run your software, especially for the reference normals.

You advise using 2-10 samples, but I could actually use a lot more. Would that be useful? And would it impact memory and running time?

Update: when you have a tumor and a matched normal, you can just use the matched normal sample as the reference, right?

What would be the best approach?

I have 80 cores with a total of ~800 GB of RAM, but no job scheduler...

ChristofferFlensburg commented 9 months ago

Hi!

Reference normals are used to filter variants (variants present in the reference will not be called as somatic variants), and as a diploid reference for read counts in CNA calling. Having more reference normals will catch more recurring noise for variants (such as sequencing artefacts around repeat regions and, for RNA-Seq, alignment issues around splice sites), and will give a better baseline and variance estimate for the CNA calling. The cost is that more reference normals use more memory and CPU time.

Avoid using the matched normal as a reference normal if you can. Instead, submit the matched normal together with the tumor, enter the same INDIVIDUAL in the metadata and YES in the NORMAL column for the matched normal, and superFreq will use it as a matched normal.

For RNA-Seq, filtering is probably more important, while memory and CPU use is typically lower, so more reference normals make sense. However, it also depends a lot on your data: in particular, if all your samples are quite uniform in quality (all processed in the same way, no batch effects), fewer reference normals are needed. For in-house batches with all data processed together, I've run RNA-Seq with the minimum of 2 reference normals and it's worked out fine, while for TCGA data I've been using up to 30 reference normals, as they are sometimes processed in different places.

So in your case, I think it depends a lot on whether your data has batch effects. If it's all one batch, then picking 10-30 random ones works (10 should probably be plenty), but try to avoid low-quality reference normals if you already have some QC on the samples. You will need at least two sets of normals, to avoid running a tumor with its own matched normal as reference. If your data is in batches, then using reference normals from the same batch is probably a better idea, even if it means fewer reference normals.

In general, for RNA-Seq it's important that the reference normals are from a tissue type as close as possible to the tumors; ideally, their gene expression profiles should match as closely as possible. That's important both for variant filtering (you can't filter if the reference normals don't express the gene) and for CNA calling (it's harder to tell whether there are more reads due to a CNA gain when lots of genes have very different read counts due to expression). So if you, for example, have solid tumors with blood normals, it might be better to find other samples from the tumor tissue to use as reference normals, even if they were sequenced elsewhere.

80 cores and 800GB is plenty. An RNA-Seq run typically goes through in a couple of hours with 4 CPUs, using maybe 20GB or so, I think. But it depends a lot on how many variants you give it in the input VCF, and it can vary from individual to individual, so you might have to rerun outliers with more memory. Do a few single runs first as a test (that'll also cache variants for the reference normals, speeding up later runs), but you should be fine for resources.

There is also some support for cohort-level analysis in superFreq, after all the individual runs are through. Put them all in the same input metaData.tsv, use the same plots and R directories, and then when everything is through you can run runSummaryPostAnalysis, which will give you some neat output, such as a CNA heatmap, a mutation matrix, and the invaluable relatedness heatmap showing whether the tumors and matched normals are actually from the same individual.

Have a look at the readme on github; there is some good information in ?superFreq on what the input should be. It might be worth having a look at the RNA-Seq paper as well for what to expect, if you haven't already.

Good luck!

ZheFrench commented 9 months ago

Thank you for this detailed answer. The RNA-Seq data has all been processed in the same place.

It is tricky to remove the matched normal from the reference directory. I understand your idea of two reference directories, but I'm afraid of introducing a kind of batch effect between the two sets with this approach.

The perfect approach would be to create a random 10-sample reference directory for each patient, avoiding selecting the matched normal for the set.

As it also uses a lot of CPU time, I wanted to use xargs/parallel to process more quickly, but that would be complicated because you need to create a directory each time with links to randomly selected normal samples.

I will think about it and give it a try.

ChristofferFlensburg commented 9 months ago

Just a note on parallelising: I'd avoid parallelising across individuals inside R (with, for example, the parallel package), as R struggles to keep the memory separate for each thread. You'd risk having two parallel superFreq runs use 4x the memory. Better to parallelise across different instances of R.
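That advice can be sketched with the xargs pattern already used in this thread: one Rscript process per individual, capped at 3 concurrent runs so memory stays bounded per process. The file and script names below are hypothetical stand-ins; drop the `echo` prefix to actually launch the runs.

```shell
#!/bin/sh
# One superFreq individual per R process, at most 3 concurrent, so each
# process keeps its own memory. individuals.txt and cnv_superFreq.R are
# hypothetical names; remove the 'echo' to actually launch the runs.
printf 'patient1\npatient2\npatient3\n' > individuals.txt
xargs -d '\n' -P 3 -I{} echo Rscript cnv_superFreq.R {} < individuals.txt
```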

ZheFrench commented 9 months ago

Yeah, I'm using a two-step approach. This creates a lot of data on disk, but anyway. Here is my logic, in case it can help others.

I start from the listing of my SAM files, so I first need to create the BAMs, sort and index them, and create the VCFs (~2 hours, that is the long part). I process them 10 by 10.

cat /sampleCNV-all.csv | xargs -P 10 -n 1 Rscript code/cnv.R

Then I create the metadata from the content of the directory where all the BAM/VCF files are stored.

Then for each tumor sample, I create a directory. I randomly select 10 samples among the normals, excluding the matched normal sample, and create symlinks for the BAMs and indexes. I run this in parallel using xargs, 3 by 3, giving the sample in participants and a normalDir with 10 differently randomized normal samples for each tumor participant.

tail +2 metadata.tsv | grep tumoral | xargs -d '\n' -P 3 -n 1 Rscript cnv_superFreq.R
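A minimal sketch of that per-tumor reference directory, assuming GNU shuf and one .bam/.bam.bai pair per normal sample; every file and directory name here is made up for illustration.

```shell
#!/bin/sh
# Stand-in data: 12 hypothetical normal samples with empty bam/index files.
mkdir -p bams
for i in $(seq -w 1 12); do
  : > "bams/N${i}.bam"; : > "bams/N${i}.bam.bai"
  echo "N${i}"
done > normals.txt

matched="N01"                      # this tumor's matched normal: exclude it
refdir="refdir_patientX"; mkdir -p "$refdir"

# Pick 10 random normals, skipping the matched one, and symlink bam + index.
grep -v "^${matched}$" normals.txt | shuf -n 10 | while read -r s; do
  ln -sf "$PWD/bams/${s}.bam"     "$refdir/${s}.bam"
  ln -sf "$PWD/bams/${s}.bam.bai" "$refdir/${s}.bam.bai"
done
```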

I did this for a few sets. Now, actually, I'm struggling to understand the output. There are a lot of files, and I don't know which ones to choose.

By the way, do you have a routine to estimate chromosome/arm-level CNVs? For example, if I want to check for a gain(1q)?

I will quickly test runSummaryPostAnalysis to see what it gives. Maybe I'll come back to ask more questions :)

ChristofferFlensburg commented 9 months ago

If you want to do downstream cohort analysis, then it's better to use the same R directory and plots directory in the superFreq() call for all individuals, as well as a single metaData.tsv input file with metadata for all individuals, and specify the individual you want to run in the superFreq() call. It will create subdirectories for each individual, and you can later give the top-level R and plots directories with the metadata file to the cohort analysis and it'll do everything on its own.
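For reference, a single shared metaData.tsv for two individuals might look roughly like the sketch below. Only the INDIVIDUAL and NORMAL columns are confirmed in this thread; the other column names and the paths are placeholders, so check ?superFreq for the exact required format.

```
BAM	VCF	INDIVIDUAL	NAME	NORMAL
bams/p1_tumor.bam	vcfs/p1_tumor.vcf	p1	p1-tumor	NO
bams/p1_normal.bam	vcfs/p1_normal.vcf	p1	p1-normal	YES
bams/p2_tumor.bam	vcfs/p2_tumor.vcf	p2	p2-tumor	NO
bams/p2_normal.bam	vcfs/p2_normal.vcf	p2	p2-normal	YES
```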

I've never found a reason to reduce copy number data to the chromosome-arm level. I know it's common, especially in clinical settings where karyotyping is still used, and many events are whole chromosome arms, but we have higher resolution (typically around 10Mbp for RNA-Seq, as you can see in the paper), and the actual question is usually whether gene X is affected by a copy number change. In the cohort analysis, you can give your favourite genes as input (genes of interest, GoI), or it'll automatically select the most frequently mutated COSMIC census genes in your cohort, and it'll give you the CNA status over those genes. There should also be a meanCNV plot, showing which regions of the genome are frequently affected by CNAs across the cohort.

Good luck, let me know how it goes.

ZheFrench commented 9 months ago

Thank you for your answer. I will take time to look a little at the outputs. My final goal is to cluster my patients using CNV calls, but for now I don't see what to use. I'm really not familiar with this kind of analysis, so sorry for the dumb questions and the long text :/

QUESTIONS ABOUT THE OUTPUT AND THE PLOTS:

Maybe it's described elsewhere and I didn't see it.

The plots in the ./CNV dir are informative, but I'm struggling to understand the size and color of the points. Also:

- clonality: what are the green, blue, and horizontal yellow bars?
- captureRegions: what do the red, black, and blue colors mean?
- CNVcalls: what is the meaning of the different colors in the plot?
- combined: both of the above (there is also the same graphic in the upper directory, named after the sample)

In the cohortWide directory, you generate CNAsummary.pdf with the tumor/normal pairs. That's cool. Where is the produced data that you use for that? I would like to have one plot for all my cancer samples and one for all the matched normals. I'm thinking of going through the code to modify runSummaryPostAnalysis and adapt it for new plots.

Finally, where is the file with the information about which genes are duplicated n times or deleted?

In the data directory, what do the columns of CNAbyGene_* mean?

In the cohort directory, I got plenty of empty directories... I removed the temp dir of the normals. Do you need that as well? Because it's making a lot of symlinks...

QUESTIONS ABOUT THE LOGIC OF THE ALGORITHM:

I'm actually trying to read the paper in depth to understand it better. I have a few questions on the different steps.

SuperFreq monitors B-allele frequency and shifts in coverage compared to the reference normals. What is your definition of B-allele frequency (BAF) here? Is it the same as MAF (minor allele frequency)?

1- SNV filtering

Variants are called from the matched normal sample and then compared to the reference normal samples. This serves to select germline heterozygous SNPs.

2- Somatic SNV calling

SuperFreq calls somatic variants by comparing to the matched normal if present. If the sample is marked as normal, then somatics are called as if no matched normal is present, using databases such as COSMIC, I presume. This somatic SNV call is not used in the CNA call, right?

Variants present in the reference normals are removed from the analysis of somatic SNVs, but common population polymorphisms are retained for CNA calling.

3- LFC (coverage)

The coverage of each cancer sample is compared to the coverage of the reference normal samples. (Here I understand that the matched normal sample is not used, right?) Yet I saw subread reading both BAMs, normal and tumor. What is the purpose of this?

4- Germline heterozygous SNPs

Germline heterozygous SNPs are identified from the SNVs that passed the quality filters. This step is related to step 1, right? Then the algorithm in superFreq determines whether a CNA region deviates from the null hypothesis of 50% allelic balance using these SNP positions. True? Do you do that for the cancer sample and also for the matched normal? (So the coverage LFC is also computed against the reference normals here to get the CNAs; most of the time you'd expect no events for the normal, right?)

5- Finally, the genome is segmented into copy number regions based on the coverage LFC and BAF. The capture regions for each gene are merged and hierarchical clustering is performed. The most similar adjacent segments are merged recursively, with a distance measure comparing LFC and BAF. The ploidy of the sample is then determined from the ratio of the coverage LFC with respect to the reference normals. Different segments are assigned clonality (sample cellular fraction) and copy number call independently.

ChristofferFlensburg commented 9 months ago

Hi!

Happy to see that it's running and you're getting output!

I think some of these questions are covered in the manual (in the "manual" directory here in the github repo). It's old, but most of the information on the output should still be accurate. I'll try to answer some questions on the algorithm, and see if some of the others are covered in the manual; otherwise ask again if still relevant.

BAF in superFreq is the allele frequency of the allele that is below 50% in the studied sample. MAF in databases like dbSNP or ExAC usually refers to the lower population frequency. In the superFreq code and annotation I'm a bit sloppy and BAF and MAF are used interchangeably, but for superFreq analysis it's referring to the lower VAF, i.e. a germline het variant with 65% VAF is assigned a 35% BAF in the CNA analysis.
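That folding of VAF into BAF is just min(VAF, 1 - VAF); a quick sanity check of the 65% → 35% example above:

```shell
# BAF as the sub-50% allele frequency: min(VAF, 1 - VAF).
awk 'BEGIN { vaf = 0.65; baf = (vaf < 0.5) ? vaf : 1 - vaf; print baf }'
# prints 0.35
```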

2: Somatic variants in samples without matched normals are called against dbSNP and ExAC, so just variants that are not flagged by the QC part and are not present, or at low MAF, in the population databases. This will call rare germline variants as somatic variants, which is somewhat mitigated by the clonal tracking, where variants that are consistent with 100% clonality in all samples of an individual (so the tumor and the matched normal in your case) will get a "germline" flag in the somaticVariants output.

The variants used in CNA calling are variants that make it through QC filtering, have a high MAF in population databases (>1% if I recall correctly), and have a VAF between 5% and 95%. The VAF cut is a compromise: catch heterozygous variants even if there is a high-purity LOH event (which can shift the VAF towards 0 or 1 depending on the purity of the event), without catching too much noise from low VAF or homozygous variants too close to VAF=1. This removes sensitivity to CNN LOH present in >90% of the cells in samples without a matched normal, as the heterozygous germline variants will be mostly outside the 5%-95% range, but I added in a cute little rescue mechanism that looks for regions with a significant absence of heterozygous variants and calls those as LOH. We've had a couple of cute cases where that found things we would've missed otherwise, and it can also capture regions of LOH in the normal germline (this can spot consanguineous germline variants/regions, so beware of what the consent and ethics are on looking at germline variants in your cohort).
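The selection described here (QC pass, population MAF > 1%, VAF in the open 5%-95% window) can be illustrated with a toy filter; the three-column input (variant, population MAF, sample VAF) is made up:

```shell
# Keep only variants usable for CNA calling: common in the population
# (MAF > 0.01) and heterozygous-looking in the sample (0.05 < VAF < 0.95).
printf 'var1 0.20 0.50\nvar2 0.001 0.50\nvar3 0.20 0.98\n' |
  awk '$2 > 0.01 && $3 > 0.05 && $3 < 0.95 { print $1 }'
# prints var1
```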

  1. subread is used to get read counts for each gene from the studied samples and the reference normals, which are used in the LFC part of the copy number analysis, where the LFC for each sample is computed against all the reference normals in a 1-vs-many comparison. That LFC is essentially the top panel of the CNA plots. SuperFreq runs copy number calls on both the normal and the tumor, so it needs counts from both samples. The counts from the reference normals are saved to file, so a second run with the same reference normal directory won't have to rerun the reference normal subread; the counts will be loaded in from the previous run.

  2. Accurate, also addressed above.

  3. Yep. The clonality and copy number call for each segment are assigned simultaneously from the LFC and BAF. The maypole plots (fig 7 in the manual) are the best way to understand that. Each segment is treated separately though (unlike most other CNA callers), and segments are only grouped into clones later, with the help of the somatic SNVs, so that they can share information with each other.
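To make point 1 concrete, here is a toy version of the 1-vs-many LFC (not superFreq's actual normalisation, and the counts are invented): one gene's read count in the studied sample against the mean count across the reference normals.

```shell
# log2 fold change of a tumor read count vs the mean of reference normals.
awk 'BEGIN {
  tumor = 400                                     # count in the studied sample
  n = 3; ref[1] = 100; ref[2] = 110; ref[3] = 90  # reference normal counts
  mean = 0; for (i = 1; i <= n; i++) mean += ref[i]; mean /= n
  printf "LFC = %.2f\n", log(tumor / mean) / log(2)
}'
# prints LFC = 2.00
```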

Have a look at the manual as well, and let me know what information is missing, good luck!