ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Explanation of fields in rivers files #81

Closed kmavrommatis closed 2 months ago

kmavrommatis commented 3 years ago
Hi, I am trying to parse the text output files from superFreq to get the VAF of specific mutations for each clone. I think the files created under the rivers directory should contain this information, but I am having difficulty extracting the VAF of a mutation from them.

For example, for a mutation at chr1:9737640, the normal is 100% A and the tumor is 28% G. superFreq reports the following in the -river.tsv file:

chr start end name clone sample sample.1 severity type AApos AAbefore AAafter isCosmicCensus
1 9737640 9737640 CLSTN1 (1) intron 2 0.595856955671464 0.112332882535099 22 intron FALSE

This mutation is assigned to clone 2, which has an abundance of 51%. Assuming the position is heterozygous, the expected VAF is ~25%, which is within the expected range. How can I confirm the VAF of this mutation, or rather find out whether it is homozygous or heterozygous in this clone? What do the values under sample and sample.1 mean?

Is this logic valid or am I missing something? Thanks in advance for your help

ChristofferFlensburg commented 3 years ago

Hey!

The logic is valid. I think sample and sample.1 are the clonalities (note sample cell fraction, not cancer cell fraction) of the variant in the samples.
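For anyone who wants to pull those clonality columns out programmatically, here is a minimal Python sketch. It assumes the river file is tab-separated, that the empty AApos/AAbefore/AAafter fields are empty strings, and that every column outside a fixed set of annotation columns is a per-sample clonality. The function name `read_clonalities` and the `FIXED_COLUMNS` set are made up for illustration and are not part of superFreq.

```python
import csv
import io

# Header and row copied from the question above. Tab separation and the
# empty AA fields are assumptions about the actual file layout.
RIVER_TEXT = (
    "chr\tstart\tend\tname\tclone\tsample\tsample.1\tseverity\ttype\t"
    "AApos\tAAbefore\tAAafter\tisCosmicCensus\n"
    "1\t9737640\t9737640\tCLSTN1 (1) intron\t2\t0.595856955671464\t"
    "0.112332882535099\t22\tintron\t\t\t\tFALSE\n"
)

# Assumption: these are annotation columns; anything else is a sample.
FIXED_COLUMNS = {
    "chr", "start", "end", "name", "clone", "severity", "type",
    "AApos", "AAbefore", "AAafter", "isCosmicCensus",
}


def read_clonalities(handle):
    """Return one record per variant with its per-sample clonalities."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_columns = [c for c in reader.fieldnames if c not in FIXED_COLUMNS]
    records = []
    for row in reader:
        records.append({
            "chr": row["chr"],
            "start": int(row["start"]),
            "clone": row["clone"],
            "clonality": {s: float(row[s]) for s in sample_columns},
        })
    return records


records = read_clonalities(io.StringIO(RIVER_TEXT))
print(records[0]["clonality"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper.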

The river output deals with clonalities as opposed to VAFs, so you won't find VAF information there, although you can reverse-calculate it by matching against local CNA as you suggest. Probably better to look at somaticVariants.xls (or .csv), but that file only has information in samples where it's called, not across all samples. Look at multisample as well, which is across all samples but VAF information. I haven't touched the multisample output in a few years though, so that might be a bit dated. The scatter plots, especially clones.png, can be a good visualization of the VAFs in different clones otherwise, but maybe you want the numbers.
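The reverse calculation can be sketched in a few lines of Python. This is a rough illustration, not superFreq's own code: it assumes the expected VAF is the fraction of cells carrying the variant, times the number of mutated copies per carrying cell, divided by the average total copy number at the locus across all cells in the sample. The function name `expected_vaf` and its parameters are hypothetical.

```python
def expected_vaf(cell_fraction, mutant_copies=1, mean_total_copies=2.0):
    """Expected VAF from a sample cell fraction (clonality).

    cell_fraction:     fraction of all cells in the sample carrying the variant
    mutant_copies:     mutated copies per carrying cell (1 = heterozygous
                       in a diploid region; an assumption you must supply)
    mean_total_copies: average total copy number at the locus over all
                       cells in the sample (2.0 when there is no CNA)
    """
    if not 0.0 <= cell_fraction <= 1.0:
        raise ValueError("cell_fraction must be between 0 and 1")
    if mean_total_copies <= 0:
        raise ValueError("mean_total_copies must be positive")
    return cell_fraction * mutant_copies / mean_total_copies


# Heterozygous variant, no CNA, clone abundance of 51% as in the question.
print(round(expected_vaf(0.51), 3))

# Same clone if both copies were mutated in a diploid region.
print(round(expected_vaf(0.51, mutant_copies=2), 3))
```

Comparing the observed VAF with the value for one and for two mutated copies gives a rough indication of zygosity, provided the local copy number is known.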

The best way to access the raw data otherwise is from the R output in Rdirectory/myIndividual/allVariants.Rdata, which is a nested list where allVariants$variants$variants$mySample is a data frame with all information about all variants that are present in the VCF from any sample in that individual.