Explanation of fields in rivers files

ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data

MIT License

Hi, I am trying to parse the text output files from superFreq in order to get the VAF of specific mutations for each clone. I think the files that are created under the rivers directory should contain this information, but I am having difficulty extracting the VAF of the mutation from these files. For example, for a mutation at chr1:9737640, the normal is 100% A and the tumor is 28% G. superFreq finds (in file -river.tsv):

| chr | start | end | name | clone | sample | sample.1 | severity | type | AApos | AAbefore | AAafter | isCosmicCensus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 9737640 | 9737640 | CLSTN1 (1) intron | 2 | 0.595856955671464 | 0.112332882535099 | 22 | intron | | | | FALSE |

Hey!

The logic is valid. I think sample and sample.1 are the clonalities of the variant in the samples (note: that is the fraction of cells in the sample, not the cancer cell fraction).
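As a rough illustration of reading those per-sample clonalities out of a river file, here is a minimal Python sketch. It assumes the per-sample columns are the ones sitting between `clone` and `severity` in the header, and that the three amino-acid fields are simply empty for this intron variant; both are guesses based on the row in the question, not documented superFreq behaviour. The helper name `read_clonalities` is made up for this example.

```python
import csv
import io

# Header and row copied from the question. The three empty amino-acid fields
# are an assumption about how the row is laid out on disk.
RIVER_TSV = (
    "chr\tstart\tend\tname\tclone\tsample\tsample.1\tseverity\ttype\t"
    "AApos\tAAbefore\tAAafter\tisCosmicCensus\n"
    "1\t9737640\t9737640\tCLSTN1 (1) intron\t2\t0.595856955671464\t"
    "0.112332882535099\t22\tintron\t\t\t\tFALSE\n"
)


def read_clonalities(handle):
    """Map (chr, start) to a dict of per-sample clonalities (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    fields = reader.fieldnames
    # Assumption: sample columns are everything between 'clone' and 'severity'.
    sample_cols = fields[fields.index("clone") + 1 : fields.index("severity")]
    result = {}
    for row in reader:
        key = (row["chr"], int(row["start"]))
        result[key] = {name: float(row[name]) for name in sample_cols}
    return result


clonalities = read_clonalities(io.StringIO(RIVER_TSV))
print(clonalities[("1", 9737640)])
```

For a real file, replace the `io.StringIO(...)` argument with an open file handle, for example `open(path, newline="")`.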

The river output deals with clonalities rather than VAFs, so you won't find VAF information there, although you can reverse-calculate it by matching against the local CNA, as you suggest. It is probably better to look at somaticVariants.xls (or .csv), but that file only has information for the samples where the variant is called, not across all samples. Look at the multisample output as well, which covers all samples and has VAF information. I haven't touched the multisample output in a few years though, so it might be a bit dated. Otherwise, the scatter plots, especially clones.png, can be a good visualization of the VAFs in different clones, but maybe you want the numbers.
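To make the reverse calculation concrete, here is a small Python sketch of the usual textbook approximation. It is not taken from superFreq's code. It assumes every cell carrying the variant has `mutated_copies` mutated alleles, that any CNA at the locus is present in a fraction `cna_clonality` of cells with `total_copies` copies, and that all other cells are diploid. The function name and parameters are made up for this example.

```python
def expected_vaf(variant_clonality, mutated_copies=1, total_copies=2, cna_clonality=0.0):
    """Approximate VAF from a sample-cell-fraction clonality (hypothetical helper).

    Assumptions, not superFreq internals: cells without the CNA have 2 copies
    of the locus, and cells with the variant all carry `mutated_copies` of it.
    """
    # Average number of mutated alleles per cell in the sample.
    mutated = variant_clonality * mutated_copies
    # Average number of alleles at this locus per cell in the sample.
    total = cna_clonality * total_copies + (1.0 - cna_clonality) * 2
    return mutated / total


# Clonalities from the river row in the question, copy-neutral heterozygous case.
for clonality in (0.595856955671464, 0.112332882535099):
    print(f"clonality {clonality:.3f} -> expected VAF {expected_vaf(clonality):.3f}")
```

In the copy-neutral heterozygous case this reduces to VAF = clonality / 2, so a clonality of 0.596 gives about 0.298, close to the 28% reported for the tumor in the question. With a CNA at the locus, pass the copy number and CNA clonality explicitly, for example `expected_vaf(0.6, total_copies=3, cna_clonality=0.6)`.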

The best way to access the raw data otherwise is from the R output in Rdirectory/myIndividual/allVariants.Rdata, which is a nested list where allVariants$variants$variants$mySample is a data frame with all information about all variants that are present in the VCF from any sample in that individual.

Explanation of fields in rivers files #81