ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Columns of CNAbyGene.tsv file #125

Closed mlegarreta00 closed 1 month ago

mlegarreta00 commented 2 months ago

Good morning, I was wondering if someone knows what the different columns mean in the CNAbyGene_{sample}.tsv file (the M column, the width column, etc.). Thank you in advance.

ChristofferFlensburg commented 2 months ago

chr, start, end, gene: should be self-explanatory.
x1, x2: single genome-wide coordinate that superFreq uses internally, running from 1 to ~3B across all chromosomes.
M, width: log fold change and uncertainty of the read count with respect to the reference normals.
df: degrees of freedom of the t distribution used to model the log fold change and error above (from limma-voom).
var, cov, Nsnps: across all heterozygous germline variants in the gene, the number of minor (as in lowest VAF) allele counts (var), the total read depth (cov) and the number of variants (Nsnps).
pHet, pAlt, odsHet: p value for the null hypothesis of balanced alleles (pHet), p value for the null hypothesis of an allelic balance of var/cov (pAlt), and the odds between the two hypotheses (odsHet).
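As a rough illustration of how these columns might be used downstream, here is a minimal Python sketch that reads a CNAbyGene-style table and derives two quantities per gene: a linear fold change from M and the minor allele fraction var/cov. The rows in `EXAMPLE_TSV` are made up for illustration, the helper names are hypothetical, and treating M as a log2 value is an assumption (conventional for limma-voom, but not stated above). Real files may also encode missing values differently, so the parsing may need adjusting.

```python
import csv
import io

# Made-up rows for illustration only; these are NOT real superFreq output.
EXAMPLE_TSV = (
    "chr\tstart\tend\tgene\tM\twidth\tvar\tcov\tNsnps\n"
    "1\t1000\t2000\tGENE_A\t-1.0\t0.2\t30\t100\t3\n"
    "2\t5000\t9000\tGENE_B\t0.0\t0.1\t0\t0\t0\n"
)


def summarise_gene(row, log_base=2.0):
    """Derive simple per-gene summaries from one table row.

    Assumes M is a log fold change in base `log_base` (log2 is an assumption).
    """
    m = float(row["M"])
    var = int(row["var"])
    cov = int(row["cov"])
    return {
        "gene": row["gene"],
        # Linear fold change of read count relative to the reference normals.
        "fold_change": log_base ** m,
        # Minor allele fraction over heterozygous germline SNPs; undefined
        # when no reads cover any such SNP in the gene.
        "minor_allele_fraction": var / cov if cov > 0 else None,
    }


def read_cna_by_gene(handle):
    """Read a tab-separated CNAbyGene-style table from an open text handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [summarise_gene(row) for row in reader]


summaries = read_cna_by_gene(io.StringIO(EXAMPLE_TSV))
for summary in summaries:
    print(summary)
```

To run this on an actual file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open(path, newline="")`, where `path` points at your own CNAbyGene_{sample}.tsv.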

There are some pretty involved stats going on for both the read depth and the BAFs. I believe that should be somewhat covered in the manual, and otherwise in the paper.