cancerit / ascatNgs

Somatic copy number analysis using paired-end whole-genome sequencing
http://cancerit.github.io/ascatNgs/
GNU Affero General Public License v3.0

GRCh38 Reference file to generate normal BAF values #111

Closed hannanw closed 2 years ago

hannanw commented 2 years ago

Hi,

I am trying to obtain the BAF values from my normal BAM file after running it through alleleCounter.pl. Which reference file should I use to calculate the BAF values? I tried the file qcGenotype_GRCh38_hla_decoy_ebv/verifyBamID_snps.vcf.gz from the GRCh38 reference file bundle, but quite a few probes that are present in the alleleCounter output are missing from that file. Could you point me to the appropriate reference file for calculating the BAF values for all the probes present in my normal BAM file? Thanks!

Warmest regards, Hannan

keiranmraine commented 2 years ago

If you run the ascat.pl script, the BAF values for the analysis are output in *.ascat_ngs.cn.tsv.gz. Please note this package is intended for Illumina-style paired-end whole-genome sequencing with tumour/normal pairs only. If you want more generic use, please see the ASCAT R library.

https://github.com/VanLoo-lab/ascat

hannanw commented 2 years ago

Hi,

Thanks for the speedy reply. However, the BAF values are only for the tumor sample, and I am interested in the BAF values for the normal sample. Do you know how I can obtain them? Thanks!

Warmest regards, Hannan

keiranmraine commented 2 years ago

You would need to go back to working with the underlying R library, as far as I am aware. We only really support the wrapper and counting code.

However, I think you are looking for the ascat/SnpGcCorrections.tsv file, which is part of the CNV_SV_ref_GRCh38_hla_decoy_ebv_brass6+.tar.gz bundle.

wget ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/GRCh38_hla_decoy_ebv/CNV_SV_ref_GRCh38_hla_decoy_ebv_brass6+.tar.gz

hannanw commented 2 years ago

Yup, I'm interested in the counting code, because to get the BAF value for a particular probe I need to know which allele is the reference and which is the variant. For example, take the alleleCounter output for my normal BAM file below:

| #CHR | POS | Count_A | Count_C | Count_G | Count_T | Good_depth |
| -- | -- | -- | -- | -- | -- | -- |
| chr1 | 95440 | 0 | 0 | 0 | 0 | 0 |
| chr1 | 104186 | 0 | 1 | 0 | 9 | 10 |
| chr1 | 122872 | 0 | 0 | 2 | 5 | 7 |
| chr1 | 125271 | 0 | 0 | 0 | 0 | 0 |
| chr1 | 135982 | 0 | 0 | 0 | 0 | 0 |

For chr1 positions 104186 and 122872, I need to know which allele is the reference and which is the variant in order to calculate the BAF for the corresponding probe. I am looking for a file similar to qcGenotype_GRCh38_hla_decoy_ebv/verifyBamID_snps.vcf.gz, which looks like this:

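|  | CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | chr1 | 629241 | rs10458597 | C | T | . | PASS | AF=0.01572 |
| 1 | chr1 | 629393 | rs9629043 | C | T | . | PASS | AF=0.05000 |
| 2 | chr1 | 632373 | rs11510103 | A | G | . | PASS | AF=0.05000 |
| 3 | chr1 | 785910 | rs12565286 | G | C | . | PASS | AF=0.05573 |
| 4 | chr1 | 805477 | rs12082473 | G | A | . | PASS | AF=0.07838 |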

The ascat/SnpGcCorrections.tsv file only contains the probe name, chromosome and position, but I need the allele information. I guess what I am asking for is the file that is used to calculate the BAF values in your wrapper code before they are passed on to the base ASCAT code in R. Hope this clarifies things, thanks!
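To make the dependence on allele assignment concrete, here is a minimal Python sketch of the arithmetic. It assumes BAF is defined as alt count / (ref count + alt count); the function name `baf` and the REF/ALT assignments are hypothetical and for illustration only, not taken from ascatNgs.

```python
from typing import Dict, Optional


def baf(counts: Dict[str, int], ref: str, alt: str) -> Optional[float]:
    """Return alt / (ref + alt), or None when neither allele has coverage.

    Assumed definition of BAF; not necessarily what ascatNgs computes.
    """
    total = counts[ref] + counts[alt]
    if total == 0:
        return None
    return counts[alt] / total


# Row for chr1:104186 from the alleleCounter output above.
row = {"A": 0, "C": 1, "G": 0, "T": 9}

# Hypothetical assignments: swapping REF and ALT flips the value around 0.5,
# which is why the allele information is needed.
print(baf(row, ref="T", alt="C"))
print(baf(row, ref="C", alt="T"))
```

Rows with zero depth at both alleles (for example chr1:95440) return None here rather than raising a division error.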

Warmest regards, Hannan

keiranmraine commented 2 years ago

The ASCAT R function is never informed of the reference base; genome.fa is only used in a few places unrelated to ASCAT's interpretation. This hack will pull the reference base at each locus into a file, but further processing would be required:

samtools faidx genome.fa -r <(tail -n +2 ascat/SnpGcCorrections.tsv | perl -ane 'printf qq{%s:%d-%d\n}, $F[1],$F[2],$F[2]') > loci.txt

Gives:

>chr1:13116-13116
T
>chr1:15274-15274
A
...

As indicated above, please contact the authors of the R library if you require further information.
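One possible shape for that further processing, as a rough Python sketch: read the faidx output into a lookup keyed by chromosome and position, then pick a candidate variant allele from the alleleCounter row. The function names `parse_faidx` and `guess_alt` are hypothetical, and treating the best-supported non-reference base as the variant is an assumption made here, not something ascatNgs or ASCAT is known to do.

```python
from typing import Dict, Optional, Tuple


def parse_faidx(text: str) -> Dict[Tuple[str, int], str]:
    """Map (chrom, pos) -> reference base from single-base faidx output."""
    ref: Dict[Tuple[str, int], str] = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Split on the last colon so contig names containing ':' survive.
            chrom, span = line[1:].rsplit(":", 1)
            key = (chrom, int(span.split("-")[0]))
        elif key is not None:
            ref[key] = line.upper()
            key = None
    return ref


def guess_alt(counts: Dict[str, int], ref_base: str) -> Optional[str]:
    """Assumption: the variant is the best-supported non-reference base."""
    others = {b: n for b, n in counts.items() if b != ref_base and n > 0}
    if not others:
        return None
    return max(others, key=others.get)


# The two records shown in the example output above.
sample = ">chr1:13116-13116\nT\n>chr1:15274-15274\nA\n"
ref_bases = parse_faidx(sample)
print(ref_bases[("chr1", 13116)])
```

In practice the text would come from loci.txt, and a locus with no non-reference reads yields no candidate variant, so a proper SNP resource with REF/ALT columns remains the more reliable source.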

hannanw commented 2 years ago

Ah okay, that looks like something I can work with. Thanks for the help!