HKU-BAL / ClairS

ClairS - a deep-learning method for long-read somatic small variant calling
BSD 3-Clause "New" or "Revised" License

Adding Normal Sample GT to the VCF file #27

Closed bcantarel closed 3 months ago

bcantarel commented 5 months ago

Would it be possible to add the normal sample GT/DP/AO to the somatic VCF, just for comparison? For example, you can imagine having a few "alt reads" in the normal sample compared to 5% or 10% in the tumor, which might be much more. We use this to filter out possible FPs. Alternatively, could you output a normal VCF with the ref calls for the same positions as in the somatic file? Those could be merged with BCFtools.

Thanks!

zhengzhenxian commented 5 months ago

@bcantarel

Many thanks for your suggestions!

We added FORMAT fields that output the normal sample information, including the read depth and the base counts in the normal BAM:

##FORMAT=<ID=NDP,Number=1,Type=Integer,Description="Read depth in the normal BAM">
##FORMAT=<ID=NAU,Number=1,Type=Integer,Description="Count of A in the normal BAM">
##FORMAT=<ID=NCU,Number=1,Type=Integer,Description="Count of C in the normal BAM">
##FORMAT=<ID=NGU,Number=1,Type=Integer,Description="Count of G in the normal BAM">
##FORMAT=<ID=NTU,Number=1,Type=Integer,Description="Count of T in the normal BAM">

These fields have been available since v0.1.1. In v0.1.7 we also added the per-strand base counts (the FAU, FCU, FGU, FTU, RAU, RCU, RGU, and RTU tags). You can use these fields directly for filtering or checking.

As for the GT in the normal sample: the reported candidates are selected to have low alternate read support in the normal sample, and we feed the tumor-normal pair data into the neural network to decide the genotype jointly. Hence, for the reported candidates, the GT in the normal sample is considered to be '0/0'. Please let us know if you have any thoughts on this.
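As a rough illustration of filtering on these fields, here is a minimal Python sketch that reads NDP and the NAU/NCU/NGU/NTU counts from a record's FORMAT and sample columns and checks the ALT support in the normal sample. The function names and both thresholds (`max_alt_reads`, `max_naf`) are hypothetical and not part of ClairS. The sketch handles single-base ALT alleles only, and the record used is one of the sample lines posted later in this thread.

```python
def normal_alt_support(alt, format_col, sample_col):
    """Return (alt_count, depth) in the normal BAM for a single-base ALT."""
    if alt not in ("A", "C", "G", "T"):
        raise ValueError("this sketch only handles single-base ALT alleles")
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    depth = int(fields["NDP"])            # read depth in the normal BAM
    alt_count = int(fields[f"N{alt}U"])   # count of the ALT base in the normal BAM
    return alt_count, depth


def passes_normal_filter(alt, format_col, sample_col,
                         max_alt_reads=1, max_naf=0.02):
    """Hypothetical filter: keep records with little ALT support in the normal."""
    alt_count, depth = normal_alt_support(alt, format_col, sample_col)
    naf = alt_count / depth if depth else 0.0
    return alt_count <= max_alt_reads and naf <= max_naf


FORMAT = "GT:GQ:DP:AF:AD:NAF:NDP:NAD:AU:CU:GU:TU:NAU:NCU:NGU:NTU"
SAMPLE = "0/1:13:114:0.2105:0,24:0.0000:32:0,0:0:24:90:0:0:0:32:0"

print(normal_alt_support("C", FORMAT, SAMPLE))    # (0, 32)
print(passes_normal_filter("C", FORMAT, SAMPLE))  # True
```

The thresholds are placeholders. Sensible values depend on coverage and on how much tumor contamination is expected in the normal sample.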

bcantarel commented 5 months ago

So how would that look in the data line? I.e., would the tumor and normal samples have different formats in the same VCF, or would there be another VCF file for the normal sample?

zhengzhenxian commented 5 months ago

Yes, we combined the tumor and normal tags into the FORMAT column. Here are some lines for reference:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
chr17   80000548        .       G       A       6.885   LowQual H;FAU=15;FCU=0;FGU=43;FTU=0;RAU=13;RCU=0;RGU=48;RTU=0   GT:GQ:DP:AF:AD:NAF:NDP:NAD:AU:CU:GU:TU:NAU:NCU:NGU:NTU  0/1:6:119:0.2353:0,28:0.0000:39:0,0:28:0:91:0:0:0:39:0
chr17   80003901        .       G       C       13.060  PASS    H;FAU=0;FCU=13;FGU=47;FTU=0;RAU=0;RCU=11;RGU=43;RTU=0   GT:GQ:DP:AF:AD:NAF:NDP:NAD:AU:CU:GU:TU:NAU:NCU:NGU:NTU  0/1:13:114:0.2105:0,24:0.0000:32:0,0:0:24:90:0:0:0:32:0
chr17   80005657        .       G       A       15.712  PASS    H;FAU=12;FCU=0;FGU=42;FTU=0;RAU=10;RCU=0;RGU=29;RTU=0   GT:GQ:DP:AF:AD:NAF:NDP:NAD:AU:CU:GU:TU:NAU:NCU:NGU:NTU  0/1:15:93:0.2366:0,22:0.0000:28:0,0:22:0:71:0:0:0:28:0

Splitting out a normal VCF has not been implemented yet, but we will consider it for a future release.
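In the meantime, a post-processing step along the following lines could turn one of the single-sample records above into a two-column tumor/normal record. This is only a sketch, not a ClairS feature. `split_tumor_normal` is a hypothetical helper. It assumes that NAF and NAD are the normal-sample counterparts of AF and AD (their header lines are not shown in this thread), and it sets the normal GT to 0/0 as described earlier.

```python
def split_tumor_normal(record_line):
    """Rewrite a single-sample record as tumor + normal sample columns."""
    cols = record_line.split()            # 8 fixed columns, FORMAT, SAMPLE
    fixed, fmt, sample = cols[:8], cols[8], cols[9]
    tags = dict(zip(fmt.split(":"), sample.split(":")))
    tumor = ":".join([tags["GT"], tags["DP"], tags["AF"], tags["AD"]])
    # Normal GT is taken as 0/0 for every reported candidate (see above).
    normal = ":".join(["0/0", tags["NDP"], tags["NAF"], tags["NAD"]])
    return "\t".join(fixed + ["GT:DP:AF:AD", tumor, normal])


LINE = (
    "chr17 80003901 . G C 13.060 PASS "
    "H;FAU=0;FCU=13;FGU=47;FTU=0;RAU=0;RCU=11;RGU=43;RTU=0 "
    "GT:GQ:DP:AF:AD:NAF:NDP:NAD:AU:CU:GU:TU:NAU:NCU:NGU:NTU "
    "0/1:13:114:0.2105:0,24:0.0000:32:0,0:0:24:90:0:0:0:32:0"
)

print(split_tumor_normal(LINE))
```

A complete file would also need the `#CHROM` header line extended with a second sample name, and the `##FORMAT` header lines trimmed to the tags that remain.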