freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License
131 stars 22 forks source link

VCF lines lacking GT tag #50

Closed kasia-te closed 1 year ago

kasia-te commented 1 year ago

Dear freeseek,

I installed the gtc2vcf plugin yesterday in docker: https://gitlab.com/intelliseq/workflows/-/blob/BIOINFO-998-genotype-source/src/main/docker/task/task_gtc-to-vcf/Dockerfile (the reference is added later). The plugin works without raising any error, but some vcf lines don't have the GT tag:

chr1    30345446        22:24375752_CNV_GSTT1   A       C       .       .       GC=0.4625;ALLELE_A=0;ALLELE_B=1;FRAC_A=0.360656;FRAC_C=0.262295;FRAC_G=0.229508;FRAC_T=0.147541;NORM_ID=1;BEADSET_ID=1705;INTENSITY_ONLY;ASSAY_TYPE=0;GenTrain_Score=0;Orig_Score=0.68275;Cluster_Sep=0.948275;N_AA=1236;N_AB=0;N_BB=0;devR_AA=0.30742;devR_AB=0.39422;devR_BB=0.20131;devTHETA_AA=0.0121041;devTHETA_AB=0.0223607;devTHETA_BB=0.0223607;meanR_AA=2.73401;meanR_AB=3.4935;meanR_BB=2.30665;meanTHETA_AA=0.130089;meanTHETA_AB=0.554171;meanTHETA_BB=0.978252;Intensity_Threshold=0.05     GQ:IGC:BAF:LRR:NORMX:NORMY:R:THETA:X:Y   0:0:0.0246714:-0.31685:1.76758:0.427339:2.19492:0.151014:32616:2228     0:0:0.0246714:-0.31685:1.76758:0.427339:2.19492:0.151014:32616:2228
chr1    109685814       1:110228436_CNV_GSTM1   T       C       .       .       GC=0.385;ALLELE_A=0;ALLELE_B=1;FRAC_A=0.147541;FRAC_C=0;FRAC_G=0.180328;FRAC_T=0.672131;NORM_ID=0;BEADSET_ID=1625;INTENSITY_ONLY;ASSAY_TYPE=0;GenTrain_Score=0;Orig_Score=0.376871;Cluster_Sep=0.173743;N_AA=0;N_AB=0;N_BB=1239;devR_AA=0.1;devR_AB=0.1;devR_BB=0.1;devTHETA_AA=0.0223607;devTHETA_AB=0.0223607;devTHETA_BB=0.140788;meanR_AA=0.17845;meanR_AB=0.194985;meanR_BB=0.207459;meanTHETA_AA=0.0145364;meanTHETA_AB=0.297995;meanTHETA_BB=0.581454;Intensity_Threshold=0.05     GQ:IGC:BAF:LRR:NORMX:NORMY:R:THETA:X:Y  0:0:1.28708:-0.170678:0.0549625:0.129349:0.184312:0.744207:1017:614      0:0:1.28708:-0.170678:0.0549625:0.129349:0.184312:0.744207:1017:614

The program is run with this wdl: https://gitlab.com/intelliseq/workflows/-/blob/dev/src/main/wdl/tasks/gtc-to-vcf/gtc-to-vcf.wdl

Is it intentional? This has not happened with the previous installation (bcftools11-54-gaf54707, htslib1.11-74-gb8dcbd1 and gtc2vcf cloned on 2021-01-20). Best, Kasia

freeseek commented 1 year ago

Yes, it is intentional and the reason is that variants with the INTENSITY_ONLY flag are not genotyped by Illumina GenCall as they are not expected to be polymorphic, even if an alternate allele is provided

Also, notice that for these variants the computation of the LRR is slightly different as shown by the code here:

get_baf_lrr(ilmn_theta, ilmn_r, cluster_record->aa_cluster_stats.theta_mean,
  cluster_record->ab_cluster_stats.theta_mean,
  cluster_record->bb_cluster_stats.theta_mean, cluster_record->aa_cluster_stats.r_mean,
  cluster_record->ab_cluster_stats.r_mean, cluster_record->bb_cluster_stats.r_mean,
  locus_entry->intensity_only ? cluster_record->r_mean : NAN, &baf, &lrr);

which calls the function:

static inline void get_baf_lrr(float ilmn_theta, float ilmn_r, float aa_theta, float ab_theta, float bb_theta,
                               float aa_r, float ab_r, float bb_r, float r_mean, float *baf, float *lrr) {
    float r_ref;
    if (ilmn_theta == ab_theta) {
        r_ref = ab_r;
        *baf = 0.5f;
    } else if (ilmn_theta < ab_theta) {
        float slope = (aa_r - ab_r) / (aa_theta - ab_theta);
        float b = aa_r - (aa_theta * slope);
        r_ref = (slope * ilmn_theta) + b;
        *baf = 0.5f - (ab_theta - ilmn_theta) * 0.5f / (ab_theta - aa_theta);
    } else if (ilmn_theta > ab_theta) {
        float slope = (ab_r - bb_r) / (ab_theta - bb_theta);
        float b = ab_r - (ab_theta * slope);
        r_ref = (slope * ilmn_theta) + b;
        *baf = 1.0f - (bb_theta - ilmn_theta) * 0.5f / (bb_theta - ab_theta);
    } else {
        *lrr = -NAN;
        *baf = -NAN;
        return;
    }
    // for non-polymorphic (Illumina) markers we compute the LRR using the clusters mean
    *lrr = logf(ilmn_r / (isnan(r_mean) ? r_ref : r_mean)) * (float)M_LOG2E;
}

This aligns the LRR computation with the LRR computation performed by the Illumina iaap-cli software

kasia-te commented 1 year ago

Thank you for the response. I that case I can just remove these lines from the downstream vcfs.

All the best, Kasia