broadinstitute / tensorqtl

Ultrafast GPU-enabled QTL mapper
BSD 3-Clause "New" or "Revised" License
160 stars 52 forks source link

Tensorqtl skip some variants in cis mode #163

Closed zhangpicb closed 1 week ago

zhangpicb commented 2 weeks ago

Tensorqtl skip some variants in cis mode,causing some genes become eGene. TUBB3 gene in GTEx Brain_Cortex

Brain_Cortex.v8.egenes.txt.gz Download from gtex portal TUBB3 test 5927 SNPs,qvalue 0.050

gene_id gene_name       gene_chr        gene_start      gene_end        strand  num_var beta_shape1     beta_shape2     true_df pval_true_df    variant_id      tss_distance    chr     variant_pos     ref     alt     num_alt_per_site        rs_id_dbSNP151_GRCh38p7 minor_allele_samples       minor_allele_count      maf     ref_factor      pval_nominal    slope   slope_se        pval_perm       pval_beta       qval    pval_nominal_threshold  log2_aFC        log2_aFC_lower  log2_aFC_upper
ENSG00000258947.6       TUBB3   chr16   89921392        89938761        +       5927    1.05419 511.503 140.121 0.000118315     chr16_89935882_T_C_b38  14490   chr16   89935882        T       C       1       rs1135425       150     200     0.487805        1       2.93434e-05-0.164055       0.0381673       0.0466953       0.0492201       0.0501266       0.000117894     -0.149629       -0.207624       -0.062803

Tenosrqtl TUBB3 test 5923 SNPs,qvalue 0.0477745

phenotype_id    gene_name       gene_chr        gene_start      gene_end        strand  num_var beta_shape1     beta_shape2     true_df pval_true_df    variant_id      start_distance  ma_samples      ma_count        af      pval_nominal    slope   slope_se        pval_perm pval_beta        qval    pval_nominal_threshold
ENSG00000258947.6       TUBB3   chr16   89921392        89938761        +       5923    1.06533 534.597 141.329 0.000110543     chr16_89935882_T_C_b38  14490   150     200     0.487805        2.93434e-05     -0.164055       0.0381673       0.0418958       0.0463013       0.0477745  0.000117133

check bim file used for GTEx Brain_Cortex Tensorqtl cis mode,5927 SNPs are inclued,4 SNPs are not tested

cis-nomial test 5927 SNPs

zcat Brain_Cortex.txt.gz |grep ENSG00000258947.6|wc -l
5927
francois-a commented 1 week ago

These four variants are skipped on purpose in TensorQTL, since they are monomorphic for the 205 Brain_Cortex samples. If you use the --warn_monomorphic flag you'll see this in the log. This unlikely has an effect on the significance of the gene. The small differences in pval_beta and qval that you're seeing between your run and V8 is likely due to a different set of permutations being used. Also make sure you're using --qvalue_lambda 0.85 to match V8 settings.

zhangpicb commented 1 week ago

Thanks for your quick reply!

First,

bcftools filter -e "MAF < 0.01" GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz  -Oz -o GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.MAF001.vcf.gz
tabix -p vcf GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.MAF001.vcf.gz

plink2 --make-bed --keep-allele-order --double-id \
    --output-chr chrM \
    --vcf GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.MAF001.vcf.gz \
    --out GTEx.v8

python3 -m tensorqtl --mode cis \
GTEx.v8 \
Brain_Cortex.v8.normalized_expression.bed.gz \
Brain_Cortex \
--covariates Brain_Cortex.v8.covariates.txt \
--window 1000000 --seed 3042235018 --permutations 10000 --fdr 0.05 --qvalue_lambda 0.85 \
-o GTEx

seed from https://github.com/broadinstitute/tensorqtl/blob/master/example/GTEx_v8_example.ipynb Brain_Cortex related files are download directed from dbgap or gtex portal

And I use FastQTL to run permutation TUBB3 still a eGene,but not a eGene for GTEx portal download file

ENSG00000258947.6       5927    1.05606 500.296 139.718 0.000121028     chr16_89935882_T_C_b38  14490   150     200     0.487805        1       2.93434e-05     -0.164055       0.0381673       0.0474953       0.0489463       0.0466502       0.000131857

Sencond,

TUBB3 5927 SNPs have 7 monomorphic SNPs,I guess Tensorqtl maybe remove long INDEL,not all monomorphic SNPs

chr16_88975138_T_TCCA_b38
chr16_88975138_TGCAG_T_b38
chr16_89158355_C_A_b38
chr16_89158355_C_CTCACACGCCTGCATATGTCACATGTATGTGTATA_b38
chr16_89165405_G_GAGACCTGGATTGCAGGCCAGGCGGCAGC_b38
chr16_89165405_G_GC_b38
chr16_89230997_C_T_b38
chr16_89230997_CCTGGATGGACAGACCTGTGCACCTTCATGTCACGTGTTTGAAATT_C_b38
chr16_89233774_A_ACTGCGTATGCAACTCAG_b38
chr16_89233774_A_G_b38
chr16_89242688_A_C_b38
chr16_89242688_A_G_b38
chr16_89591801_C_A_b38
chr16_89591801_C_T_b38

Thanks in advanced!

francois-a commented 1 week ago

None of the variants you listed are monomorphic (they all have some missingness though). You can get the exact VCF that was used for V8 from AnVIL: https://console.cloud.google.com/storage/browser/_details/fc-secure-ff8156a3-ddf3-42e4-9211-0fd89da62108/GTEx_Analysis_2017-06-05_v8_WGS_VCF_files/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.MAF01.vcf.gz

The example notebook in this repository is to illustrate how V8 can be run, but does not exactly replicate it due to different seeds/software being used.

zhangpicb commented 1 week ago

Thanks for quick reply! I made an incorrect assumption about monomorphic,I think you mean multi allelic variants.I query vcf for Brain_Cortex samples and find 4 monomorphic. ITensorqtl treat variants numbers more carefully than Fastqtl. Thanks a lot!