broadinstitute / tensorqtl

Ultrafast GPU-enabled QTL mapper
BSD 3-Clause "New" or "Revised" License
162 stars 52 forks source link

Discordant num_var between fastQTL and tensorQTL #101

Closed plantformatics closed 1 year ago

plantformatics commented 1 year ago

First off, I'd like to say thank you for developing this software, I've been blown away by the speed! I am trying to figure out why the number of surveyed variants in tensorQTL is drastically lower than fastQTL. num_var for any given phenotype is ~1/4 the value from fastQTL. I used identical parameters (Maf < 0.05) and loaded dosages in both fastQTL (vcf) and tensorQTL (Plink2 psam/pgen/pvar). Are there hard-coded variant filters being applied to tensorQTL that are absent in fastQTL? I get drastically fewer QTL being identified as a result and I'd like to understand why.

francois-a commented 1 year ago

Hi, tensorQTL will filter out monomorphic variants if any are present, but in general the output should be identical to fastQTL. Are you able to provide a small test dataset that reproduces this behavior?

plantformatics commented 1 year ago

I could send a vcf and plink2 binaries for a single phenotype (along with a covariates file) - does this work?

Edit: I should also mention that the plink2 binaries were developed from the VCF which only had dosage information present.

plantformatics commented 1 year ago

@francois-a I sent a collaboration request for a repo containing a subset of data and readme with additional info. Since the main difference between fastQTL and tensorQTL input is the use of plink2 for genotype information, my guess is that the plink2 conversion may be what's responsible. I rarely use plink so I have limited understanding of how the conversion works.

Edit: What is the order of operations for MAF filtering and genotype imputation? When I disable MAF filter on tensorQTL (but still require maf_threshold 0.05 for fastQTL), num_var now more comparable, but still not identical.

francois-a commented 1 year ago

Thanks for sharing the test dataset! The issue is related to missing values — in the current tensorQTL implementation, it is assumed that genotype dosages don't contain missing values; only missing hard calls are imputed (to the mean). So it's not a PLINK2 issue per se, but rather that the missing values (which are read as -9) are not properly handled.

MAF filtering is performed after imputation.

plantformatics commented 1 year ago

Makes sense - thanks for taking a look. This is super helpful!