bogdanlab / fizi

Leveraging functional information to improve GWAS summary statistic imputation
GNU General Public License v3.0

Running sumstat imputation #5

Open marcustutert opened 3 years ago

marcustutert commented 3 years ago

Hi,

I am trying to figure out how to use fizi to perform sumstat imputation. At the moment, I am just using some simulated data from HAPGEN2 and SNPTEST to run some very small toy code to get the hang of this software.

I've managed to 'munge' my sumstats correctly and also have a reference panel in PLINK bfile format that contains many more SNPs in the region than the sumstats file does (these extra SNPs are the ones I would like to impute). However, when I run the impute command I get warnings saying that no GWAS SNPs are found in certain regions.

My command and output are:

fizi impute
    munged_filtered
    test

Starting log...
[2021-09-28 12:10:54 - INFO] Preparing GWAS summary file
[2021-09-28 12:10:54 - INFO] Preparing reference SNP data
[2021-09-28 12:10:54 - INFO] Starting summary statistics imputation with window size 250000.0 and buffer size 250000.0
[2021-09-28 12:10:54 - WARNING] No GWAS SNPs found at snp_0:15800075 - snp_0:16300075. Skipping
[2021-09-28 12:10:54 - WARNING] No GWAS SNPs found at snp_1:15800115 - snp_1:16300115. Skipping
[2021-09-28 12:10:54 - WARNING] No GWAS SNPs found at snp_2:15800213 - snp_2:16300213. Skipping

With these warnings continuing for a number of lines.

Could you help me figure out what it is I am doing wrong here? Best, Marcus

marcustutert commented 3 years ago

Hi, I have figured out that this was due to my .bim file not containing a chr identifier, as I had thought it did.

marcustutert commented 3 years ago

Apologies, but I have now run into a new error that I'm trying to track down:

fizi impute
    munged_filtered
    test
    --start 16050075
    --stop 16139887
    --chr 1
    --min-prop 0.1

Starting log...
[2021-09-28 12:40:21 - INFO] Preparing GWAS summary file
[2021-09-28 12:40:21 - INFO] Preparing reference SNP data
[2021-09-28 12:40:21 - INFO] Starting summary statistics imputation with window size 250000.0 and buffer size 250000.0
[2021-09-28 12:40:21 - INFO] Starting imputation at region 1:16050075 - 1:16136566
[2021-09-28 12:40:21 - ERROR] Cannot convert non-finite values (NA or inf) to integer

Since my sumstats have already been munged, there should be no non-finite values left in them, so I am wondering where this error is being thrown.

quattro commented 3 years ago

Hi @marcustutert, can you re-run with the --verbose flag and report the output? Also, can you load your sumstats in R (or Python) and double-check that there are no NA/NaN/Inf values?
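The check suggested above can be sketched in plain Python as follows. This is a minimal sketch, not part of fizi: the function name `find_non_finite` is hypothetical, and the column names passed to it (for example `Z` and `N`) are assumptions about what the munged file contains, so adjust them to the actual header.

```python
import csv
import gzip
import math


def find_non_finite(path, numeric_cols, sep="\t"):
    """Return (line_number, column, raw_value) for every non-finite entry.

    Hypothetical helper: `numeric_cols` must name columns that exist in the
    file header; missing columns are reported as bad values too.
    """
    # Munged sumstats are often gzipped, so pick the opener from the suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    bad = []
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter=sep)
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            for col in numeric_cols:
                raw = row.get(col)
                try:
                    value = float(raw)
                except (TypeError, ValueError):
                    # "NA", empty strings and missing columns end up here.
                    bad.append((line_no, col, raw))
                    continue
                if not math.isfinite(value):
                    # Parsed as a float, but it is nan, inf or -inf.
                    bad.append((line_no, col, raw))
    return bad
```

Calling `find_non_finite("munged_filtered", ["Z", "N"])` (path and column names assumed) returns an empty list when every listed column is finite on every row.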

marcustutert commented 3 years ago

Hi @quattro, I managed to solve the problem and fixed my pipeline. I think there was an issue on my end with how some of my data was formatted. However, I have a follow-up question about which SNPs FIZI uses for imputation and which SNPs are actually imputed.

I've run the following:

> fizi_impute()
====================================
               FIZI v0.7.2             
====================================
fizi impute
    hapgen2_sim_data/genotyped_sumstats_munged
    hapgen2_sim_data/ref_panel
    --out hapgen2_sim_data/imputed_sumstats
    --min-prop 0.0001
    --verbose

Starting log...
[2021-10-06 17:37:41 - INFO] Preparing GWAS summary file
[2021-10-06 17:37:41 - INFO] Preparing reference SNP data
[2021-10-06 17:37:41 - INFO] Starting summary statistics imputation with window size 250000.0 and buffer size 250000.0
[2021-10-06 17:37:41 - DEBUG] Subsetting GWAS data by 22:15800213 - 22:16550213
[2021-10-06 17:37:41 - DEBUG] Subsetting reference SNP data by 22:15800213 - 22:16550213
[2021-10-06 17:37:41 - INFO] Starting imputation at region 22:16050213 - 22:16548755
[2021-10-06 17:37:41 - DEBUG] Proportion of observed-SNPs / total-SNPs = 0.679
[2021-10-06 17:37:41 - DEBUG] Flipped 0 alleles to match reference
[2021-10-06 17:37:41 - DEBUG] Estimating LD for 477 SNPs
[2021-10-06 17:37:41 - DEBUG] Partitioning LD into quadrants
[2021-10-06 17:37:41 - DEBUG] Computing inverse of variance-covariance matrix for 324 observed SNPs
[2021-10-06 17:37:41 - DEBUG] Imputing 153 SNPs from 324 observed scores
[2021-10-06 17:37:41 - INFO] Completed imputation at region 22:16050213 - 22:16548755
[2021-10-06 17:37:41 - DEBUG] Subsetting GWAS data by 22:16050214 - 22:16800214
[2021-10-06 17:37:41 - DEBUG] Subsetting reference SNP data by 22:16050214 - 22:16800214
[2021-10-06 17:37:41 - INFO] Starting imputation at region 22:16050527 - 22:16696637
[2021-10-06 17:37:41 - DEBUG] Proportion of observed-SNPs / total-SNPs = 0.647
[2021-10-06 17:37:41 - DEBUG] Flipped 0 alleles to match reference
[2021-10-06 17:37:41 - DEBUG] Estimating LD for 691 SNPs
[2021-10-06 17:37:41 - DEBUG] Partitioning LD into quadrants
[2021-10-06 17:37:41 - DEBUG] Computing inverse of variance-covariance matrix for 447 observed SNPs
[2021-10-06 17:37:41 - DEBUG] Imputing 244 SNPs from 447 observed scores
[2021-10-06 17:37:41 - INFO] Completed imputation at region 22:16050527 - 22:16696637
[2021-10-06 17:37:41 - DEBUG] Subsetting GWAS data by 22:16300215 - 22:17050215
[2021-10-06 17:37:41 - DEBUG] Subsetting reference SNP data by 22:16300215 - 22:17050215
[2021-10-06 17:37:41 - INFO] Starting imputation at region 22:16302198 - 22:17050088
[2021-10-06 17:37:41 - DEBUG] Proportion of observed-SNPs / total-SNPs = 0.668
[2021-10-06 17:37:41 - DEBUG] Flipped 0 alleles to match reference
[2021-10-06 17:37:41 - DEBUG] Estimating LD for 1628 SNPs
[2021-10-06 17:37:41 - DEBUG] Partitioning LD into quadrants
[2021-10-06 17:37:41 - DEBUG] Computing inverse of variance-covariance matrix for 1087 observed SNPs
[2021-10-06 17:37:42 - DEBUG] Imputing 541 SNPs from 1087 observed scores
[2021-10-06 17:37:42 - INFO] Completed imputation at region 22:16302198 - 22:17050088
[2021-10-06 17:37:42 - DEBUG] Subsetting GWAS data by 22:16550216 - 22:17300216
[2021-10-06 17:37:42 - DEBUG] Subsetting reference SNP data by 22:16550216 - 22:17300216
[2021-10-06 17:37:42 - INFO] Starting imputation at region 22:16550406 - 22:17099883
[2021-10-06 17:37:42 - DEBUG] Proportion of observed-SNPs / total-SNPs = 0.679
[2021-10-06 17:37:42 - DEBUG] Flipped 0 alleles to match reference
[2021-10-06 17:37:42 - DEBUG] Estimating LD for 1641 SNPs
[2021-10-06 17:37:42 - DEBUG] Partitioning LD into quadrants
[2021-10-06 17:37:42 - DEBUG] Computing inverse of variance-covariance matrix for 1115 observed SNPs
[2021-10-06 17:37:43 - DEBUG] Imputing 526 SNPs from 1115 observed scores
[2021-10-06 17:37:43 - INFO] Completed imputation at region 22:16550406 - 22:17099883
[2021-10-06 17:37:43 - DEBUG] Subsetting GWAS data by 22:16800217 - 22:17349965
[2021-10-06 17:37:43 - DEBUG] Subsetting reference SNP data by 22:16800217 - 22:17349965
[2021-10-06 17:37:43 - INFO] Starting imputation at region 22:16847937 - 22:17099883
[2021-10-06 17:37:43 - DEBUG] Proportion of observed-SNPs / total-SNPs = 0.695
[2021-10-06 17:37:43 - DEBUG] Flipped 0 alleles to match reference
[2021-10-06 17:37:43 - DEBUG] Estimating LD for 1427 SNPs
[2021-10-06 17:37:43 - DEBUG] Partitioning LD into quadrants
[2021-10-06 17:37:43 - DEBUG] Computing inverse of variance-covariance matrix for 992 observed SNPs
[2021-10-06 17:37:43 - DEBUG] Imputing 435 SNPs from 992 observed scores
[2021-10-06 17:37:43 - INFO] Completed imputation at region 22:16847937 - 22:17099883
[2021-10-06 17:37:43 - INFO] Finished summary statistic imputation

By my count, 1899 SNPs are imputed according to the logs here (153 + 244 + 541 + 526 + 435). However, my resulting imputed file has 2499 observations, of which 678 are of type IMPUTED and the remaining 1821 are of type GWAS. Can you explain where this discrepancy comes from? At the moment I am trying to understand exactly what filtering goes on behind the scenes in this tool, and it's not clear to me what process is occurring.

opain commented 2 years ago

Hi @marcustutert,

Could you please explain how you resolved the 'Cannot convert non-finite values (NA or inf) to integer'? I am currently struggling with the same issue. Perhaps your solution would also fix my problem.

Thank you,

Ollie

quattro commented 2 years ago

Hi @marcustutert, I -think- what's happening here is that imputation occurs within a buffered window, but the output is then filtered down to the non-buffered window. As a result, a variant may be imputed multiple times but appear only once in the final output.

That is, if a reference variant falls within a buffered window (but not within the actual start/stop window) and has no GWAS signal, its imputed z-score will be calculated and then filtered out (see lines 87-92 in impute.py). At the next window, when the variant falls within the actual start/stop, its imputed z-score will be kept.

This logic can be corrected, as it means we're imputing the missing value several times until its proper window is reached.
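The counting effect described above can be illustrated with a toy model. This is a sketch under stated assumptions, not fizi's actual code: `sliding_imputation` and all of its parameters are hypothetical, positions are plain integers, and a window is assumed to be imputed whenever its buffered region contains at least one observed SNP.

```python
def sliding_imputation(ref_positions, observed, start, stop, window, buffer):
    """Count how often unobserved variants are imputed versus kept.

    ref_positions: positions present in the reference panel
    observed:      set of positions that have GWAS z-scores
    Returns (imputed_calls, kept), where imputed_calls is the sum a log
    would report across windows and kept is the number of distinct
    imputed variants that survive the non-buffered filter.
    """
    imputed_calls = 0
    kept = set()
    win_start = start
    while win_start <= stop:
        win_stop = min(win_start + window, stop)
        # Imputation uses the buffered region on both sides of the window.
        lo, hi = win_start - buffer, win_stop + buffer
        in_buffer = [p for p in ref_positions if lo <= p <= hi]
        missing = [p for p in in_buffer if p not in observed]
        if any(p in observed for p in in_buffer):
            # Every missing variant in the buffered region is imputed ...
            imputed_calls += len(missing)
            # ... but only those inside the core window reach the output.
            kept.update(p for p in missing if win_start <= p <= win_stop)
        win_start = win_stop + 1
    return imputed_calls, len(kept)


# Toy panel: 100 reference SNPs, every second one observed in the GWAS.
calls, kept = sliding_imputation(
    ref_positions=list(range(0, 1000, 10)),
    observed=set(range(0, 1000, 20)),
    start=0, stop=999, window=249, buffer=250,
)
print(calls, kept)  # 125 imputation calls, 50 variants kept
```

In this toy setup each missing variant is imputed in two or three overlapping buffered regions but kept once, which is the same kind of gap as the 1899 logged imputations versus 678 IMPUTED rows reported earlier in the thread.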

harleyi commented 2 years ago

Hi @opain, if the min-prop parameter is too low, regions with only one genotyped marker available for imputation are still processed, and this gives an error. We are looking into the issue to figure out whether this is the usual reason for this error. Increasing the min-prop parameter (e.g. from 0.001 to 0.1) skips those regions and resolves the issue.
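The DEBUG lines earlier in this thread report a proportion of observed to total SNPs for each region (for example 324 observed out of 477, logged as 0.679). A minimal sketch of how a min-prop style threshold could gate a region, assuming a region is skipped when that proportion falls below the threshold; the function name `should_impute` is hypothetical and this is not fizi's actual implementation:

```python
def should_impute(n_observed, n_total, min_prop):
    """Return (proportion, decision) for one region.

    Assumption: a region is imputed only if observed/total >= min_prop.
    """
    if n_total == 0:
        # Nothing in the reference panel for this region.
        return 0.0, False
    prop = n_observed / n_total
    return prop, prop >= min_prop


# Region from the log above: 324 observed SNPs out of 477 in total.
print(should_impute(324, 477, 0.0001))

# A region with a single observed SNP passes a very low threshold
# but is skipped once the threshold is raised to 0.1.
print(should_impute(1, 500, 0.001)[1], should_impute(1, 500, 0.1)[1])
```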

opain commented 2 years ago

Hi @harleyi

Thank you for your concern. @quattro kindly fixed my issue already. It was due to some SNPs in my LD reference being heterozygous in all individuals, leading to a variance of 0. After removing these variants using plink --max-maf 0.49, I no longer received the error.
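To see why such SNPs are a problem: a SNP that is heterozygous in every individual has dosage 1 everywhere, so its variance is 0 and standardizing it divides by zero. The sketch below flags these SNPs in a small in-memory panel. The function names and the dict-of-lists layout are hypothetical, chosen only for illustration; this is not how fizi or PLINK stores genotypes.

```python
import statistics


def zero_variance_snps(genotypes):
    """Return the ids of SNPs whose dosage column has no variance.

    genotypes: dict mapping SNP id -> list of 0/1/2 dosages, one per individual
    """
    return [snp for snp, dosages in genotypes.items()
            if statistics.pvariance(dosages) == 0]


def standardize(dosages):
    """Center and scale one dosage column; fail loudly when the variance is 0."""
    mean = statistics.fmean(dosages)
    sd = statistics.pstdev(dosages)
    if sd == 0:
        raise ValueError("zero variance: standardized genotypes would be NaN")
    return [(g - mean) / sd for g in dosages]


panel = {
    "rs_ok": [0, 1, 2, 1],    # varies across individuals
    "rs_het": [1, 1, 1, 1],   # heterozygous in everyone, variance 0
}
print(zero_variance_snps(panel))  # ['rs_het']
```

An all-heterozygous SNP has an allele frequency of exactly 0.5, which is why the --max-maf 0.49 filter mentioned above removes it.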

Best wishes,

Ollie

celisungmail commented 2 years ago

Hi @opain @meganroytman, what is A1 in the input file? What does "--a1-inc A1 is the increasing allele." in the help manual mean? Is A1 the effect (risk) allele, the minor allele, or both?

Thanks, Charlie

saramonteiromartins commented 2 years ago

Hi @quattro, I am also getting the "Cannot convert non-finite values (NA or inf) to integer" error.

Can you help me find the problem? I checked my gwas.sumstats file and I do not have any NA or Inf values.

fizi impute cleaned.gwas.sumstats.gz LDref/chr22 --chr 22 --verbose --out imputed.cleaned.gwas

End of the log file:

[2022-09-01 19:04:12 - INFO] Starting imputation at region 22:19335739 - 22:20084376
[2022-09-01 19:04:12 - DEBUG] Proportion of observed-SNPs / total-SNPs = 0.601
[2022-09-01 19:04:12 - DEBUG] Flipped 742 alleles to match reference
[2022-09-01 19:04:12 - DEBUG] Estimating LD for 2056 SNPs
[2022-09-01 19:04:13 - DEBUG] Partitioning LD into quadrants
[2022-09-01 19:04:13 - DEBUG] Computing inverse of variance-covariance matrix for 1236 observed SNPs
[2022-09-01 19:04:13 - DEBUG] Imputing 820 SNPs from 1236 observed scores
[2022-09-01 19:04:13 - ERROR] Cannot convert non-finite values (NA or inf) to integer
[2022-09-01 19:04:13 - INFO] Finished summary statistic imputation

P.S. This happens for every chromosome.

Thank you :)