Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

SNPs not found on reference when using (default)`flip_check` even though they do exist #140

Closed jksull closed 1 year ago

jksull commented 1 year ago

1. Bug description

Take the following input data:

chr POS ALT REF ES  SE  AF  ID  P
1   56693574    C   G   0.00619883  0.00792169  0.100044    rs1149656   0.4500005038084225
1   67392889    A   G   0.001181    0.00418459  0.65064 rs11208973  0.9699999234660868
1   216914249   T   C   0.00294312  0.00472115  0.488433    rs11572656  0.5199995885510125
4   20211941    A   G   0.00417027  0.00403417  0.564946    rs1165448   0.17000003077564924
4   37470379    C   T   2.41051e-4  0.00401689  0.488067    rs1015232   0.8499999496720565
11  123473031   A   G   -0.00145969 0.00583595  0.207721    rs1148091   0.7700004872667389
12  43719461    A   G   0.00195281  0.00414411  0.626794    rs10880454  0.7300002351217616
14  23567761    G   A   -0.00765548 0.00552041  0.156193    rs1059379   0.4000000079872419
15  95446685    A   C   -0.00297329 0.00407919  0.60151 rs1001682   0.5400002986352308

After using default parameters (MSS 1.7.8), GRCh37, and dbSNP 144), all of the above variants get output to alleles_dont_match_ref_gen.tsv.

alleles_dont_match_ref_gen.tsv

CHR BP  A2  A1  BETA    SE  FRQ SNP P   ref_gen_allele  match_type
15  95446685    A   C   -0.00297329 0.00407919  0.60151 rs1001682   0.540000298635231   T   NA
4   37470379    C   T   0.000241051 0.00401689  0.488067    rs1015232   0.849999949672057   A   NA
14  23567761    G   A   -0.00765548 0.00552041  0.156193    rs1059379   0.400000007987242   C   NA
12  43719461    A   G   0.00195281  0.00414411  0.626794    rs10880454  0.730000235121762   C   NA
1   67392889    A   G   0.001181    0.00418459  0.65064 rs11208973  0.969999923466087   T   NA
11  123473031   A   G   -0.00145969 0.00583595  0.207721    rs1148091   0.770000487266739   T   NA
1   56693574    C   G   0.00619883  0.00792169  0.100044    rs1149656   0.450000503808423   A   NA
1   216914249   T   C   0.00294312  0.00472115  0.488433    rs11572656  0.519999588551013   A   NA
4   20211941    A   G   0.00417027  0.00403417  0.564946    rs1165448   0.170000030775649   T   NA

Expected behaviour

However, by simply turning the allele_flip_check = FALSE, all of the variants are indeed found on the reference:

TEST-DATA.FLIPCHECK-FALSE.munged.tsv

SNP CHR BP  A1  A2  BETA    SE  FRQ P
rs1149656   1   56693574    G   C   0.00619883  0.00792169  0.100044    0.450000503808423
rs11208973  1   67392889    G   A   0.001181    0.00418459  0.65064 0.969999923466087
rs11572656  1   216914249   C   T   0.00294312  0.00472115  0.488433    0.519999588551013
rs1165448   4   20211941    G   A   0.00417027  0.00403417  0.564946    0.170000030775649
rs1015232   4   37470379    T   C   0.000241051 0.00401689  0.488067    0.849999949672057
rs1148091   11  123473031   G   A   -0.00145969 0.00583595  0.207721    0.770000487266739
rs10880454  12  43719461    G   A   0.00195281  0.00414411  0.626794    0.730000235121762
rs1059379   14  23567761    A   G   -0.00765548 0.00552041  0.156193    0.400000007987242
rs1001682   15  95446685    C   A   -0.00297329 0.00407919  0.60151 0.540000298635231

The alleles_dont_match_ref_gen.tsv is empty in this case.

The full log of this run can be seen here for the allele_flip_check = TRUE:

Standardising column headers.
First line of summary statistics file:
chr POS ALT REF ES  SE  AF  ID  P
Summary statistics report:
   - 9 rows
   - 9 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 6 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 9 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics
Loading required package: S4Vectors
Loading required package: stats4
BSgenome::snpsById done in 20 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 9 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
Writing in tabular format ==> /workspace/TEST-REF/alleles_dont_match_ref_gen.tsv
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
Warning: When method is an integer, must be >0.
Sorting coordinates.
Writing in tabular format ==> /workspace/TEST-REF/TEST-DATA.FLIPCHECK-TRUE.munged.tsv
Summary statistics report:
   - 0 rows (0% of original 9 rows)
   - 0 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 0 chromosomes
Done munging in 0.369 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
   SNP CHR BP A1 A2 BETA SE FRQ P
1: SNP CHR BP A1 A2 BETA SE FRQ P

And now with allele_flip_check = FALSE:

Standardising column headers.
First line of summary statistics file:
chr POS ALT REF ES  SE  AF  ID  P
Summary statistics report:
   - 9 rows
   - 9 unique variants
   - 0 genome-wide significant variants (P<5e-8)
   - 6 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 9 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 6 seconds.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
Checking for bi-allelic SNPs.
Warning: When method is an integer, must be >0.
4 SNPs (44.4%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /workspace/TEST-REF/flip_check_false/TEST-DATA.FLIPCHECK-FALSE.munged.tsv

3. Session info

It must be noted that we (a colleague) have spun up a container for MSS and so for some reason beyond my expertise, MSS and some of it's dependencies don't appear in the session info. But I can confirm that the necessary packages are up-to-date with MSS 1.7.8.

e.g.

>packageVersion("MungeSumstats")
[1] '1.7.8'
>packageVersion("BiocManager")
[1] '1.30.19'

Please let me know if I can provide any additional context.

> utils::sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/[libopenblasp-r0.3.20.so](http://libopenblasp-r0.3.20.so/)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] forcats_0.5.2   stringr_1.5.0   dplyr_1.0.10    purrr_1.0.0
[5] readr_2.1.3     tidyr_1.2.1     tibble_3.1.8    ggplot2_3.4.0
[9] tidyverse_1.3.2

loaded via a namespace (and not attached):
 [1] pillar_1.8.1        compiler_4.2.2      cellranger_1.1.0
 [4] dbplyr_2.2.1        tools_4.2.2         bit_4.0.5
 [7] timechange_0.1.1    lubridate_1.9.0     jsonlite_1.8.4
[10] googledrive_2.0.0   lifecycle_1.0.3     gargle_1.2.1
[13] gtable_0.3.1        pkgconfig_2.0.3     rlang_1.0.6
[16] reprex_2.0.2        DBI_1.1.3           cli_3.5.0
[19] parallel_4.2.2      haven_2.5.1         xml2_1.3.3
[22] withr_2.5.0         httr_1.4.4          munger_1.0.0
[25] generics_0.1.3      vctrs_0.5.1         fs_1.5.2
[28] hms_1.1.2           bit64_4.0.5         googlesheets4_1.0.1
[31] grid_4.2.2          tidyselect_1.2.0    glue_1.6.2
[34] R6_2.5.1            fansi_1.0.3         readxl_1.4.1
[37] vroom_1.6.0         tzdb_0.3.0          modelr_0.1.10
[40] logger_0.2.2        magrittr_2.0.3      ellipsis_0.3.2
[43] backports_1.4.1     scales_1.2.1        rvest_1.0.3
[46] assertthat_0.2.1    colorspace_2.0-3    utf8_1.2.2
[49] stringi_1.7.8       munsell_0.5.0       broom_1.0.2
[52] crayon_1.5.2
Al-Murphy commented 1 year ago

Hey! So there is an answer for this one.

Checking if the SNP is on the reference genome only checks the RS ID of the SNP. Whereas, checking the direction needs to look for the A1 and A2 values in the reference database. This is fine for bi-allelic SNPs however becomes an issue when SNPs are non-bi-allelic since the reference dbSNP we use has just one alt allele even though there are more for non-bi-allelic SNPs. Note you can choose to remove or keep non-bi-allelic SNPs with bi_allelic_filter parameter. This is the intended behviour from my point of view, as I can't see a better way to deal with this while leaving MSS as open as possibl;e so it works for all use cases.

Note also if you use dbSNP 155, the latest dbSNP version, which is actually the default not 144. You do get back 4 of the 9 observations:

MungeSumstats::format_sumstats(ss,ref_genome = 'GRCh37',dbSNP = 155)
#          SNP CHR        BP A1 A2        BETA         SE      FRQ         P
#1: rs11572656   1 216914249  C  T  0.00294312 0.00472115 0.488433 0.5199996
#2:  rs1015232   4  37470379  T  C  0.00024100 0.00401689 0.488067 0.8499999
#3:  rs1059379  14  23567761  A  G -0.00765548 0.00552041 0.156193 0.4000000
#4:  rs1001682  15  95446685  C  A -0.00297329 0.00407919 0.601510 0.5400003
jksull commented 1 year ago

Checking if the SNP is on the reference genome only checks the RS ID of the SNP. Whereas, checking the direction needs to look for the A1 and A2 values in the reference database.

I'm a little confused by this. The input data already has an ID column, and so they should be checked against the reference and subsequently found, regardless of the direction?

This is fine for bi-allelic SNPs however becomes an issue when SNPs are non-bi-allelic since the reference dbSNP we use has just one alt allele even though there are more for non-bi-allelic SNPs.

I may be misunderstanding this, but since there are already ID, so there is no need to check the alleles on the reference in this case?

get back 4 of the 9 observations

Thanks for pointing this out! I prefer not to use the updated db 155 as I was weary of the amount of SNPs lost with it, and admittedly I haven't had the time to test out what happens with the bi_allelic_filter=FALSE. I'm curious how allele_flip_check behaves when keeping multi-allelic SNPs, given that you describe the check requiring both A1 AND A2 to be on the reference, and that only one ALT allele is present in the database.

Having said that (and only somewhat related to this issue), I agree with the sensibility of the approach suggested in https://github.com/neurogenomics/MungeSumstats/issues/111#issuecomment-1235645760, and have been hoping to make use of such a change as I think it would solve a lot of concerns with how MSS handles multi-allelic SNPs.

Edit: Really sorry for closing and re-opening (again!)

Al-Murphy commented 1 year ago

I'm a little confused by this. The input data already has an ID column, and so they should be checked against the reference and subsequently found, regardless of the direction?

Yep they are, that's why they are all found when you don't run the allele_flip_check. However, when you run the allele_flip_check, we need to infer if the direction is correct and since MSS can't find the SNPs based on allele columns they are dropped. You can stop these being dropped using allele_flip_drop parameter. This will keep them in the dataset but note no flipping will be done (which could mean that the effect columns are incorrect).

I may be misunderstanding this, but since there are already ID, so there is no need to check the alleles on the reference in this case?

Again, only if you want to check that the direction is correct

I'm curious how allele_flip_check behaves when keeping multi-allelic SNPs, given that you describe the check requiring both A1 AND A2 to be on the reference, and that only one ALT allele is present in the database.

The code is here. In short, they can have their values flipped expect for the FRQ column since instead of just the ref and alt allele , their are multiple alt alleles so 1- current FRQ won't work. They can be flipped since the code first looks for a match to the ref allele in A1 or A2 so even if the alt doesn't match it's okay.

Having said that (and only somewhat related to this issue), I agree with the sensibility of the approach suggested in https://github.com/neurogenomics/MungeSumstats/issues/111#issuecomment-1235645760, and have been hoping to make use of such a change as I think it would solve a lot of concerns with how MSS handles multi-allelic SNPs.

I'm not sure what you mean by this, sorry?

jksull commented 1 year ago

Thanks for the quick response! I understand what you mean now, sorry for the confusion. I do however have some further

The code is here. In short, they can have their values flipped expect for the FRQ column since instead of just the ref and alt allele , their are multiple alt alleles so 1- current FRQ won't work. They can be flipped since the code first looks for a match to the ref allele in A1 or A2 so even if the alt doesn't match it's okay.

From my understanding, FRQ is very data-dependent. After some consideration, I am struggling to see the reason why 1- current FRQ shouldn't be computed for multi-allelic SNPs, as long as only one of the non-biallelic SNPs are

You can stop these being dropped using allele_flip_drop parameter. This will keep them in the dataset but note no flipping will be done (which could mean that the effect columns are incorrect).

Does this mean that no flipping will be done at all, or just for those that meet the criteria?

Also, to your earlier point:

Note also if you use dbSNP 155, the latest dbSNP version, which is actually the default not 144. You do get back 4 of the 9 observations:

I realise now that these four SNPs are those that are non-bialellic, and so they must be missing from the dbSNP 144 database, do I understand that correctly?

Al-Murphy commented 1 year ago

From my understanding, FRQ is very data-dependent. After some consideration, I am struggling to see the reason why 1- current FRQ shouldn't be computed for multi-allelic SNPs, as long as only one of the non-biallelic SNPs are

So my understanding is that FRQ means the frequency at a given position, often but not always, the minor allele frequency. If a SNP is not bi-allelic, that means there is another allele in that location which has it's own frequency. This means that flipping it by doing 1 - FRQ of the allele wouldn't work since the total FRQ at the location is not just this SNP and the major allele, it is this SNP, the major allele and the other allele.

Does this mean that no flipping will be done at all, or just for those that meet the criteria?

Just those that meet the criteria.

I realise now that these four SNPs are those that are non-bialellic, and so they must be missing from the dbSNP 144 database, do I understand that correctly?

Potentially, I haven't checked the reason why, feel free to look at the bioconductor dbSNP reference sets if you like and compare to running MSS with the logging of dropped SNPs on.

Al-Murphy commented 1 year ago

I believe this is now sorted but do reopen if not?

Cheers, Alan.