Closed jksull closed 1 year ago
Hey! So there is an answer for this one.
Checking whether a SNP is on the reference genome only checks the SNP's RS ID, whereas checking the direction needs to look up the A1 and A2 values in the reference database. This is fine for bi-allelic SNPs but becomes an issue for non-bi-allelic SNPs, since the reference dbSNP we use holds just one alt allele even though such SNPs have more. Note you can choose to remove or keep non-bi-allelic SNPs with the bi_allelic_filter parameter. This is the intended behaviour from my point of view, as I can't see a better way to deal with this while leaving MSS as open as possible so it works for all use cases.
Note also that if you use dbSNP 155, the latest dbSNP version (which is actually the default, not 144), you do get back 4 of the 9 observations:
MungeSumstats::format_sumstats(ss,ref_genome = 'GRCh37',dbSNP = 155)
# SNP CHR BP A1 A2 BETA SE FRQ P
#1: rs11572656 1 216914249 C T 0.00294312 0.00472115 0.488433 0.5199996
#2: rs1015232 4 37470379 T C 0.00024100 0.00401689 0.488067 0.8499999
#3: rs1059379 14 23567761 A G -0.00765548 0.00552041 0.156193 0.4000000
#4: rs1001682 15 95446685 C A -0.00297329 0.00407919 0.601510 0.5400003
> Checking if the SNP is on the reference genome only checks the RS ID of the SNP. Whereas, checking the direction needs to look for the A1 and A2 values in the reference database.
I'm a little confused by this. The input data already has an ID column, so the SNPs should be checked against the reference and subsequently found, regardless of the direction?
> This is fine for bi-allelic SNPs however becomes an issue when SNPs are non-bi-allelic since the reference dbSNP we use has just one alt allele even though there are more for non-bi-allelic SNPs.
I may be misunderstanding this, but since there are already IDs, surely there is no need to check the alleles on the reference in this case?
> get back 4 of the 9 observations
Thanks for pointing this out! I prefer not to use the updated dbSNP 155 as I was wary of the number of SNPs lost with it, and admittedly I haven't had the time to test what happens with bi_allelic_filter=FALSE. I'm curious how allele_flip_check behaves when keeping multi-allelic SNPs, given that you describe the check as requiring both A1 AND A2 to be on the reference, and that only one ALT allele is present in the database.
Having said that (and only somewhat related to this issue), I agree with the sensibility of the approach suggested in https://github.com/neurogenomics/MungeSumstats/issues/111#issuecomment-1235645760, and have been hoping to make use of such a change as I think it would solve a lot of concerns with how MSS handles multi-allelic SNPs.
Edit: Really sorry for closing and re-opening (again!)
> I'm a little confused by this. The input data already has an ID column, and so they should be checked against the reference and subsequently found, regardless of the direction?
Yep, they are; that's why they are all found when you don't run the allele_flip_check. However, when you run the allele_flip_check, we need to infer whether the direction is correct, and since MSS can't find the SNPs based on the allele columns they are dropped. You can stop these being dropped by setting the allele_flip_drop parameter to FALSE. This will keep them in the dataset, but note that no flipping will be done (which could mean that the effect columns are incorrect).
> I may be misunderstanding this, but since there are already IDs, surely there is no need to check the alleles on the reference in this case?
Again, the alleles are only needed if you want to check that the direction is correct.
> I'm curious how allele_flip_check behaves when keeping multi-allelic SNPs, given that you describe the check requiring both A1 AND A2 to be on the reference, and that only one ALT allele is present in the database.
The code is here. In short, they can have their values flipped except for the FRQ column: instead of just the ref and alt allele, there are multiple alt alleles, so 1 - current FRQ won't work. They can be flipped because the code first looks for a match to the ref allele in A1 or A2, so even if the alt doesn't match it's okay.
> Having said that (and only somewhat related to this issue), I agree with the sensibility of the approach suggested in https://github.com/neurogenomics/MungeSumstats/issues/111#issuecomment-1235645760, and have been hoping to make use of such a change as I think it would solve a lot of concerns with how MSS handles multi-allelic SNPs.
I'm not sure what you mean by this, sorry?
Thanks for the quick response! I understand what you mean now, sorry for the confusion. I do however have some further questions.
> The code is here. In short, they can have their values flipped except for the FRQ column: instead of just the ref and alt allele, there are multiple alt alleles, so 1 - current FRQ won't work. They can be flipped because the code first looks for a match to the ref allele in A1 or A2, so even if the alt doesn't match it's okay.
From my understanding, FRQ is very data-dependent. After some consideration, I am struggling to see the reason why 1 - current FRQ shouldn't be computed for multi-allelic SNPs, as long as only one of the non-biallelic SNPs are
> You can stop these being dropped using allele_flip_drop parameter. This will keep them in the dataset but note no flipping will be done (which could mean that the effect columns are incorrect).
Does this mean that no flipping will be done at all, or just for those that meet the criteria?
Also, to your earlier point:
> Note also if you use dbSNP 155, the latest dbSNP version, which is actually the default not 144. You do get back 4 of the 9 observations:
I realise now that these four SNPs are the non-bi-allelic ones, and so they must be missing from the dbSNP 144 database. Do I understand that correctly?
> From my understanding, FRQ is very data-dependent. After some consideration, I am struggling to see the reason why 1 - current FRQ shouldn't be computed for multi-allelic SNPs, as long as only one of the non-biallelic SNPs are
So my understanding is that FRQ means the allele frequency at a given position: often, but not always, the minor allele frequency. If a SNP is not bi-allelic, there is another allele at that location with its own frequency. This means that flipping by computing 1 - FRQ of the allele wouldn't work, since the total frequency at the location is not split between just this SNP and the major allele; it is split between this SNP, the major allele and the other allele.
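To make the arithmetic concrete, here is a small worked example. The frequencies are made up for illustration (they are not taken from the data in this issue), and the symbols $f_{\mathrm{ref}}$, $f_{\mathrm{alt}_1}$, $f_{\mathrm{alt}_2}$ are just notation introduced here for the allele frequencies at one site.

```latex
% Bi-allelic site: the two allele frequencies sum to 1,
% so flipping the effect allele is just the complement.
\begin{align*}
f_{\mathrm{ref}} + f_{\mathrm{alt}} &= 1
  &&\Rightarrow\quad 1 - f_{\mathrm{alt}} = f_{\mathrm{ref}}
\end{align*}

% Tri-allelic site: three frequencies sum to 1,
% so the complement of one alt allele also includes the other alt allele.
\begin{align*}
f_{\mathrm{ref}} + f_{\mathrm{alt}_1} + f_{\mathrm{alt}_2} &= 1
  &&\Rightarrow\quad 1 - f_{\mathrm{alt}_1} = f_{\mathrm{ref}} + f_{\mathrm{alt}_2}
\end{align*}

% Illustrative numbers: f_ref = 0.70, f_alt1 = 0.20, f_alt2 = 0.10
\begin{align*}
1 - f_{\mathrm{alt}_1} = 1 - 0.20 = 0.80 \neq 0.70 = f_{\mathrm{ref}}
\end{align*}
```

So the complement only recovers the reference allele frequency when the remaining alt alleles have frequency zero, which is the bi-allelic case.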
> Does this mean that no flipping will be done at all, or just for those that meet the criteria?
Just those that meet the criteria.
> I realise now that these four SNPs are the non-bi-allelic ones, and so they must be missing from the dbSNP 144 database. Do I understand that correctly?
Potentially; I haven't checked the reason why. Feel free to look at the Bioconductor dbSNP reference sets if you like and compare them with the output of running MSS with the logging of dropped SNPs on.
I believe this is now sorted, but do reopen if not.
Cheers, Alan.
1. Bug description
Take the following input data:
After using default parameters (MSS 1.7.8, GRCh37, and dbSNP 144), all of the above variants get output to alleles_dont_match_ref_gen.tsv.
alleles_dont_match_ref_gen.tsv
2. Expected behaviour
However, by simply setting allele_flip_check = FALSE, all of the variants are indeed found on the reference:
TEST-DATA.FLIPCHECK-FALSE.munged.tsv
The alleles_dont_match_ref_gen.tsv is empty in this case.
The full log of this run can be seen here for allele_flip_check = TRUE:
And now with allele_flip_check = FALSE:
3. Session info
It must be noted that we (a colleague) have spun up a container for MSS, and so, for some reason beyond my expertise, MSS and some of its dependencies don't appear in the session info. But I can confirm that the necessary packages are up-to-date with MSS 1.7.8.
e.g.
Please let me know if I can provide any additional context.