New error message: 'error: One or more query variants were not found in 1000G VCF file. '

karlsmithbyrne commented 2 years ago

Since yesterday, same query now returns the error message:

error: One or more query variants were not found in 1000G VCF file.

Code: LDmatrix(SNP_List, pop = "CEU", r2d = "r2", token = "my_token", file = FALSE)

Attached is a list of SNPs that trigger the error.

LDLinkR_Error_Query.csv

Harry-young-0 commented 2 years ago

I'm having the same issue with a list that worked the week before last! Using: LDmatrix(snps, pop = "EUR", r2d = "r2", token = "token")

timyers commented 2 years ago

There was a recent update to the LDlink server on 05 Apr 2022 that included a change to error checking and handling. Please see LDlink version history. This might be the issue here but I will continue to investigate. I sent the list of the variants provided in the CSV file above to the developer and will post here when I hear back. Thanks for your patience and thanks for using LDlinkR.

timyers commented 2 years ago

The following is summarized from information passed on by the LDlink developer team. The error was introduced in the LDlink 5.3 release (2022-04-05). LDhap, LDmatrix and SNPclip now throw the error if a user-submitted SNP does not yield any results from the 1000G data. This was done to support SNPs whose 1000G positions do not match dbSNP. The drawback is that every user-submitted SNP must now have 1000G results. The new approach was adopted to allow support for GRCh38 and GRCh38 High Coverage, which have poor rsID annotations. In the LDlinkR_Error_Query.csv query list above, it looks like the last SNP, rs11429065, was not found in 1000G (GRCh37), which now causes this error to be thrown. I hope that helps. Let me know if you have any more questions or concerns.
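Until a server-side fix lands, one stopgap is to drop the variant reported as missing and resubmit the rest. This is only a minimal sketch: the four rsIDs are an illustrative subset of the attached CSV, and LDLINK_TOKEN is an assumed environment variable name, not something LDlinkR reads on its own. The server call is skipped unless a token and the package are both available.

```r
# Illustrative subset of the query list; the full list is in the attached CSV.
snp_list <- c("rs55999874", "rs62448339", "rs11975856", "rs11429065")

# Variant reported above as not found in 1000G (GRCh37).
not_in_1000g <- "rs11429065"

# Keep only the variants the server is expected to resolve.
snps_to_query <- setdiff(snp_list, not_in_1000g)
print(snps_to_query)

# Only contact the LDlink server when a token and the package are available.
# LDLINK_TOKEN is an assumed variable name chosen for this sketch.
token <- Sys.getenv("LDLINK_TOKEN")
if (nzchar(token) && requireNamespace("LDlinkR", quietly = TRUE)) {
  ld <- LDlinkR::LDmatrix(snps_to_query, pop = "CEU", r2d = "r2",
                          token = token, file = FALSE)
  print(ld)
}
```

Dropping a variant changes the resulting matrix, so this only makes sense when the missing SNP is not essential to the analysis.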

timyers commented 2 years ago

Also, thank you both for bringing this issue to our attention. Alternative solutions are already in the works as a result.

Harry-young-0 commented 2 years ago

Fantastic, thanks for the swift response to this. To clarify, does this mean you're shifting to a more current genome build position approach, i.e. we should use positions as opposed to rsIDs? Or just that we should wait for a workaround for using rsIDs not in the 1000G data?

Also, is there a way of checking which rsIDs are in the 1000G data within R, without having to download the full VCF file? (Completely OK if this is just not a thing!)

Cheers,

Harry

karlsmithbyrne commented 2 years ago

Indeed, thank you for getting back so quickly! It's a really excellent service.

timyers commented 2 years ago

Both the LDlink web tool and the LDlinkR R package accept either rsIDs or genomic coordinates as input for most of the available tools, including LDmatrix. Currently LDlinkR only supports genome build GRCh37. However, a recent upgrade to the LDlink web tool (Release 5.2, 2022-01-03) added support for both GRCh37 and GRCh38. I hope to add support for GRCh38 to LDlinkR soon. But the short answer is probably to just wait for a workaround for using rsIDs not in the 1000G data. I apologize for the inconvenience.

I'm not aware of a great way to check which rsIDs are in the 1000G data within R. But I did find an R package available on GitHub called ieugwasr that might do the trick for you: https://github.com/MRCIEU/ieugwasr. Its ld_reflookup() function interacts with the OpenGWAS API, which houses LD reference panels for the 5 super-populations in the 1000 Genomes reference panel. But it only includes bi-allelic SNPs with a MAF > 0.01. The function returns a vector of the rsIDs that are in the LD reference panel. It worked for a few of the SNPs that I selected from the CSV file. See the simple example and its output below. Maybe this will work for you all too.

ieugwasr::ld_reflookup(rsid = c("rs55999874", "rs62448339", "rs11975856", "rs11429065"), pop = "EUR")
[1] "rs11975856" "rs55999874" "rs62448339"
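To turn that lookup into a pre-check, the query list can be split into found and missing variants before it is sent to LDmatrix. The sketch below is hedged: split_by_reference(), opengwas_lookup() and replayed_lookup() are hypothetical helper names, and the offline stand-in simply replays the output shown above so the logic can be tried without network access. The OpenGWAS panel is not the same as LDlink's 1000G data (it is restricted to bi-allelic SNPs with MAF > 0.01), so treat the result as a hint rather than a guarantee.

```r
# Split a query list into variants found / not found by a reference lookup.
# `lookup` is any function that takes a character vector of rsIDs and returns
# the subset present in the reference panel.
split_by_reference <- function(snps, lookup) {
  found <- lookup(snps)
  list(
    found   = intersect(snps, found),  # keeps the original query order
    missing = setdiff(snps, found)
  )
}

# Lookup backed by the OpenGWAS LD reference panel (needs network access).
opengwas_lookup <- function(snps, pop = "EUR") {
  ieugwasr::ld_reflookup(rsid = snps, pop = pop)
}

# Offline stand-in that replays the output shown above, so the helper can be
# tried without contacting the API.
replayed_lookup <- function(snps) {
  intersect(c("rs11975856", "rs55999874", "rs62448339"), snps)
}

query <- c("rs55999874", "rs62448339", "rs11975856", "rs11429065")
res <- split_by_reference(query, replayed_lookup)
print(res$missing)  # [1] "rs11429065"
```

For a live check, pass opengwas_lookup in place of replayed_lookup.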

I hope this helps some. Many thanks to you both again for bringing this issue to our attention.

Harry-young-0 commented 2 years ago

Fab, thanks for this. I'll hold on for a workaround. I already have the ieugwasr package, so I'll have a play with that too.

cheers,

Harry

kdack commented 2 years ago

I am also having this issue. I notice the error is replicated using the online LDmatrix tool with build GRCh37 but not GRCh38, so it is build related, but I am not sure why: all SNPs are found using ieugwasr::ld_reflookup(), and when I look on dbSNP all variants seem to have a 1000G entry and a GRCh37 position?

snps.xlsx

timyers commented 2 years ago

It looks like the SNP rs4853736 from your list is causing the error. rs4853736 returns two results with different coordinates for 1000G in GRCh37 (chr2:191669926 and chr2:191669930) but only one result in GRCh38. This is causing the error to be thrown due to the logic used to catch query SNPs with more than one 1000G result. We apologize for the inconvenience. The development team is already planning to implement a patch as soon as possible that will address this issue. Thank you for reporting this!
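The kind of ambiguity described here can be illustrated with a small position table. This is not LDlink's server logic, only a sketch of the idea of flagging an rsID that maps to more than one coordinate. The two rs4853736 rows use the GRCh37 coordinates quoted above; "rs_placeholder" and its position are invented purely for illustration.

```r
# Toy position table. The two rs4853736 rows use the GRCh37 coordinates quoted
# above; "rs_placeholder" and its position are invented for illustration only.
positions <- data.frame(
  rsid = c("rs4853736", "rs4853736", "rs_placeholder"),
  chr  = c("chr2", "chr2", "chr1"),
  pos  = c(191669926, 191669930, 1000),
  stringsAsFactors = FALSE
)

# Count distinct chr:pos values per rsID and flag any ID with more than one.
n_positions <- tapply(paste(positions$chr, positions$pos, sep = ":"),
                      positions$rsid,
                      function(x) length(unique(x)))
ambiguous <- names(n_positions)[n_positions > 1]
print(ambiguous)  # [1] "rs4853736"
```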

fadista commented 2 years ago

I am having the same issue. Would be great to have a workaround. Many thanks for this fantastic tool.

timyers commented 2 years ago

A patch is in the works. I reviewed it yesterday. The dev team is wrapping up testing as I write and will proceed with production deployment if all looks good. They are aiming for the end of this week. Many thanks for your continued patience and interest in our tool.

timyers commented 2 years ago

The new release with a patch that addresses this issue is now available. Let us know if you encounter any problems.

CBIIT / LDlinkR

New error message: 'error: One or more query variants were not found in 1000G VCF file. ' #16