Closed zackgomez closed 2 years ago
Hi @zackgomez.
Thank you for your interest in LDlink and feedback!
Both sound doable. You are correct, a batch Mongo query for LDmatrix can improve performance a bit. Additionally, a batch LDpair API could be supported by introducing a POST API route that accepts such inputs.
I created internal trackers for these two suggestions. We'll take a look and get back to you.
I have cloned the repository and populated MongoDB according to the README in scripts/dbSNP update. However, it looks like I need to populate a data/ directory with some VCF and other files. Are there instructions on how to do so?
Specifically, for LDpair and LDmatrix it looks like I need the files ending with phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz for the vcf_dir, and some files in the pop_dir.
I browsed the ftp server and found this directory https://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/ but couldn't find the specific files.
Hi @zackgomez,
The latest GRCh37 1000G dataset can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
The population files (located in pop_dir) are created by subsetting each population's sample variables from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
I've uploaded the population files (subsetted population sample variables) here: https://github.com/CBIIT/nci-webtools-dceg-linkage/tree/master/LDlink/files/1000G_population_samples
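For anyone who wants to regenerate the population files rather than download them, the subsetting described above could look roughly like the sketch below. It assumes the panel file is tab-separated with a header containing `sample` and `pop` columns, and that each population file is a plain list of sample IDs named `<POP>.txt`; both the function names and the output naming are hypothetical, so compare the result against the uploaded files before relying on it.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path


def split_panel_by_population(panel_text):
    """Group sample IDs by population code.

    Assumes a tab-separated panel with 'sample' and 'pop' header columns.
    """
    reader = csv.DictReader(io.StringIO(panel_text), delimiter="\t")
    samples = defaultdict(list)
    for row in reader:
        samples[row["pop"]].append(row["sample"])
    return dict(samples)


def write_population_files(samples_by_pop, out_dir):
    """Write one '<POP>.txt' file per population, one sample ID per line.

    The file naming is an assumption made for this sketch.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for pop, samples in samples_by_pop.items():
        (out_path / f"{pop}.txt").write_text("\n".join(samples) + "\n")


# Illustrative rows in the assumed panel layout.
example_panel = (
    "sample\tpop\tsuper_pop\tgender\n"
    "HG00096\tGBR\tEUR\tmale\n"
    "HG00097\tGBR\tEUR\tfemale\n"
    "NA06984\tCEU\tEUR\tmale\n"
)
groups = split_panel_by_population(example_panel)
```

In practice the text of integrated_call_samples_v3.20130502.ALL.panel would be read from disk and passed in place of `example_panel`, followed by a call to `write_population_files(groups, pop_dir)`.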
I was able to get the server up and running thanks to your instructions. Thank you very much.
I did some light profiling of the LDmatrix endpoint and agree that the serial MongoDB calls are not the issue. The majority of the time is spent computing the pairwise statistics. This makes a lot of sense, as the computation is O(N^2) in the number of SNPs.
I plan to prototype a batch LDpair endpoint and see what the performance improvement possibilities are.
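To make the difference concrete, here is a minimal sketch of computing R^2 and D' only for an explicit list of pairs instead of the full matrix. It is not LDlink's implementation: it assumes phased haplotypes are already loaded as 0/1 lists keyed by SNP ID, uses the textbook two-locus definitions, and the names `pair_ld` and `batch_pair_ld` are hypothetical.

```python
def pair_ld(hap_a, hap_b):
    """Return (r2, d_prime) for two biallelic sites.

    Each argument is a list of 0/1 alleles, one entry per haplotype,
    in the same haplotype order. Returns (None, None) if either site
    is monomorphic, since both statistics are undefined there.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None, None
    r2 = d * d / denom
    # The normalising constant for D' depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max


def batch_pair_ld(haplotypes, pairs):
    """Compute LD only for the requested (snp_a, snp_b) pairs."""
    return {(a, b): pair_ld(haplotypes[a], haplotypes[b]) for a, b in pairs}


# Toy haplotypes with placeholder SNP IDs.
haps = {
    "rs1": [0, 0, 1, 1],
    "rs2": [0, 0, 1, 1],
    "rs3": [0, 1, 0, 1],
}
results = batch_pair_ld(haps, [("rs1", "rs2"), ("rs1", "rs3")])
```

The cost here scales with the number of requested pairs. A full matrix over 1000 SNPs evaluates 1000 * 999 / 2 = 499,500 distinct pairs regardless of how many of them are actually needed.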
@zackgomez Happy to help! I will close this issue for now. Look out for new updates to LDlink coming soon. Your batch LDpair suggestion has been put on our roadmap for a future release.
Hi,
I am part of a team working on an application that needs to compute LDpair R^2 and D' values for many thousands of pairs of SNPs. Currently we are using the LDmatrix method with batches of 1000 SNPs to reduce the number of requests. This is much faster than issuing 1000 LDpair requests, but it still takes quite some time and wastes resources computing the linkage for many pairs that we don't care about.
I have two potential contributions to this project to help with our use case.
Are you interested in taking either of these changes upstream?