CBIIT / nci-webtools-dceg-linkage

LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups. Each included application is specialized for querying and displaying unique aspects of linkage disequilibrium.
https://ldlink.nci.nih.gov
MIT License

LDMatrix/LDPair Optimization Opportunity #159

Closed zackgomez closed 2 years ago

zackgomez commented 2 years ago

Hi,

I am part of a team working on an application that needs to compute LDpair R_2 and D' values for many thousands of pairs of SNPs. Currently we use the LDmatrix method with batches of 1000 SNPs to reduce the number of requests. This is much faster than issuing 1000 LDpair requests, but it still takes quite some time and wastes resources computing linkage for many pairs we don't care about.

I have two potential contributions to this project to help with our use case.

  1. LDmatrix makes serial database calls to MongoDB for each SNP passed. For example, https://github.com/CBIIT/nci-webtools-dceg-linkage/blob/master/LDlink/LDmatrix.py#L96 is called once for every SNP passed here: https://github.com/CBIIT/nci-webtools-dceg-linkage/blob/master/LDlink/LDmatrix.py#L164. These calls to Mongo could be batched to dramatically improve performance and efficiency (see the sketch after this list).
  2. Introduce a batch LDpair API. This is an API change rather than just an implementation change, but it directly implements the use case we need. It would be a REST-only API that takes a set of, say, 500 pairs (max 1000 SNPs) and returns both the R_2 and D' values for each pair.
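
To illustrate point 1, here is a rough sketch of what a batched lookup could look like. It assumes a pymongo collection with an `id` field holding the rsID digits; the actual collection and field names in LDmatrix.py may differ.

```python
from pymongo import MongoClient


def get_coords_batch(db, snps):
    """Fetch dbSNP records for many rsIDs in a single query instead of
    one find_one() per SNP. The collection/field names are assumptions;
    adjust them to match the real schema."""
    ids = [snp.lower().strip("rs") for snp in snps]
    records = db.dbsnp.find({"id": {"$in": ids}})
    # Index results by rsID so callers can preserve the input order.
    by_id = {rec["id"]: rec for rec in records}
    return [by_id.get(i) for i in ids]


# Usage: one round trip for the whole batch instead of len(snps) round trips.
# client = MongoClient("localhost", 27017)
# coords = get_coords_batch(client["LDLink"], ["rs3", "rs4", "rs148890987"])
```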

Are you interested in taking either of these changes upstream?

kvnjng commented 2 years ago

Hi @zackgomez.

Thank you for your interest in LDlink and feedback!

Both sound doable. You are correct that a batched Mongo query for LDmatrix could improve performance somewhat. Additionally, a batch LDpair API could be supported by introducing a POST API route that allows such inputs.
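
As a rough sketch of the shape such a POST route could take (the route path and the `calculate_pair` helper below are placeholders for illustration, not the existing API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

MAX_PAIRS = 500  # assumed per-request cap, mirroring the numbers suggested above


def calculate_pair(snp1, snp2, pop):
    # Placeholder for the existing LDpair computation; it should return
    # the R_2 and D' values for the pair.
    raise NotImplementedError


@app.route("/ldpair_batch", methods=["POST"])
def ldpair_batch():
    # Expected body: {"pop": "CEU", "pairs": [["rs3", "rs4"], ...]}
    body = request.get_json(force=True)
    pairs = body.get("pairs", [])
    if len(pairs) > MAX_PAIRS:
        return jsonify({"error": f"limit is {MAX_PAIRS} pairs per request"}), 400
    results = []
    for snp1, snp2 in pairs:
        r2, d_prime = calculate_pair(snp1, snp2, body.get("pop", "CEU"))
        results.append({"snp1": snp1, "snp2": snp2, "R_2": r2, "D_prime": d_prime})
    return jsonify(results)
```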

I created internal trackers for these two suggestions. We'll take a look and get back to you.

zackgomez commented 2 years ago

I have cloned the repository and populated MongoDB according to the README in scripts/dbSNP update. However, it looks like I also need to populate a data/ directory with some VCF and other files. Are there instructions on how to do so?

Specifically, for LDpair and LDmatrix it looks like I need the files ending in phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz for vcf_dir, and some files in pop_dir.

I browsed the ftp server and found this directory https://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/ but couldn't find the specific files.

kvnjng commented 2 years ago

Hi @zackgomez,

The latest GRCh37 1000G dataset can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

The population files (located in pop_dir) are created by subsetting each population's sample variables from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
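
For reference, a minimal sketch of that subsetting step, assuming the panel file's tab-separated sample/pop/super_pop/gender columns and that pop_dir expects one sample ID per line (check the actual header and expected format):

```python
import csv


def write_population_samples(panel_path, pop_code, out_path):
    """Write the sample IDs belonging to one population (e.g. 'CEU')
    to a file with one sample ID per line."""
    with open(panel_path, newline="") as panel, open(out_path, "w") as out:
        for row in csv.DictReader(panel, delimiter="\t"):
            if row["pop"] == pop_code:
                out.write(row["sample"] + "\n")


# write_population_samples("integrated_call_samples_v3.20130502.ALL.panel",
#                          "CEU", "CEU.txt")
```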

I've uploaded the population files (subsetted population sample variables) here: https://github.com/CBIIT/nci-webtools-dceg-linkage/tree/master/LDlink/files/1000G_population_samples

zackgomez commented 2 years ago

I was able to get the server up and running thanks to your instructions. Thank you very much.

I did some light profiling of the LDmatrix endpoint and agree that the serial MongoDB calls are not the main issue. The majority of the time is spent computing the pairwise statistics, which makes sense given that it is an O(N^2) computation over the input SNPs.
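
For anyone reproducing this, the profiling can be done along these lines with cProfile. The `calculate_matrix` name below is a stand-in for the actual LDmatrix entry point, not its real name:

```python
import cProfile
import pstats


def calculate_matrix(snps, pop):
    # Stand-in for the real LDmatrix computation in LDmatrix.py.
    raise NotImplementedError("replace with the actual LDmatrix function")


def profile_ldmatrix(snps, pop="CEU"):
    """Run one LDmatrix computation under cProfile and print the 20
    functions with the highest cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    calculate_matrix(snps, pop)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```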

I plan to prototype a batch LDpair endpoint and see what the performance improvement possibilities are.

kvnjng commented 2 years ago

@zackgomez Happy to help! I will close this issue for now. Look out for new updates to LDlink coming soon. Your batch LDpair suggestion has been put on our roadmap for a future release.