cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License
41 stars 43 forks source link

`LDtools` package extension #153

Open gaow opened 2 years ago

gaow commented 2 years ago

@changebio moving our slack discussions here.

Objective

It is non-trivial to handle all these data merger and region extraction etc for integrative analysis. I would like to revisit and extend LDtools as a utility package to handle summary stats data in the context of data integration -- including fine-mapping, LDSC, PRS, TWAS.

Relevant existing tools

  1. LDtools developed by @changebio for our analysis pipelines.
  2. Polyfun software has some utility functions to compute LD by chunks, with PLINK input format https://github.com/omerwe/polyfun/blob/master/finemapper.py#L476 as well as with bgen format in which case ldstore is used https://github.com/omerwe/polyfun/blob/master/finemapper.py#L385
  3. bigsnpr::snp_match
  4. https://rdrr.io/github/MichelNivard/GenomicSEM/man/sumstats.html
  5. https://github.com/precimed/python_convert
  6. https://github.com/matthijsz/pysumstats
  7. https://bioconductor.riken.jp/packages/3.14/bioc/vignettes/MungeSumstats/inst/doc/MungeSumstats.html
  8. Google search sumstats or gwas summary statistics there are various repos on github

As a user, I would rather use ldsc code than LDtools because at least they are better tested (by application to multiple papers and dataset). I wish LDtools had used some of these code so we don't reinvent the wheel. However, what polyfun/ldsc has is not good enough.

Also, survey each application fine-mapping, LDSC, PRS, TWAS and comment on how they handle the issue.

Proposal

Let's take a look at polyfun/ldsc and other related codes, and rework LDtools if their implementation is better. LDtools also contains many other features. Here is the scope of the proposed extended tool:

  1. Support VCF summary statistics format: convert and combine summary stats in VCF format (possibly through rpy2 to the gwas_vcf R package)
  2. Merge summary stats and make sign flips deserves a comprehensive discussion
  3. Compute LD for given region, adjusting for covariates and GRM ; need to support VCF, bgen and plink
  4. Built-in liftover for summary stats or LD index
  5. Other data integration utility features we use repeatedly in xqtl-pipeline, as we move forward with the analysis

For each of these steps we should discuss what others have done and how they are not good (either feature or computational speed limitations). And explain what we did. We should also review work related to summary stats QC and possibly include those features formally -- I see it here and there in different papers but I have not seen a systematic discussion on it.

We may want to rename LDtools to sumstats_merger? Unlike most applications listed, I prefer a Python library though, not a command tool. And it should focus on one thing and one thing only: merge summary stats (association effects and LD) properly. We should take a look at each of the existing tools and carefully screen them; and only investigate the features relevant to summary stats merging.

@changebio depending on your interest, but if this is a high quality package with a "simple" focus, we should aim to publish it. I believe it's going to be very useful, and can even determine. However there is lot of responsibility involved even to just do this "simple" task right. That means I will have zero tolerance of sloppiness in both the scientific and the engineering aspects if we decide to take this seriously. I think the most important is to identify and summarize various pitfalls when working with summary stats; then provide solution to some of them, if not all.

changebio commented 2 years ago

@gaow This is a very informative post. I am not sure the name sumstats_merger is appropriate to LDtools. Just like you mentioned, LDtools contains many other features. If we want to make a library focusing on one thing, would it be better to make a new repo (sumstats_merger) to put all the code related with merging sumstats in it. It looks like sumstats_merger is a python version of MungeSumstats. Before doing this, I have the following questions.

  1. Why don’t we use MungeSumstats? What is the drawback of it?
  2. What pitfalls do we want to solve?
  3. Who will use it?

The advantage of our sumstats_merger that I can think about is matching SNPs. Bigsnpr::snp_match can only handle paired SNPs. We can handle unpaired SNPs and duplicated Positions.

changebio commented 2 years ago

About computing LD, ldstore has two version, one is C++ version, and another is python version. I checked the code of python version, which I don't think it can handle large genotype data. The C++ version, it also using the chunk method, it looks like the chunk LD matrix will output to temp files. Then, read them into memory and merge them. Not sure I understand it right, Further confirmation is required. I believe our method has speed advantage. Maybe people wan't care about it much.

gaow commented 2 years ago

If we want to make a library focusing on one thing

We need to define the "one thing". My starting point, by posting this in the xqtl-pipeline repo, is that i would like to iron out all the summary stats extraction hassles relevant to analysis involved in this repo into a utility package. Whatever involved in this "one thing" should be packaged. LD related computations, in fact, should also be considered summary statistics (cor(genotype)). Our repo offers a diverse variety of analysis; a tool able to support our repo should be reasonably powerful in many other more generic user cases.

I completely agree with you that we need to be clearer about the scope of the package. I will flush them out by updating the original post. In general you can ask "why not package?" or "why not combination of and ** packages" considering I've listed a few of them above. To answer it properly, we need to list the features of our proposed library, and argue feature by feature why other tools are not useful. At the high level it'd be one of these cases:

  1. Complete lack of a feature
  2. Incomplete consideration of scenarios
  3. Inability to handle multiple and not just two lists of variants
  4. Computational challenge with large samples
  5. Lack of signs of good maintenance and unit tests -- which is the major reason I dont want to use any of them!!

Additionally we may be able to come up with a much better user interface which is important, but "better" is subjective ...

gaow commented 2 years ago

RE LD: as pointed out , to compute LD correctly we need to adjust for covariates and GRM ... there are some other subtlety you might not realize at this point. That's why we should make the list (or even more formally write something up) before doing any coding. @changebio

changebio commented 2 years ago

I agree. To do that: First, we need to learn more about the code of these packages to find the advantages and disadvantages of them. Second, we need to write up our goal and think about the interface.

In ldstore, Does it adjust the covariates and GRM?

gaow commented 2 years ago

In ldstore, Does it adjust the covariates and GRM?

not as far as i am aware of