Open gaow opened 2 years ago
@gaow This is a very informative post. I am not sure the name sumstats_merger
is appropriate to LDtools
. Just like you mentioned, LDtools contains many other features. If we want to make a library focusing on one thing, would it be better to make a new repo (sumstats_merger
) to put all the code related with merging sumstats in it.
It looks like sumstats_merger
is a python version of MungeSumstats
.
Before doing this, I have the following questions.
The advantage of our sumstats_merger that I can think about is matching SNPs. Bigsnpr::snp_match
can only handle paired SNPs. We can handle unpaired SNPs and duplicated Positions.
About computing LD, ldstore
has two version, one is C++ version, and another is python version. I checked the code of python version, which I don't think it can handle large genotype data. The C++ version, it also using the chunk method, it looks like the chunk LD matrix will output to temp files. Then, read them into memory and merge them. Not sure I understand it right, Further confirmation is required.
I believe our method has speed advantage. Maybe people wan't care about it much.
If we want to make a library focusing on one thing
We need to define the "one thing". My starting point, by posting this in the xqtl-pipeline
repo, is that i would like to iron out all the summary stats extraction hassles relevant to analysis involved in this repo into a utility package. Whatever involved in this "one thing" should be packaged. LD related computations, in fact, should also be considered summary statistics (cor(genotype)
). Our repo offers a diverse variety of analysis; a tool able to support our repo should be reasonably powerful in many other more generic user cases.
I completely agree with you that we need to be clearer about the scope of the package. I will flush them out by updating the original post. In general you can ask "why not package?" or "why not combination of and ** packages" considering I've listed a few of them above. To answer it properly, we need to list the features of our proposed library, and argue feature by feature why other tools are not useful. At the high level it'd be one of these cases:
Additionally we may be able to come up with a much better user interface which is important, but "better" is subjective ...
RE LD: as pointed out , to compute LD correctly we need to adjust for covariates and GRM ... there are some other subtlety you might not realize at this point. That's why we should make the list (or even more formally write something up) before doing any coding. @changebio
I agree. To do that: First, we need to learn more about the code of these packages to find the advantages and disadvantages of them. Second, we need to write up our goal and think about the interface.
In ldstore
, Does it adjust the covariates and GRM?
In ldstore, Does it adjust the covariates and GRM?
not as far as i am aware of
@changebio moving our slack discussions here.
Objective
It is non-trivial to handle all these data merger and region extraction etc for integrative analysis. I would like to revisit and extend LDtools as a utility package to handle summary stats data in the context of data integration -- including fine-mapping, LDSC, PRS, TWAS.
Relevant existing tools
sumstats
orgwas summary statistics
there are various repos on githubAs a user, I would rather use ldsc code than LDtools because at least they are better tested (by application to multiple papers and dataset). I wish LDtools had used some of these code so we don't reinvent the wheel. However, what polyfun/ldsc has is not good enough.
Also, survey each application fine-mapping, LDSC, PRS, TWAS and comment on how they handle the issue.
Proposal
Let's take a look at polyfun/ldsc and other related codes, and rework LDtools if their implementation is better. LDtools also contains many other features. Here is the scope of the proposed extended tool:
For each of these steps we should discuss what others have done and how they are not good (either feature or computational speed limitations). And explain what we did. We should also review work related to summary stats QC and possibly include those features formally -- I see it here and there in different papers but I have not seen a systematic discussion on it.
We may want to rename
LDtools
tosumstats_merger
? Unlike most applications listed, I prefer a Python library though, not a command tool. And it should focus on one thing and one thing only: merge summary stats (association effects and LD) properly. We should take a look at each of the existing tools and carefully screen them; and only investigate the features relevant to summary stats merging.@changebio depending on your interest, but if this is a high quality package with a "simple" focus, we should aim to publish it. I believe it's going to be very useful, and can even determine. However there is lot of responsibility involved even to just do this "simple" task right. That means I will have zero tolerance of sloppiness in both the scientific and the engineering aspects if we decide to take this seriously. I think the most important is to identify and summarize various pitfalls when working with summary stats; then provide solution to some of them, if not all.