Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 15 forks source link

Option to specify effect allele as A1 instead. #160

Closed svenkatesh25 closed 2 months ago

svenkatesh25 commented 1 year ago

Can an option be added to the main function to consider A1 as the effect allele instead? I have attempted to this manually, but it is inelegant without manipulating scripts contained within the MungeSumstats library.

Al-Murphy commented 1 year ago

Hey!

The easiest way to do this is just to rename A1 and A2 after MSS outputs the formatted file. Two things on doing this in MSS:

  1. MSS uses reference datasets for numerous of it's checks (for example check for correct effect column directions) which all depend on A1 being the reference and A2 the alternative allele. So I could no incorporate such a change fully to the code.
  2. I don't agree about making the change anyway - the point of MSS was to help standardise the formatting of summary statistics including the meaning of A1 and A2. As described in our manuscript we chose to standardise A1 as the ref and A2 as the alt in line with other large scale standardisation efforts like IEU Open GWAS. So based on all this, I would want to add functionality to MSS to allow this as then formatted sumstats from MSS won't even correspond to one another

Also it is worth noting that a lot of downstream tools expect A1 to be the ref and A2 the alt like LDSC so I would say to really consider whether this is a route you want to take.

Hope this makes sense and you understand my point of view? I'm happy to chat about it more if you like.

Cheers, Alan.

zichengwang98 commented 10 months ago

I encountered a similar question about the definition of A1 and A2. In LDSC, A1 is defined as the effect allele and A2 is defined as the non-effect allele (see https://github.com/bulik/ldsc/issues/133 or ). However, in MungeSumstats, A2 is defined as the effect allele. Should I flip the column names of A1 and A2 before I use the munged summary statistics for LDSC?

Al-Murphy commented 10 months ago

Hi this was already answered here but see below for my explanation:

For LDSC, the required input format for a sumstats describe A1 as the effect columns and A2 as the other https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format. The term effect allele is confusing here since we, with MSS, mean effect allele to be the alt allele but they mean it to be the ref. You can see this if you look in their munge_sumstats.py file - https://github.com/bulik/ldsc/blob/master/munge_sumstats.py which shows other names for A1, one of which being reference. Thus A1 and A2 are the same for LDSC as we use in MSS.

Hope this clears things up for you!

Alan.

zichengwang98 commented 10 months ago

Thank you for your detailed response! I checked the source code of munge_sumstats.py, and indeed they specified A1=EFFECT_ALLELE=REFERENCE_ALLELE. In the same script, they also included the description: 'BETA': '[linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)'.

My understanding is that LDSC assumes effect allele is the reference allele, and beta measures the effect of reference allele (A1). MungeSumstats assumes effect allele is the alternative allele, and beta measures the effect of alternative allele (A2). Thus, if I want to use the output of MungeSumstats as the input of LDSC, I have to multiply the beta and Z columns by -1 . If I don't care about the true "reference allele" in the reference panel, I can flip the column names of A1 and A2 (the estimates from LDSC are valid as long as beta measures the effect of A1). If I do nothing, the sign of rg would be the opposite but h2 would be the same.

Could you let me know which part is wrong? Thanks a lot for your time and patience!

Al-Murphy commented 10 months ago

Hi,

I see what you mean but I think this is going to have to be tested with running LDSC where we know there is an association and using the same sumstats just with A1 as the major and the minor allele and see which gives the correct result. I thought it would be easiest to do this with an example from the LDSC (sLDSC?) paper but I'm struggling to find any examples that aren't now behind a pay wall. I'll keep looking but let me know if you have any either?

zichengwang98 commented 10 months ago

Unfortunately, I don't have the GWAS summary statistics for SCZ and BIP that were used for the tutorial. I downloaded some munged summary statistics released by the same group instead. The columns in PASS_Alzheimers_Jansen2019.sumstats.gz (A1, A2, Z) are consistent with the columns in the original GWAS. According to the readme file, A1 is defined as the effect allele.

Please feel free to check whether other LDSC-format summary statistics were munged in the same way. I'm not very sure about this situation, and many people have the same question (see https://github.com/bulik/ldsc/issues/375).

My real concern is that for GWAS with column names EFFECT_ALLELE and OTHER_ALLELE (which is quite common), MungeSumstats would define A2 as the effect allele, and munge_sumstats.py from LDSC would define A1 as the effect allele (we know that from the script). Thus, it might be better to make sure that every GWAS is munged by the same software when using LDSC.

NathanSkene commented 10 months ago

If there is a clear difference from how munge_sumstats.py interprets A1 / A2, then we do need to deal with this. Thanks for pointing this out @zichengwang98. LDSC is one of the major use cases for MungeSumstats, so we need to be sure it's working correctly for it. Thoughts on options for dealing with this?

Al-Murphy commented 10 months ago

It's an easy fix to deal with but I'm not convinced MSS approach is wrong yet so I need to test running LDSC with both formats (i.e. with effect columns realting to 1. A1 and 2. A2). I think people have differing meanings for effect/major/minor/etc alleles so I want to make surethe format is right before changing anything. However, the issue is the google bucket where you download data to rerun the results from the LDSC paper appears to be down:

$ gsutil -u engaged-style-308012 cp gs://broad-alkesgroup-public-requester-pays/LDSCORE/LDSC_SEG_ldscores/Cahoy_1000Gv3_ldscores.tgz .
Copying gs://broad-alkesgroup-public-requester-pays/LDSCORE/LDSC_SEG_ldscores/Cahoy_1000Gv3_ldscores.tgz...
AccessDeniedException: 403 HttpError accessing <https://storage.googleapis.com/download/storage/v1/b/broad-alkesgroup-public-requester-pays/o/LDSCORE%2FLDSC_SEG_ldscores%2FCahoy_1000Gv3_ldscores.tgz?generation=1679508265270442&alt=media&userProject=engaged-style-308012>: response: <{'x-guploader-uploadid': 'ABPtcPpLkVONBmMaqeI6FvEVzlvcjpCV3dufC0hvz4bDb7lOvRMp0puGCi7Vz5_pXamCUNNMD6s', 'content-type': 'text/html; charset=UTF-8', 'date': 'Fri, 12 Jan 2024 07:46:27 GMT', 'vary': 'Origin, X-Origin', 'expires': 'Fri, 12 Jan 2024 07:46:27 GMT', 'cache-control': 'private, max-age=0', 'content-length': '70', 'server': 'UploadServer', 'alt-svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000', 'status': '403'}>, content <The billing account for the owning project is disabled in state closed>

Key error message: The billing account for the owning project is disabled in state closed

Others seem to have noted this too, not sure of another way around this but I'll keep working on it.

zichengwang98 commented 10 months ago

Thank you, Alan and Nathan, for your time and effort! I found this email from the LDSC user group. Raymond Walters from the Neale lab confirmed that in LDSC, A1 should be the effect allele/minor allele. I hope this information helps.

Best, Zicheng

Al-Murphy commented 9 months ago

Hi,

Thanks for this, I will update MSS when using LDSC format so that A1 and A2 are flipped to match this. I'll comment here when the update is done. However, I don't think this will have an effect on people's analysis, I ran a test on the cell type specific tutorial from LDSC with no MSS formatting, the current MSS formatting and flipping A1 and A2 and found the result was the exact same:

**NO MSS**                              
Name    Coefficient     Coefficient_std_error   Coefficient_P_value                             
Neuron  1.33021438504885e-11    2.8486851192986768e-09  0.498137116566384                               
Oligodendrocyte -2.3708481905287363e-09 2.2878529147566418e-09  0.8499634147751827                              
Astrocyte       -2.586495108996289e-09  2.3816110423640045e-09  0.8612665892065984                              

**MSS**                             
Name    Coefficient     Coefficient_std_error   Coefficient_P_value                             
Neuron  1.33021438504885e-11    2.8486851192986768e-09  0.498137116566384                               
Oligodendrocyte -2.3708481905287363e-09 2.2878529147566418e-09  0.8499634147751827                              
Astrocyte       -2.586495108996289e-09  2.3816110423640045e-09  0.8612665892065984                              

**FLIPPED**                             
Name    Coefficient     Coefficient_std_error   Coefficient_P_value                             
Neuron  1.33021438504885e-11    2.8486851192986768e-09  0.498137116566384                               
Oligodendrocyte -2.3708481905287363e-09 2.2878529147566418e-09  0.8499634147751827                              
Astrocyte       -2.586495108996289e-09  2.3816110423640045e-09  0.8612665892065984

I guess this may be because, although we name A1 and A2 the opposite way to LDSC (our A1 is the reference genome, i.e. not the effect allele whereas A1 for LDSC is the effect allele, the non-reference genome), the effect column (Z) is the same direction. If you notice any differences @zichengwang98, please do let me know though.

Al-Murphy commented 9 months ago

Updated in v1.11.3 available from github master branch now or from Bioconductor devel in a few days

zichengwang98 commented 9 months ago

Hi Alan,

I do not expect to observe a difference in stratified-LDSC because the enrichment coefficient τ is not related to the sign of the Z score in the summary statistics. It measures whether we observe enrichment (higher than expected partitioned h2) or depletion (lower than expected partitioned h2) in the tissue type. As Raymond explained in the email, flipped signed-statistics would not have an impact on the h2 estimation.

To show the consequences of the flipped signed statistics, you would need to perform the genetic correlation (rg) analysis, following this tutorial. Reversing your A1/A2 in one of the traits will flip the sign of the correlation. If they are reversed in both traits, then rg would be back to normal.

Since the original SCZ and BIP summary statistics are not available, here's what you can try:

  1. Find a pair of genetically correlated traits.
  2. Munge both traits with munge_sumstats.py from LDSC (ideally without specifying any flags), and calculate the rg between the traits.
  3. Munge one of the traits with MSS (ideally without specifying any flags), then re-calculate the rg_MSS.

I would expect to obtain an rg_MSS with flipped sign from step 3. The magnitude and p-value of rg_MSS should be the same.

Best, Zicheng

Al-Murphy commented 9 months ago

That makes sense, it would be great if you could try this with both the old and new MSS LDSC approaches to show the difference @zichengwang98?

Thanks, Alan.

Al-Murphy commented 2 months ago

closing because of inactivity, @zichengwang98 if you get back to testing this, please reopen. However, I still believe the current specification of A1/A2 matches LDSC, see here.

XuanxiaoLi commented 2 days ago

I encountered a similar question about the definition of A1 and A2. In LDSC, A1 is defined as the effect allele and A2 is defined as the non-effect allele (see bulik/ldsc#133 or ). However, in MungeSumstats, A2 is defined as the effect allele. Should I flip the column names of A1 and A2 before I use the munged summary statistics for LDSC? I have a question: it was mentioned here that "in MungeSumstats, A2 is defined as the effect allele." How is this determined? In Munge_sumstats.py, I found explanations regarding A1 and A2, and here A2 is defined as the non-effect allele, which makes me a bit confused.

截屏2024-11-07 09 22 40