bulik / ldsc

LD Score Regression (LDSC)
GNU General Public License v3.0

'Could not find a signed summary statistic column.') #405

Open yuupei opened 12 months ago

yuupei commented 12 months ago

Hi, I am new here.

When I try to run my data, I encounter this error every time:

ERROR converting summary statistics:

Traceback (most recent call last):
  File "./munge_sumstats.py", line 611, in munge_sumstats
    'Could not find a signed summary statistic column.')
ValueError: Could not find a signed summary statistic column.

Conversion finished at Thu Sep 28 07:30:17 2023
Total time elapsed: 0.0s
Traceback (most recent call last):
  File "./munge_sumstats.py", line 745, in <module>
    munge_sumstats(parser.parse_args(), p=True)
  File "./munge_sumstats.py", line 611, in munge_sumstats
    'Could not find a signed summary statistic column.')
ValueError: Could not find a signed summary statistic column.

What does this mean, and what should I do?

Thank you

Sabor117 commented 8 months ago

Hi there Yuupei,

Did you ever solve this particular issue with LDSC? I have been encountering it myself now and am a bit stuck with troubleshooting.

Googling this particular error provided a few different results, but none of them seemed to be relevant to my own issue, so I would also appreciate some help with this.

Here is my log file:

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./munge_sumstats.py \
--out /scratch/project_2007428/projects/prj_001_cost_gwas/processing/ldsc_intermediate_files//UKB_ALL_ALL_ldsc_input_ALL_ALL_ldsc_munged \
--merge-alleles /scratch/project_2007428/users/Zhiyu/Tool/ldsc/Ref/w_hm3.snplist \
--sumstats /scratch/project_2007428/projects/prj_001_cost_gwas/processing/ldsc_intermediate_files/UKB_ALL_ALL_ldsc_input.txt.gz 

ERROR converting summary statistics:

Traceback (most recent call last):
  File "/projappl/project_2007428/software/ldsc/munge_sumstats.py", line 611, in munge_sumstats
    'Could not find a signed summary statistic column.')
ValueError: Could not find a signed summary statistic column.

Conversion finished at Tue Jan 16 18:39:59 2024
Total time elapsed: 0.0s

And here is the header of my input file (in R):

> head(sumstats)
      rsid a1 a0      n         p       beta1
  rs687513  A  G 212765 0.3625225 -0.00456243
 rs6577165  T  A 212765 0.3173474  0.41311700
 rs7529831  A  C 212765 0.8126003 -0.02278910
 rs6577221  T  C 212765 0.6074954  0.04990290
rs12733701  G  A 212765 0.6059015 -0.01776350
rs17124137  A  C 212765 0.1950065  0.13357700
> summary(sumstats)
     rsid                a1                 a0                  n         
 Length:1230617     Length:1230617     Length:1230617     Min.   :212765  
 Class :character   Class :character   Class :character   1st Qu.:212765  
 Mode  :character   Mode  :character   Mode  :character   Median :212765  
                                                          Mean   :212765  
                                                          3rd Qu.:212765  
                                                          Max.   :212765  
       p              beta1           
 Min.   :0.0000   Min.   :-1.8848700  
 1st Qu.:0.2041   1st Qu.:-0.0036829  
 Median :0.4561   Median : 0.0000183  
 Mean   :0.4681   Mean   : 0.0000265  
 3rd Qu.:0.7249   3rd Qu.: 0.0037395  
 Max.   :1.0000   Max.   : 1.4071600  

As far as I can tell, other reports of this error attribute it to one of the following: the file not being whitespace-delimited (mine is tab-delimited); the arguments being parsed incorrectly (but LDSC itself seems to be running correctly); NAs or NaNs in the file (as you can see, there are none); or a mismatch between rsIDs and data (but I don't see how that could be happening either).

Essentially, as far as I'm aware my input matches the requirements for munge_sumstats, but it's still not quite working. Any help would be appreciated!

DavisCammann commented 8 months ago

Hello,

The munge_sumstats.py script keeps several lists of column header names that are commonly used in GWAS summary statistics files. Lines 85-96 of the script show which headers are accepted for the signed effect column (Z-score, beta, or odds ratio) of your GWAS.

    # SIGNED STATISTICS
    'ZSCORE': 'Z',
    'Z-SCORE': 'Z',
    'GC_ZSCORE': 'Z',
    'Z': 'Z',
    'OR': 'OR',
    'B': 'BETA',
    'BETA': 'BETA',
    'LOG_ODDS': 'LOG_ODDS',
    'EFFECTS': 'BETA',
    'EFFECT': 'BETA',
    'SIGNED_SUMSTAT': 'SIGNED_SUMSTAT',

If your header doesn't match any of these, you need to specify the name of your effect size column with the --signed-sumstats argument. It takes the column name followed by a comma and the null value of the statistic (0 for a beta or Z-score, 1 for an odds ratio). There are similar arguments for naming other columns, such as --snp for your rsID column. For example, in @Sabor117's case, the argument would look like this: --signed-sumstats beta1,0
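To illustrate why a header like beta1 is not picked up automatically, here is a minimal sketch of a header lookup against the synonyms quoted above. It is not the actual munge_sumstats.py code: the function name find_signed_columns and its extra parameter are hypothetical, and the real script normalizes headers in its own way.

```python
# Header synonyms copied from the excerpt quoted above.
SIGNED_HEADERS = {
    'ZSCORE': 'Z',
    'Z-SCORE': 'Z',
    'GC_ZSCORE': 'Z',
    'Z': 'Z',
    'OR': 'OR',
    'B': 'BETA',
    'BETA': 'BETA',
    'LOG_ODDS': 'LOG_ODDS',
    'EFFECTS': 'BETA',
    'EFFECT': 'BETA',
    'SIGNED_SUMSTAT': 'SIGNED_SUMSTAT',
}


def find_signed_columns(header, extra=None):
    """Map each recognized signed-statistic column to its canonical name.

    `extra` (hypothetical) stands in for a column named explicitly by the
    user; it is an assumption of this sketch, not an LDSC parameter.
    """
    known = dict(SIGNED_HEADERS)
    if extra is not None:
        known[extra.upper()] = 'SIGNED_SUMSTAT'
    # Compare case-insensitively by upper-casing each header.
    return {col: known[col.upper()] for col in header if col.upper() in known}


header = ['rsid', 'a1', 'a0', 'n', 'p', 'beta1']
print(find_signed_columns(header))                 # {}
print(find_signed_columns(header, extra='beta1'))  # {'beta1': 'SIGNED_SUMSTAT'}
```

With the header from the file above, no column matches until the effect column is named explicitly, which is the situation that produces the "Could not find a signed summary statistic column" error.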

Sabor117 commented 8 months ago

Hi there!

Thanks for getting back to this question. I can confirm that the issue on my end was that I had not correctly specified the --signed-sumstats column (it was not initially clear to me that this refers to the "effect size" of the given allele).

Since making my post last week, I adjusted the summary stats and my LDSC code and it worked with the following:

./munge_sumstats.py \
--signed-sumstats zscore1,0 \
--out /scratch/project_2007428/projects/prj_001_cost_gwas/processing/ldsc_intermediate_files//UKB_ALL_ALL_ldsc_munged \
--merge-alleles /scratch/project_2007428/users/Zhiyu/Tool/ldsc/Ref/w_hm3.snplist \
--a1-inc  \
--N-col n \
--a1 a1 \
--a2 a0 \
--snp rsid \
--sumstats /scratch/project_2007428/projects/prj_001_cost_gwas/processing/ldsc_intermediate_files/UKB_ALL_ALL_ldsc_input.txt.gz \
--p p 

Note: I converted my effect sizes from betas to Z-scores (beta1 became zscore1), as the LDSC documentation seemed to suggest that it prefers Z-scores or ORs to betas. I also included the --a1-inc flag because my Z-scores always refer to A1 (which I hope is the correct usage).

Thanks again for the response here!
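The thread does not say how the betas were converted to Z-scores. When a file has only a two-sided p-value and a signed effect, as the input above does, one common approach is to take the magnitude from the p-value and the sign from the effect relative to its null value. The sketch below assumes that approach; the function name p_to_signed_z is hypothetical, and this is not necessarily what was done here.

```python
import math
from statistics import NormalDist


def p_to_signed_z(p, effect, null_value=0.0):
    """Convert a two-sided p-value and a signed effect into a signed Z-score.

    The magnitude is the standard normal quantile for p/2; the sign is the
    sign of (effect - null_value). Use null_value=0 for a beta or Z-score
    and null_value=1 for an odds ratio.
    """
    if not 0.0 < p <= 1.0:
        # p == 0 would give an infinite Z-score, so it is rejected here.
        raise ValueError('p must be in the interval (0, 1]')
    # inv_cdf(p / 2) is the lower-tail quantile (<= 0); take its magnitude.
    magnitude = abs(NormalDist().inv_cdf(p / 2.0))
    # An effect exactly at the null gets a positive sign by convention here.
    return math.copysign(magnitude, effect - null_value)


# First row of the input file shown earlier in the thread.
print(p_to_signed_z(0.3625225, -0.00456243))
```

Rows with a p-value of exactly 0 cannot be converted this way and would need separate handling, for example by computing Z directly from beta and its standard error when those are available.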