Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

How can I get N column for the subsequent ldsc analysis? #115

Closed zh-zhang1984 closed 2 years ago

zh-zhang1984 commented 2 years ago

Hi everyone, I used the following to format the VCF file for ldsc analysis. However, I found that there is no N column in the output. Can anyone help me?

datasets <- MungeSumstats::import_sumstats(
  ids = ids,
  vcf_dir = "/Users/zhangzhongheng/Documents/2022/GWAS_sepsis/raw/",
  ldsc_format = TRUE,
  save_dir = "/Users/zhangzhongheng/Documents/2022/GWAS_sepsis/clean",
  compute_z = TRUE,
  compute_n = "ldsc"
)
dd <- read_tsv("/Users/zhangzhongheng/Documents/2022/GWAS_sepsis/clean/bbj-a-46/bbj-a-46.tsv.gz")
Rows: 5770727 Columns: 13                                                                                            
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): SNP, A1, A2, FILTER
dbl (9): CHR, BP, END, FRQ, BETA, SE, LP, P, Z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> dd
# A tibble: 5,770,727 × 13
   SNP           CHR     BP A1    A2       END FILTER    FRQ     BETA      SE    LP     P      Z
   <chr>       <dbl>  <dbl> <chr> <chr>  <dbl> <chr>   <dbl>    <dbl>   <dbl> <dbl> <dbl>  <dbl>
 1 rs28527770      1 751756 T     C     751756 PASS   0.152   0.00149 0.00536 0.107 0.781  0.278
 2 rs3094315       1 752566 G     A     752566 PASS   0.844  -0.00145 0.00520 0.107 0.781 -0.278
 3 rs3115860       1 753405 C     A     753405 PASS   0.838  -0.00156 0.00559 0.108 0.780 -0.280
 4 rs117086422     1 845635 C     T     845635 PASS   0.140   0.00575 0.00582 0.490 0.324  0.987
 5 rs28612348      1 846078 C     T     846078 PASS   0.142   0.00581 0.00583 0.496 0.319  0.997
 6 rs4475691       1 846808 C     T     846808 PASS   0.141   0.00524 0.00541 0.478 0.333  0.969
 7 rs950122        1 846864 G     C     846864 PASS   0.141   0.00525 0.00542 0.478 0.333  0.969
 8 rs3905286       1 847228 C     T     847228 PASS   0.139   0.00516 0.00548 0.461 0.346  0.942
 9 rs28407778      1 847491 G     A     847491 PASS   0.139   0.00514 0.00548 0.459 0.348  0.939
10 rs79932038      1 847983 C     T     847983 PASS   0.0318 -0.00544 0.0122  0.184 0.655 -0.447
# … with 5,770,717 more rows

Al-Murphy commented 2 years ago

Hey,

I believe that when MungeSumstats (MSS) tried to calculate N for your sumstats, you would have gotten the warning below:

WARNING: Neff column could not be calculated as the columns N_CAS & N_CON were not found in the dataset

This is because case and control N values per SNP are needed to calculate N this way. Since this data isn't available, N can't be imputed. I think your best bet is to contact the authors of the GWAS to get, at the very least, a population-level N to add to the data so you can run ldsc (an N value per SNP would be better).
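
To illustrate why per-SNP case and control counts are needed, here is a minimal base R sketch. The counts below are made up for illustration and are not taken from this dataset. The effective sample size formula shown, 4 / (1/N_CAS + 1/N_CON), is the one commonly attributed to METAL; it is an assumption here, and the exact formula MungeSumstats applies for each compute_n option should be checked against the package documentation.

```r
# Hypothetical per-SNP case/control counts (NOT from the dataset above).
sumstats <- data.frame(
  SNP   = c("rs1", "rs2", "rs3"),
  N_CAS = c(1000, 1200, 800),
  N_CON = c(4000, 3800, 4200)
)

# Total sample size per SNP: cases plus controls.
sumstats$N <- sumstats$N_CAS + sumstats$N_CON

# Effective sample size per SNP, METAL-style (assumed formula):
# 4 / (1/N_CAS + 1/N_CON)
sumstats$Neff <- 4 / (1 / sumstats$N_CAS + 1 / sumstats$N_CON)

print(sumstats)
```

Without N_CAS and N_CON, neither column can be derived, which is why the warning above appears.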

zh-zhang1984 commented 2 years ago

Thank you for the hints. Is there a way to incorporate this N into the function pipeline? Suppose I get the N from the literature or the authors and want to include it in MungeSumstats::import_sumstats.

Al-Murphy commented 2 years ago

Yep, you can use the compute_n parameter: set it to the integer value of N and the column will be created. As I mentioned, though, this should be a last resort, since it isn't necessarily the true N for each SNP and you will lose precision. See the parameter documentation:

@param compute_n Whether to impute N. Default of 0 won't impute, any other integer will be imputed as the N (sample size) for every SNP in the dataset. **Note** that imputing the sample size for every SNP is not correct and should only be done as a last resort. N can also be inputted with "ldsc", "sum", "giant" or "metal" by passing one of these for this field or a vector of multiple. Sum and an integer value creates an N column in the output whereas giant, metal or ldsc create an Neff or effective sample size. If multiples are passed, the formula used to derive it will be indicated.
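
As a rough sketch of what that could look like, the call from the original post can be repeated with compute_n set to an integer instead of "ldsc". The wrapper name munge_with_fixed_n and the argument n_total are hypothetical, introduced only for illustration; the other arguments mirror the original call. The value 100000 in the commented example is a placeholder, not the real sample size of this GWAS.

```r
# Sketch only: wraps the import_sumstats() call from the original post,
# passing one fixed sample size for every SNP (last resort, see note above).
# `munge_with_fixed_n` and `n_total` are hypothetical names.
munge_with_fixed_n <- function(ids, vcf_dir, save_dir, n_total) {
  # n_total must be a single positive whole number.
  stopifnot(
    is.numeric(n_total), length(n_total) == 1,
    n_total > 0, n_total == round(n_total)
  )
  MungeSumstats::import_sumstats(
    ids = ids,
    vcf_dir = vcf_dir,
    save_dir = save_dir,
    ldsc_format = TRUE,
    compute_z = TRUE,
    compute_n = as.integer(n_total)  # same N imputed for every SNP
  )
}

# Example call (not run); 100000 is a placeholder value:
# datasets <- munge_with_fixed_n(ids, "raw/", "clean/", n_total = 100000)
```

After the run, it is worth re-reading the saved file and confirming that an N column is present before passing it to ldsc.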
Al-Murphy commented 2 years ago

Closing as I believe your question has been answered. Feel free to reopen if not.

Thanks, Alan.