Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

Take name from input data and use it to save the logs #131

Closed AMCalejandro closed 1 year ago

AMCalejandro commented 1 year ago

Is your feature request related to a problem? Please describe. When I set log_folder together with log_mungesumstats_msgs and log_folder_ind, it would be nice if format_sumstats() took the name from the input data and used it to name the saved logs.

Describe the solution you'd like I am running MungeSumstats on a batch of GWAS from a small script in the background, and I would like to be able to see the logs for all of them.

A code example would be:

purrr::map2(other_gwas$file_contents,
            names(other_gwas$file_contents), function(gwas, gwasname) {

  # Output path: <input file name without extension>_ldscQCed.tsv
  formatted_path = paste0(path_othergwas_qc,
                          tools::file_path_sans_ext(gwasname),
                          "_ldscQCed.tsv") # Match the other formats # ldscQCed.tsv
  print(formatted_path)

  check <-
    MungeSumstats::format_sumstats(path = gwas,
                                 ref_genome="GRCh37",
                                 snp_ids_are_rs_ids=FALSE,
                                 bi_allelic_filter = FALSE,
                                 allele_flip_frq = FALSE,
                                 convert_n_int = TRUE,
                                 impute_beta = TRUE,

                                 #return_data = TRUE,
                                 #return_format = "data.table",

                                 save_format='LDSC',
                                 nThread = 30,
                                 save_path = formatted_path,
                                 force_new = TRUE,
                                 INFO_filter = 0.8,

                                 # Every iteration logs to the same folder
                                 log_folder = path_othergwas_qc,
                                 log_mungesumstats_msgs = TRUE,
                                 log_folder_ind = TRUE)
})

The input data

> str(other_gwas)
tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
 $ filespath    : chr [1:2] "/home/rstudio/cellType_progGWAS/data/other_gwas/raw_gwas/BMI_LockeUKBiobank2018.txt" "/home/rstudio/cellType_progGWAS/data/other_gwas/raw_gwas/clozuk_pgc2.meta.sumstats.txt"
 $ file_contents:List of 2
  ..$ BMI_LockeUKBiobank2018.txt   : tibble [2,336,269 × 10] (S3: tbl_df/tbl/data.frame)
  .. ..$ CHR : num [1:2336269] 7 12 4 4 4 4 3 4 4 3 ...
  .. ..$ POS : num [1:2336269] 9.24e+07 1.27e+08 2.16e+07 1.36e+06 3.72e+07 ...
  .. ..$ SNP : chr [1:2336269] "rs10" "rs1000000" "rs10000010" "rs10000012" ...
  .. ..$ A1  : chr [1:2336269] "A" "A" "T" "C" ...
  .. ..$ A2  : chr [1:2336269] "C" "G" "C" "G" ...
  .. ..$ Freq: num [1:2336269] 0.0643 0.2219 0.5086 0.8634 0.7708 ...
  .. ..$ BETA: num [1:2336269] 0.0013 0.0001 -0.0001 0.0047 -0.0061 0.0041 -0.0055 -0.0047 -0.0013 0.0029 ...
  .. ..$ SE  : num [1:2336269] 0.0042 0.0021 0.0016 0.0025 0.0021 0.0021 0.0017 0.0018 0.0023 0.0024 ...
  .. ..$ P   : num [1:2336269] 0.75 0.96 0.94 0.057 0.0033 0.048 0.0013 0.0072 0.57 0.23 ...
  .. ..$ N   : num [1:2336269] 598895 689928 785319 692463 687856 ...
  ..$ clozuk_pgc2.meta.sumstats.txt: tibble [8,171,061 × 10] (S3: tbl_df/tbl/data.frame)
  .. ..$ SNP : chr [1:8171061] "10:100968448:T:AA" "10:101574552:A:ATG" "10:10222597:AT:A" "10:102244152:A:AG" ...
  .. ..$ Freq: num [1:8171061] 0.352 0.449 0 0.201 0.19 ...
  .. ..$ CHR : num [1:8171061] 10 10 10 10 10 10 10 10 10 10 ...
  .. ..$ BP  : num [1:8171061] 1.01e+08 1.02e+08 1.02e+07 1.02e+08 1.02e+08 ...
  .. ..$ A1  : chr [1:8171061] "t" "a" "a" "a" ...
  .. ..$ A2  : chr [1:8171061] "aa" "atg" "at" "ag" ...
  .. ..$ OR  : num [1:8171061] 1.002 0.989 1 0.997 0.993 ...
  .. ..$ SE  : num [1:8171061] 0.01 0.0097 0.01 0.0114 0.0114 0.0101 0.0104 0.0116 0.0112 0.0104 ...
  .. ..$ P   : num [1:8171061] 0.812 0.259 0.978 0.77 0.543 ...
  .. ..$ N   : num [1:8171061] 105318 105318 105318 105318 105318 ...
 $ makenames    : chr [1:2] "BMI_LockeUKBiobank2018.txt" "clozuk_pgc2.meta.sumstats.txt"

In this context, the log files are overwritten at every step of the mapping, which is unfortunate.

Thanks

Al-Murphy commented 1 year ago

Not sure I understand what exactly the ask is here, could you try explaining in a different way? Or even better make a Pull Request with the change implemented? I'm not sure when I would have time to implement an enhancement change like this myself.

AMCalejandro commented 1 year ago

What I am saying is that the logs produced by the log_mungesumstats_msgs and log_folder_ind arguments of format_sumstats() have hard-coded names.

It would be nice if the function took the file name from the input (similar to what the save_path argument does) and wrote the logs using the same string as save_path (without the file extension) as the prefix of the log file names.

Otherwise, the files get overwritten if you iteratively format some sumstats
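
Until the log names can be made unique, one possible workaround is to give each dataset its own log directory so that hard-coded file names never collide. The sketch below is only an illustration of that idea: `per_dataset_log_folder` is a hypothetical helper, not part of MungeSumstats, and the base directory is a temporary folder chosen for the example.

```r
# Hypothetical helper (not part of MungeSumstats): build one log
# directory per dataset from the dataset's file name.
per_dataset_log_folder <- function(base_log_folder, gwas_name) {
  file.path(base_log_folder, tools::file_path_sans_ext(gwas_name))
}

gwas_names <- c("BMI_LockeUKBiobank2018.txt",
                "clozuk_pgc2.meta.sumstats.txt")

# Assumed base directory for the example only.
base_dir <- file.path(tempdir(), "munge_logs")

log_dirs <- vapply(gwas_names, per_dataset_log_folder,
                   character(1), base_log_folder = base_dir)

# Create the directories up front; one per dataset.
for (d in log_dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

print(basename(log_dirs))
```

Inside the mapping function, the result would then be passed as `log_folder = per_dataset_log_folder(path_othergwas_qc, gwasname)` in place of the shared folder.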

Al-Murphy commented 1 year ago

Yep that makes sense, thanks! I'll look to add this in the next development cycle.

Cheers, Alan.

Al-Murphy commented 1 year ago

Hey!

So I got some time to have a look at this, and there are just two things. Firstly, you can control where the log messages are stored using the log_folder parameter. The documentation states:

Filepath to the directory for the log files and the log of MungeSumstats messages to be stored. Default is a temporary directory.

I think this should remain separate from the save_path parameter, as I know of users who like to store these separately. However, I agree the hard-coded names might not be the best solution. So I have updated the code (v1.7.1) so that the log files (log messages and log outputs) take the name of the file specified in the save_path parameter, with the suffixes '_log_msg.txt' and '_log_output.txt' respectively. I think taking the name from the file name in save_path makes more sense than taking it from the input path (remember that sumstats held in memory can also be passed as input, not just paths).
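
To make the naming scheme concrete, here is a minimal sketch of how the log file names could be derived from a save path. It is an illustration only, not the package's actual implementation: `log_names_from_save_path` is a hypothetical function, and how MungeSumstats treats compound extensions such as `.tsv.gz` is not covered here.

```r
# Hypothetical illustration (not MungeSumstats code): derive the two
# log file names from the file name given in save_path.
log_names_from_save_path <- function(save_path, log_folder) {
  # Drop the directory and the last file extension.
  stem <- tools::file_path_sans_ext(basename(save_path))
  list(
    msg    = file.path(log_folder, paste0(stem, "_log_msg.txt")),
    output = file.path(log_folder, paste0(stem, "_log_output.txt"))
  )
}

# Example paths are assumptions made up for this sketch.
logs <- log_names_from_save_path(
  save_path  = "qc/BMI_LockeUKBiobank2018_ldscQCed.tsv",
  log_folder = "logs"
)
print(logs$msg)
print(logs$output)
```

Because the stem comes from save_path, two runs only share log names if they also share an output file name, which is what prevents the overwriting described earlier in the thread.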

Let me know if this doesn't answer the issues you were noting?

Cheers, Alan.

AMCalejandro commented 1 year ago

Yeah that is good, I will test it shortly. Thanks Alan.