Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Cant access the output #82

Closed soren-rand closed 2 years ago

soren-rand commented 2 years ago

Hi fellas I have been trying to wrangle and tame three summary statistics to a reference genome, and I was extremely relieved when I found my way to your tool. I am very impressed by all the ability to do liftover, annotation and harmonisation in one step. Just what we all needed in this jungle of data! 🙏🏻 I can almost now taste the satisfaction of using this tool, but unfortunately, I haven't been able to access the output data. Reading the documentation has pointed me to look in the direction of tempdir() - however, this made me neither smarter or richer in data. I think its fair to say I'm no genius in the syntax of using the tempdir, or how to redirect or recover any files ending there - and consulting my smarter colleagues didn't help either. So my journey has led me to here to GitHub. Can anyone provide me with an idiot-proof explanation of how to access my the output? I have tried

save_path = "/Users/sran/Documents/file.tsv.gz"

and

save_path= tempfile(pattern="file_harm",
         tmpdir="/Users/sran/Documents",
         fileext=".tsv.gz")

The logs are printing and seems to work 😅

Best of wishes Soren

Al-Murphy commented 2 years ago

Hey Soren, glad to hear you found the package useful!

The save_path parameter in format_sumstats() gives the location and file name of where the reformatted sumstats will be saved. Note that you don't have to use a temporary directory and you can just save it wherever you like under whatever name. When your R session is exited/restarted the temporary directory will be deleted so this may be why you are having issues locating your reformatted file? I believe the file should be saved under /Users/sran/Documents/file.tsv.gz though if you ran:

reformatted <- format_sumstats(
         path = pthToRawSumstats,
         save_path = "/Users/sran/Documents/file.tsv.gz"
     )

To test this on a small, quick example try running one of the example datasets in the package and saving the results to a specific location. One example could be:

eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
        package = "MungeSumstats"
   )

reformatted <- format_sumstats(
            path = eduAttainOkbayPth,
            ref_genome = "GRCh37",
            save_path = "~/Downloads/test.tsv.gz" #or some other location
)

Your results should be saved to ~/Downloads/test.tsv.gz.

You can also read it straight back in R to inspect with data.table using the same link:

fread("~/Downloads/test.tsv.gz")" #or some other location
        )
soren-rand commented 2 years ago

Hi Alan! Thank you for your reply. It was super useful. I ran the example you mentioned, which worked like a charm. It made me revise my code with the vignette. Now it turns out, I get a error in the log (and still no reformatted sumstats):

Writing in tabular format ==> /Users/sran/Documents/log/snp_not_found_from_chr_bp.tsv.gz
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

My input file looks like this (tab-separated):

SNP CHR BP  A2  A1  FRQ BETA    SE  P   N
NA  1   108391  G   A   0.002195    -0.0225 0.1382  0.8704  213990
rs74337086  1   115637  A   G   0.002436    0.0146  0.1341  0.9133  213990
rs76388980  1   528642  A   G   0.003417    0.2535  0.1163  0.02931 213990

And I have pretty much every option in the vignette in play:

reformatted <- format_sumstats(
  path = "/Users/sran/Documents/qc_and_harmonisation.txt",
  ref_genome = "GRCh37",
  convert_ref_genome = NULL,
  convert_small_p = F,
  compute_z = F,
  force_new_z = F,
  compute_n = 0,
  convert_n_int = F,
  INFO_filter = 0.8,
  FRQ_filter = 0.001,
  pos_se = T,
  effect_columns_nonzero = T,
  N_std = 5,
  N_dropNA = T,
  rmv_chr = remove,
  rmv_chrPrefix = F,
  on_ref_genome = T,
  strand_ambig_filter = T,
  allele_flip_check = T,
  allele_flip_drop = T,
  allele_flip_z = T,
  allele_flip_frq = T,
  bi_allelic_filter = T,
  snp_ids_are_rs_ids = T,
  remove_multi_rs_snp = T,
  frq_is_maf = T,
  sort_coordinates = T,
  nThread = 4,
  save_path = "/Users/sran/Documents/qc_harm_done.tsv.gz",
  write_vcf = F,
  tabix_index = F,
  return_data = F,
  ldsc_format = F,
  log_folder_ind = T,
  log_mungesumstats_msgs = T,
  log_folder = "/Users/sran/Documents/log",
  imputation_ind = T,
  force_new = F)

Do you have any idea where I might go wrong? Cheers lad, I appreciate the support!🙏🏻

From Copenhagen with the best wishes Soren

Al-Murphy commented 2 years ago

Ah okay that is odd, it appears the issue is occurring when writing out a log file on the SNPs that were removed when because they weren't on the reference genome and couldn't be found based on the specified chromosome and base-pair position: snp_not_found_from_chr_bp.tsv.gz. If I was to guess the reason, I think it could be to do with the NA values in the SNP column so when you try join information based on the SNP column, you get join errors.

Quick fix might be to turn off saving of these logs by setting log_folder_ind=FALSE. However I would like to know the cause so I can make a fix to the code. If these are public sumstats you're working on could you send me the link to where you downloaded them? If they aren't available online could you take a small subset of the sumstats that still gives the same error and send it here so I can debug further?

Thanks, Alan.

soren-rand commented 2 years ago

Ofcourse i'd be happy to help. the summary stats are already converted from GRCh38 -> 37, and I thought to keep the NA's for annotation in MungeSumstats. Can you send me a mail directly at sorenrand@sund.ku.dk? Thanks :-)

Al-Murphy commented 2 years ago

Closing because of inactivity, please feel free to reopen if you have a sample of the dataset for us to test