Closed bschilder closed 3 years ago
Strange! Maybe try just running a single id (ebi-a-GCST005647 since it's where the error was) through import_sumstats and see if the error persists?
Any update on whether this is a HPC or MungeSumstats issue? Locally I can run:
gwas <- MungeSumstats::import_sumstats(ids = "ebi-a-GCST005647",
save_dir = "~/Downloads/",
nThread = 10,
parallel_across_ids = FALSE,
force_new = FALSE)
without an issue so I guess the issue comes from the save directory on the HPC. Maybe try and run the above just on the HPC and let me know if the issue persists.
Any update on this @bschilder?
Just tried again on HPC, still seems to be happening. Tried logging the errors but doesnt seem like anything was written before it ended:
gwas <- MungeSumstats::import_sumstats(ids = metagwas$id[1],
save_dir = "/rds/general/project/neurogenomics-lab/li
ve/GWAS_sumstats/OpenGWAS",
nThread = 8, force_new=T,
parallel_across_ids = FALSE, log_folder_ind=T, log_mungesumstats_msgs=T, log_folder="logs/")
Based on the messages, seems to be happening right after check_save_path
:
Processing 1 datasets from Open GWAS.
========== Processing dataset : finn-a-G6_ALZHEIMER ==========
Downloading VCF ==> /rds/general/ephemeral/user/bms20/ephemeral//Rtmpdjpunb/finn-a-G6_ALZHEIMER.vcf.gz
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/finn-a-G6_ALZHEIMER/finn-a-G6_ALZHEIMER.vcf.gz'
Content type 'application/gzip' length 222988378 bytes (212.7 MB)
==================================================
downloaded 212.7 MB
Downloading VCF index ==> https://gwas.mrcieu.ac.uk/files/finn-a-G6_ALZHEIMER/finn-a-G6_ALZHEIMER.vcf.gz.tbi
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/finn-a-G6_ALZHEIMER/finn-a-G6_ALZHEIMER.vcf.gz.tbi'
Content type 'application/gzip' length 1600703 bytes (1.5 MB)
==================================================
downloaded 1.5 MB
Formatted summary statistics will be saved to ==> /rds/general/project/neurogenomics-lab/li
ve/GWAS_sumstats/OpenGWAS/finn-a-G6_ALZHEIMER.tsv.gz
Log data to be saved to ==> logs/
missing value where TRUE/FALSE needed
Done with all processing in 0.4 minutes.
Okay so this runs for me locally (see below), what version of MungeSumstats are you using? Hoping that's it and it isn't a HPC issue. Maybe trying running again with the latest version of MungeSumstats (1.1.26) and see if that solves it?:
> gwas <- MungeSumstats::import_sumstats(ids = "finn-a-G6_ALZHEIMER",
+ save_dir = "~/Downloads/",
+ nThread = 8, force_new=T,
+ parallel_across_ids = FALSE,
+ log_folder_ind=T,
+ log_mungesumstats_msgs=T,
+ log_folder="~/Downloads/")
Processing 1 datasets from Open GWAS.
========== Processing dataset : finn-a-G6_ALZHEIMER ==========
Downloading VCF ==> /var/folders/hd/jm8lzp7s4dl_wlkykzhz66x80000gn/T//Rtmp6hFEUF/finn-a-G6_ALZHEIMER.vcf.gz
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/finn-a-G6_ALZHEIMER/finn-a-G6_ALZHEIMER.vcf.gz'
Content type 'application/gzip' length 222988378 bytes (212.7 MB)
==================================================
downloaded 212.7 MB
Downloading VCF index ==> https://gwas.mrcieu.ac.uk/files/finn-a-G6_ALZHEIMER/finn-a-G6_ALZHEIMER.vcf.gz.tbi
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/finn-a-G6_ALZHEIMER/finn-a-G6_ALZHEIMER.vcf.gz.tbi'
Content type 'application/gzip' length 1600703 bytes (1.5 MB)
==================================================
downloaded 1.5 MB
Formatted summary statistics will be saved to ==> /Users/alanmurphy/Downloads//finn-a-G6_ALZHEIMER.tsv.gz
Log data to be saved to ==> /Users/alanmurphy/Downloads/
Returning path to saved data.
finn-a-G6_ALZHEIMER : Done in 7.53 minutes.
Done with all processing in 7.53 minutes.
And the file:
> data.table::fread('/Users/alanmurphy/Downloads//finn-a-G6_ALZHEIMER.tsv.gz')
|--------------------------------------------------|
|==================================================|
SNP CHR BP A1 A2 INFO BETA SE LP FRQ P
1: rs556300838 1 589557 C T 0.999627 2.0439 1.0469 1.2932800 0.000373 0.05090026
2: rs28760963 1 591353 C T 0.995890 0.1759 0.4113 0.1746390 0.004110 0.66889970
3: rs368254419 1 610932 C T 0.966570 0.0036 0.1818 0.0070049 0.033430 0.98400000
4: rs778009914 1 630224 G A 0.999746 0.2949 2.7867 0.0382468 0.000254 0.91569997
5: rs1972379 1 632317 G A 0.999271 0.3242 1.5002 0.0814979 0.000729 0.82889992
---
2005582: rs9616974 22 50779526 G A 0.927690 -0.0207 0.0766 0.1038050 0.072310 0.78739926
2005583: rs9616832 22 50780959 T C 0.927610 -0.0103 0.0767 0.0488568 0.072390 0.89360008
2005584: rs9616978 22 50781891 C G 0.927700 -0.0205 0.0766 0.1030330 0.072300 0.78880018
2005585: rs115055839 22 50783303 T C 0.927690 -0.0207 0.0766 0.1038050 0.072310 0.78739926
2005586: rs9616985 22 50791377 T C 0.927780 -0.0213 0.0766 0.1072380 0.072220 0.78119958
Yes, also runs fine for me locally. The issue is only on HPC. Just trying to sort out why that is
I installed it on HPC using conda: https://anaconda.org/bioconda/bioconductor-mungesumstats
But it seems this may be out of date. I'll try updating from within R
Yeah that would be old now I guess! It will be up-to-date come the middle of the month when bioc 3.14 goes live but yeah I think just install it from GitHub in R for now!
Let me know if that works!
Based on the SessionInfo in my first post, looks like I was using MungeSumstats
v1.1.14 .
I just upgraded to the latest using devtools::install_github("neurogenomics/MungeSumstats")
and tried running with "finn-a-G6_ALZHEIMER". I also updated all its dependencies. However the issue persists.
Can you try replicating on HPC?
Current session info:
I've identified at least one issue (unsure yet if it's related). Looks like some things were changed in check_save_path
. Currently, this part will give the incorrect file suffix (.tsv) when the selected format is supposed to be .tsv.gz, because the grep function doesn't include any modifiers indicating that the suffix is at the end of the string.
I'm correcting this with the following:
suffix_match <-
vapply(suffixes, function(x) {
grepl(paste0("*",x,"$"), tolower(save_path),
ignore.case = TRUE
)
},
FUN.VALUE = logical(1) )
I think that was unrelated, but I think I've also found the relevant bug.
Within validate_parameters
:
Looks like you have some code to handle the possibility of ref_genome
being NULL in earlier lines (!is.null(convert_ref_genome) && ...
), but that seems to be missing that here. So logical(0)
is returned which throws the missing value where TRUE/FALSE needed
error.
I would use the following solution with any()
:
if (any(toupper(ref_genome) == "GRCH37") &&
!requireNamespace("SNPlocs.Hsapiens.dbSNP144.GRCh37", quietly = TRUE)) {
stop(GRCH37_msg1)
}
Might be worth combing through your code and making sure there's no other instances where this can happen.
All fixes now implemented in: https://github.com/neurogenomics/MungeSumstats/tree/bschilder_dev
I've confirmed that the bschilder_dev branch does indeed solve the problem on HPC. Also confirmed that this branch passes all CRAN and Bioc checks, with the exception of the weird Warning about version release numbers:
r$> BiocCheck::BiocCheck()
This is BiocCheck version 1.28.0. BiocCheck is a work in progress. Output and severity of issues may
change. Installing package...
API: public: http://gwas-api.mrcieu.ac.uk/
* Checking Package Dependencies...
* Checking if other packages can import this one...
* Checking to see if we understand object initialization...
* Checking for deprecated package usage...
* Checking for remote package usage...
* Checking for 'LazyData: true' usage...
* Checking version number...
* Checking version number validity...
* WARNING: y of x.y.z version should be even in release
* Checking R Version dependency...
* Checking package size...
Skipped... only checked on source tarball
* Checking individual file sizes...
* Checking biocViews...
* Checking that biocViews are present...
* Checking package type based on biocViews...
Software
* Checking for non-trivial biocViews...
* Checking that biocViews come from the same category...
* Checking biocViews validity...
* Checking for recommended biocViews...
* Checking build system compatibility...
* Checking for blank lines in DESCRIPTION...
* Checking if DESCRIPTION is well formatted...
* Checking for proper Description: field...
* Checking for whitespace in DESCRIPTION field names...
* Checking that Package field matches directory/tarball name...
* Checking for Version field...
* Checking for valid maintainer...
* Checking License: for restrictive use...
* Checking DESCRIPTION/NAMESPACE consistency...
* Checking .Rbuildignore...
* Checking vignette directory...
* Checking library calls...
* Checking for library/require of MungeSumstats...
* Checking coding practice...
* Checking parsed R code in R directory, examples, vignettes...
* Checking function lengths......................................................................................
* NOTE: Recommended function length <= 50 lines.
There are 29 functions > 50 lines.
The longest 5 functions are:
format_sumstats() (R/format_sumstats.R, line 164): 713 lines
check_no_rs_snp() (R/check_no_rs_snp.R, line 17): 312 lines
validate_parameters() (R/validate_parameters.R, line 6): 268 lines
read_vcf() (R/read_vcf.R, line 18): 219 lines
find_sumstats() (R/find_sumstats.R, line 64): 187 lines
* Checking man page documentation...
* Checking package NEWS...
* Checking unit tests...
* Checking skip_on_bioc() in tests...
* Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source, and vignette source...
* NOTE: Consider shorter lines; 218 lines (2%) are > 80 characters long.
First 6 lines:
R/check_allele_flip.R:28 allele_flip_check, allele_flip_drop, allele_flip_z,
R/check_allele_flip.R:34 # https://github.com/GenomicSEM/GenomicSEM/blob/fc8f17a817a8022d6900acf41824d27b367...
R/check_chr.R:47 name <- get_unique_name_log_file(name = name, log_files = log_files)
R/check_col_order.R:39 setdiff(seq_len(length(col_headers)), c(whichSNP, whichCHR, whichBP))
R/check_dup_bp.R:30 name <- get_unique_name_log_file(name = name, log_files = log_files)
R/check_dup_bp.R:44 paste0(check_save_out$log_folder, "/", name, check_save_out$extension)
* NOTE: Consider multiples of 4 spaces for line indents, 709 lines(7%) are not.
First 6 lines:
R/axel.R:14 output_path,
R/axel.R:15 background = FALSE,
R/axel.R:16 nThread = 1,
R/axel.R:17 force_overwrite = FALSE,
R/axel.R:18 quiet = TRUE,
R/axel.R:19 alternate = TRUE,
See http://bioconductor.org/developers/how-to/coding-style/
See styler package: https://cran.r-project.org/package=styler as described in the BiocCheck vignette.
* Checking if package already exists in CRAN...
* Checking for bioc-devel mailing list subscription...
* NOTE: Cannot determine whether maintainer is subscribed to the bioc-devel mailing list (requires
admin credentials). Subscribe here: https://stat.ethz.ch/mailman/listinfo/bioc-devel
* Checking for support site registration...
Maintainer is registered at support site.
Package name is in support site watched tags.
Summary:
ERROR count: 0
WARNING count: 1
NOTE count: 4
For detailed information about these checks, see the BiocCheck vignette, available at
https://bioconductor.org/packages/3.13/bioc/vignettes/BiocCheck/inst/doc/BiocCheck.html#interpreting-bioccheck-output
$error
character(0)
$warning
[1] "y of x.y.z version should be even in release"
$note
[1] "Recommended function length <= 50 lines."
[2] "Consider shorter lines; 218 lines (2%) are > 80 characters long."
[3] "Consider multiples of 4 spaces for line indents, 709 lines(7%) are not."
[4] "Cannot determine whether maintainer is subscribed to the bioc-devel mailing list (requires admin\ncredentials). Subscribe here: https://stat.ethz.ch/mailman/listinfo/bioc-devel"
As always, HPC is a joy to run otherwise working scripts on.
Requested the following interactive resources on HPC:
Then tried the following in
R
:The VCFs seem to download ok, but then I get the
missing value where TRUE/FALSE needed
error right after.I'm showing the output from when I had already downloaded the VCF files first, and then imported them on a 2nd run. I checked and it doesn't seem like the VCFs got screwed up or anything during download (e.g. only partially downloaded).
I ran again with set to
effect_columns_nonzero=TRUE
andeffect_columns_nonzero=FALSE
(just in case), but didn't seem to make any difference.Session info