Closed bschilder closed 2 years ago
One place to start might be adding ".vcf.tsv", ".vcf.tsv.gz", and ".vcf.tsv.bgz" as additional possibilities in MungeSumstats:::supported_suffixes().
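A minimal sketch of what suffix-based matching could look like once the compound ".vcf.tsv*" endings are included. The names candidate_suffixes and match_suffix are hypothetical and written only for illustration; the real list lives in MungeSumstats:::supported_suffixes() and may differ in both content and how it is used.

```r
# Hypothetical stand-in for the package's suffix list (an assumption, not the
# actual return value of MungeSumstats:::supported_suffixes()).
candidate_suffixes <- c(
  ".tsv", ".tsv.gz", ".vcf", ".vcf.gz", ".vcf.bgz",
  ".vcf.tsv", ".vcf.tsv.gz", ".vcf.tsv.bgz"
)

# Return the longest suffix that the file name ends with, or NA if none match.
# Preferring the longest match keeps "x.vcf.tsv.gz" from being reported as a
# plain ".tsv.gz" file.
match_suffix <- function(path, suffixes = candidate_suffixes) {
  hits <- suffixes[endsWith(tolower(path), suffixes)]
  if (length(hits) == 0L) {
    return(NA_character_)
  }
  hits[which.max(nchar(hits))]
}

match_suffix("pgc_sumstats.vcf.tsv.gz")
```

Picking the longest match is one possible design choice; it makes the compound extension win over the shorter ".tsv.gz" ending that the same file name also satisfies.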
There seems to be an issue reading the header of this VCF (probably to do with the format?) with VariantAnnotation:
VariantAnnotation::scanVcfHeader(path)
Error in scanBcfHeader(bf) : [internal] _hts_rewind() failed
Other than that, it will run once I add #CHROM to the mapping file. So my solution is to add an error catch around VariantAnnotation::scanVcfHeader(path) in read_header() and use your other approach:
header <- readLines(path, n = 100)
i <- which(startsWith(header, "#CHR"))
header <- data.table::fread(text = header[seq(i, i + n)],
nThread = 1)
This fallback would be used for VCFs if the VariantAnnotation call fails. Let me know if there are any downstream issues with not using the VariantAnnotation approach. I'll add these fixes to the solution I have for indels, which will be pushed to the master branch (not the current release), so use the GitHub version for this fix until it's released in late April.
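The error-catch idea above can be sketched roughly as follows. This is not the package's actual read_header() code: the function names read_header_fallback and read_vcf_header_safe are hypothetical, and n (the number of data rows to preview) is assumed to be supplied by the caller. Note that the two paths return different object types (a VCFHeader from VariantAnnotation versus a data.table from the fallback), so downstream code would need to handle both.

```r
# Hypothetical fallback: find the "#CHROM" line by hand and parse the column
# names plus the first n data rows with data.table::fread().
read_header_fallback <- function(path, n = 2L) {
  lines <- readLines(path, n = 100)
  i <- which(startsWith(lines, "#CHR"))[1]
  if (is.na(i)) {
    stop("No '#CHROM' line found in the first 100 lines of ", path)
  }
  # Guard against files with fewer than n rows after the header line.
  end <- min(i + n, length(lines))
  data.table::fread(text = lines[seq(i, end)], nThread = 1)
}

# Hypothetical wrapper: try VariantAnnotation first, fall back on any error
# (including the "_hts_rewind() failed" error reported above).
read_vcf_header_safe <- function(path, n = 2L) {
  tryCatch(
    VariantAnnotation::scanVcfHeader(path),
    error = function(e) {
      message("scanVcfHeader() failed, using manual header parsing: ",
              conditionMessage(e))
      read_header_fallback(path, n = n)
    }
  )
}
```

Taking only the first match of the "#CHR" search avoids an error from seq() if more than one line happens to start with that prefix.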
Excellent, I think this sounds like a solid solution to me. Thanks, Alan!
I'll keep you posted about any downstream issues. Currently dealing with one from my more manual solution above, with some weird errors about the rows being out of order (when they don't seem to be) during tabix indexing:
Added to master branch
1. Bug description
There seem to be some issues when trying to munge the Psychiatric Genomics Consortium (PGC) sumstats format, which is a bit different from the OpenGWAS format.
First of all, the file names end in ".vcf.tsv.gz", which is confusing and might be tripping up our code that infers file type from the file extension. I wonder whether this is happening due to a slight discrepancy between how read_sumstats and read_header infer file type, because format_sumstats does manage to get partway through before hitting an error (it even correctly counts the number of rows!).
Here's part of the header from one of these files:
Console output
Error from the first reprex below.
Also including the full message output. MungeSumstats_log_msg.txt
Expected behaviour
format_sumstats is able to run the full pipeline and produce a munged TSV.
2. Reproducible example
Code
This produces an error
But this works ok
I tried reformatting the file manually so it was undoubtedly a regular tsv file.
Data
The data can be downloaded here.
Session info