Closed SebastianHollizeck closed 5 years ago
Please ask questions about package use on the support site https://support.bioconductor.org .
Probably what you want to do is create a VcfFile() representing the VCF file, and specify a 'yield size' (typically 10,000 to 100,000 records at a time) representing the number of records to read per chunk:
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- VcfFile(fl, yieldSize = 5000)
Then, open the file and iterate through it
open(vcf)
repeat {
    result <- readVcf(vcf)
    if (length(result) == 0)
        break
    ## work on this chunk of the file
    message(nrow(result))
}
close(vcf)
There is an example on the help page ?readVcf
## ---------------------------------------------------------------------
## Iterate through VCF with 'yieldSize'
## ---------------------------------------------------------------------
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
tab <- TabixFile(fl, yieldSize=4000)
open(tab)
while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
    cat("vcf dim:", dim(vcf_yield), "\n")
close(tab)
The GenomicFiles package's reduceByYield() function might also be relevant.
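A rough sketch of how reduceByYield() might be used to count records chunk by chunk. The YIELD and MAP functions and the variable name total are illustrative choices, not something prescribed by the package, and the "hg19" genome label is simply carried over from the ?readVcf example above.

```r
library(VariantAnnotation)
library(GenomicFiles)

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- VcfFile(fl, yieldSize = 5000)

## Open explicitly so that successive reads continue where the last one ended
open(vcf)
total <- reduceByYield(
    vcf,
    YIELD = function(x, ...) readVcf(x, "hg19"),  # read the next chunk
    MAP = nrow,                                   # records in this chunk
    REDUCE = `+`,                                 # running total
    DONE = function(chunk) nrow(chunk) == 0L      # stop on an empty chunk
)
close(vcf)

message("records seen: ", total)
```

Only one chunk of at most yieldSize records is held in memory at a time; MAP shrinks each chunk to a single number before REDUCE combines them.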
Sorry if my message didn't come across. I consider this a bug or "not working as documented". I am quite capable of using other methods to read that VCF, but the documentation of the scanVcf methods says: "scanVcf with param="missing" and file="character" or file="connection" scan the entire file. With file="connection", an argument n indicates the number of lines of the VCF file to input; a connection open at the beginning of the call is open and incremented by n lines at the end of the call, providing a convenient way to stream through large VCF files."
If that is not the case, you might want to either remove that line from the documentation of scanVcf or fix the bug.
Cheers, Sebastian
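For reference, the "open connection is advanced by n lines" behaviour quoted above is how base R's readLines() treats connections. A minimal base-R sketch, assuming a made-up five-line text file rather than a real VCF (it does not use scanVcf at all):

```r
## Stand-in file; the contents are invented for illustration
tmp <- tempfile(fileext = ".txt")
writeLines(paste("record", 1:5), tmp)

## Unopened connection: readLines() opens it, reads, and closes it again,
## so every call starts from the first line
con <- file(tmp)
first_a <- readLines(con, n = 1)
first_b <- readLines(con, n = 1)
close(con)

## Connection opened beforehand: the read position persists between calls
con <- file(tmp, open = "r")
chunks <- list()
repeat {
    chunk <- readLines(con, n = 2)
    if (length(chunk) == 0L)
        break
    chunks[[length(chunks) + 1L]] <- chunk
}
close(con)
unlink(tmp)
```

The difference between file(tmp) and file(tmp, open = "r") is the point here: streaming only advances through the file when the connection is already open at the time of the call.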
Thanks, I did not understand from your original comment that the documentation was not accurate. I'll update the documentation.
Hmm, looking closer I see that it is documented and implemented, but with several bugs. I'll provide a patch.
This should be fixed when RELEASE_3_9 builds tonight at version 1.30.1, or the devel version builds tonight at 1.31.3.
Hey, sorry but this did not fix the issue.
I have VariantAnnotation 1.30.1 attached but I still get the same error. From what I can see, the error does not come from the .vcf_scan_connection function that you patched, because it does not have the "scanVcf: " prefix. It comes from somewhere else.
> line <- scanVcf(file=vcfFile,n=1)
Error: $ operator is invalid for atomic vectors
The only thing I can think of is the call result <- .Call(.scan_vcf_connection, txt, maps$samples, maps$fmap, maps$imap, maps$gmap, row.names). Also, you are passing the already-read lines to that function, which by its name is made for a connection and not for plain text. But I don't know enough about the structure of VariantAnnotation to be able to help.
However, the error still persists in the latest version.
I have
> packageVersion("VariantAnnotation")
[1] '1.30.1'
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf_file <- file(fl)
> vcf_line <- scanVcf(file=vcf_file,n=1)
> names(vcf_line[[1]])
[1] "rowRanges" "REF" "ALT" "QUAL" "FILTER" "INFO"
[7] "GENO"
> vcf_line[[1]]$REF
A DNAStringSet instance of length 1
width seq
[1] 1 G
Do you have a complete example that fails?
.scan_vcf_connection is called internally; probably it's visible in traceback() after the error occurs, or going forward:
> selectMethod("scanVcf", c("connection", "missing"))
Method Definition:
function (file, ..., param)
{
    .vcf_scan_connection(file, ...)
}
<bytecode: 0x7fd4c876b600>
<environment: namespace:VariantAnnotation>
Signatures:
        file         param
target  "connection" "missing"
defined "connection" "missing"
Oh I am very sorry, my R must have had a hiccup.
It is working fine now!
Thank you so much for the fix and sorry for the false alarm
No problem, thanks for the bug report.
Hi,
I was just trying to stream through a very big VCF because it is too big to store in memory, and saw that scanVcf theoretically has that capability. However, I get an error instead.
Here is the reproducible example
which just says "Error: $ operator is invalid for atomic vectors"
but I CAN use readLines(vcf_file, n=1)
also the session info