API to change output suffix to .gz

d-cameron commented 4 years ago

The majority of VCF handling bioinformatics libraries use a .vcf.gz suffix, even for block gziped output. writeVcf() with index=TRUE does not support this and forceably sets the suffix to .bgz.

The following commands do exactly the same thing:

writeVcf(vcf, "example.vcf", index=TRUE)
writeVcf(vcf, "example.vcf.bgz", index=TRUE)
writeVcf(vcf, "example.vcf.gz", index=TRUE)

Desired behaviour: specifying a .vcf.gz as the output file, actually writes to the output file instead of silently changing the suffix of the output file to .vcf.bgz.

d-cameron commented 4 years ago

https://github.com/PapenfussLab/gridss/issues/269

bschilder commented 2 years ago

I agree, this is very unexpected. Could a warning at least be generated when writeVcf changes the path name?

vjcitn commented 2 years ago

I suppose a message could be produced. The man page for bgzip in Rsamtools shows why this is happening. If you have time to make a PR the code of interest is at https://github.com/Bioconductor/VariantAnnotation/blob/e966a1b3cc3cede22c4d65fcada24be05206a65a/R/methods-writeVcf.R#L257

bschilder commented 2 years ago

In the interest of bettering VariantAnnotation, while also being mindful I'm not able to devote too much time to projects I'm not a maintainer or author of, I propose the following divvying of work:

@vjcitn

[ ] readVcf: implement the faster chunking method in readVcf discussed here (or other ways of speeding up this functionality).
[ ] vcf2df: I can include my initial version in my PR (ive made some further improvements since the original post), but after I've submitted the PR I'd ask that you optimise this further to reduce compute time as much as possible.

@bschilder

[ ] select_vcf_fields: Function that speeds up readVcf by only importing non-empty columns.
[ ] writeVcf: Add the message discussed in this present Issue.

Does this sound fair to you? I can get started once item # 1 on your list is completed (readVcf). That way the rest of the changes I plan to make will be optimised for the updated version of VariantAnnotation.

Best, Brian

hpages commented 1 year ago

@vjcitn @bschilder Are guys planning to follow up on this?

vjcitn commented 1 year ago

@bschilder did you make the PR that you mentioned? i cannot work on the other piece for some time.

bschilder commented 1 year ago

@hpages I didn't make this PR because @vjcitn determined it was beyond the scope of VariantAnnotation to include these functionalities (or at least some of them). So instead, I added them to our lab's package MungeSumstats. If you've changed your mind about this @vjcitn I'd be happy to share my existing code.

regarding readVcf: https://github.com/Bioconductor/VariantAnnotation/issues/59#issuecomment-1119026274

In summary, I personally have no basis for going into readVcf and trying to speed it up. It may be possible, but given that the original authors have left the project and there is a reasonable path to divide and conquer with the ingestion process, I am not inclined to do much more.
regarding vcf2df: https://github.com/Bioconductor/VariantAnnotation/issues/57#issuecomment-1125898909
regarding select_vcf_fields: Not sure a decision was reached on this. @vjcitn could you confirm whether you think this would be useful to integrate into VariantAnnotation? I seem to recall some concerns about omitting certain columns based on the sampling of only the first N rows.
regarding writeVcf: @vjcitn not sure I heard back about this, is this something you'd find helpful for me to implement?

Bioconductor / VariantAnnotation

API to change output suffix to .gz #35

@vjcitn

@bschilder