Open pamag opened 10 years ago
As far as I know, BWA only accepts gzipped FASTA files during indexing, or as sequences to align. In either of these cases, BWA only needs to read the files sequentially and does not do any random access via byte offsets.
bedtools getfasta
needs to seek to specific offsets within the FASTA file, and these offsets cannot be calculated from a gzipped file. This probably won't happen.
I would like to raise interest for this issue again. Our FASTA files have been compressed in BGZF (bgzip) to reduce space and are indexed using samtools faidx
. Quite a lot of tools are able to use these compressed+faidx fasta files. Why not bedtools getfasta
?
Is there any how I can help?
I often use GNU Parallel
< ranges.bed parallel --colsep '\t' -k samtools faidx /path/to/ref.fasta.gz '{1}:{2}-{3}' > ranges.fasta
But, since BED chromStart is 0-based, I use {=2 $_+=1 =}
to work around this:
< ranges.bed parallel --colsep '\t' -k samtools faidx /path/to/ref.fasta.gz '{1}:{=2 $_+=1 =}-{3}' > ranges.fasta
This is quite ugly and does not include nice options that bedtools getfasta
offers:
-name Use the name field for the FASTA header
-split given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
-tab Write output in TAB delimited format.
- Default is FASTA format.
-s Force strandedness. If the feature occupies the antisense,
strand, the sequence will be reverse complemented.
- By default, strand information is ignored.
-fullHeader Use full fasta header.
- By default, only the word before the first space or tab is used.
I’ll point out that pyfaidx supports BGZF compresses FASTA and the faidx
script supports bed files as input.
faidx -b ranges.bed /path/to/ref.fasta.gz
If BEDtools would use HTSlib, it would get this support for free: http://www.htslib.org/
header file: htslib/faidx.h
HTSlib could be useful for general VCF/BCF/SAM/BAM/CRAM, gzipped input files indexed by bgzip, reading files over HTTP support in BEDtools.
This is on the TODO list for this year - hoping to pull it off soon.
Bump Is there any update planned?
we are in the process of moving to htslib support and this functionality will be part of the change.
we are in the process of moving to htslib support and this functionality will be part of the change.
Wonderful. Thanks for the prompt answer.
@jmarshall @brentp @38 is there an API in htslib for creating indexes of FASTA files (uncompressed or otherwise) for reading querying indexed (and possibly BGZF'ed) files in htslib? If so, it could be leveraged to address this issue.
yes, see: https://github.com/samtools/htslib/blob/develop/htslib/faidx.h I can have a look at this in a couple weeks (unless @jmarshall beats me too it).
Thank you all for your work in developing and supporting bedtools
. I have a project that would benefit from bedtools getfasta
taking gzipped FASTA as input.
@brentp would you mind tackling this rather useful enhancement?
yes.
Hi bedtools Devs! Following up on this thread, is bedtools supposed to be able to read bgzip'd files at this time (bedtools v2.30.0-38-g38975ff2)?
I have a bgzip'd and indexed reference:
18:16:15|login28|path2.1>>> ls -1 path2.fasta.gz*
-rw-r----- 1 bredeson 137K Apr 7 18:09 path2.fasta.gz
-rw-r----- 1 bredeson 104 Apr 7 18:10 path2.fasta.gz.gzi
-rw-r----- 1 bredeson 23 Apr 7 18:10 path2.fasta.gz.fai
18:16:43|login28|path2.1>>> zcat path2.fasta.gz | head -1
>path2
but when I attempt to use getfasta
from it with a simple test BED file:
18:16:47|login28|path2.1>>> cat test.bed
path2 0 100
I get corrupted encoded output:
18:18:24|login28|path2.1>>> bedtools getfasta -fi path2.fasta.gz -bed test.bed >test.fa
18:19:08|login28|path2.1>>> cat test.fa
>path2:0-100
ÿBCNm½Ir£KÌe9çrjiÃÀÃÔþ-8ç:õ~«×(Bù5îhnRÿëÿýßÿßÿù>UÛݵÿþéÝù÷ÝÌToÍ÷áíï#µ³ÿ¾|¿Î¿ÿý
18:19:18|login28|path2.1>>> mv test.fa test.fa.gz && gunzip test.fa.gz
gzip: test.fa.gz: not in gzip format
-Jessen
Hi @bredeson , can you pull the latest from github, re-build, and try again? You will need this version:
$ ./bin/bedtools --version
bedtools v2.30.0-66-gb891a0b6
Hi @brentp, This newer version works! Thanks!
I think this issue can be closed?
Hi Aron, As we have all our genomic fasta gziped for using with bwa, I was wondering if bedtools getfasta can work with gzipped data.
Pablo.