arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
927 stars 287 forks source link

could bedtools getfasta work with gzipped fasta? #56

Open pamag opened 10 years ago

pamag commented 10 years ago

Hi Aron, As we have all our genomic fasta gziped for using with bwa, I was wondering if bedtools getfasta can work with gzipped data.

Pablo.

mdshw5 commented 9 years ago

As far as I know, BWA only accepts gzipped FASTA files during indexing, or as sequences to align. In either of these cases, BWA only needs to read the files sequentially and does not do any random access via byte offsets.

bedtools getfasta needs to seek to specific offsets within the FASTA file, and these offsets cannot be calculated from a gzipped file. This probably won't happen.

hzpc-joostk commented 7 years ago

I would like to raise interest for this issue again. Our FASTA files have been compressed in BGZF (bgzip) to reduce space and are indexed using samtools faidx. Quite a lot of tools are able to use these compressed+faidx fasta files. Why not bedtools getfasta?

Is there any how I can help?

I often use GNU Parallel

< ranges.bed parallel --colsep '\t' -k samtools faidx /path/to/ref.fasta.gz '{1}:{2}-{3}' > ranges.fasta

But, since BED chromStart is 0-based, I use {=2 $_+=1 =} to work around this:

< ranges.bed parallel --colsep '\t' -k samtools faidx /path/to/ref.fasta.gz '{1}:{=2 $_+=1 =}-{3}' > ranges.fasta

This is quite ugly and does not include nice options that bedtools getfasta offers:

    -name   Use the name field for the FASTA header
    -split  given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
    -tab    Write output in TAB delimited format.
            - Default is FASTA format.

    -s      Force strandedness. If the feature occupies the antisense,
            strand, the sequence will be reverse complemented.
            - By default, strand information is ignored.

    -fullHeader     Use full fasta header.
            - By default, only the word before the first space or tab is used.
mdshw5 commented 7 years ago

I’ll point out that pyfaidx supports BGZF compresses FASTA and the faidx script supports bed files as input.

faidx -b ranges.bed /path/to/ref.fasta.gz 
ghuls commented 7 years ago

If BEDtools would use HTSlib, it would get this support for free: http://www.htslib.org/

header file: htslib/faidx.h

HTSlib could be useful for general VCF/BCF/SAM/BAM/CRAM, gzipped input files indexed by bgzip, reading files over HTTP support in BEDtools.

arq5x commented 7 years ago

This is on the TODO list for this year - hoping to pull it off soon.

Xethic commented 5 years ago

Bump Is there any update planned?

arq5x commented 5 years ago

we are in the process of moving to htslib support and this functionality will be part of the change.

Xethic commented 5 years ago

we are in the process of moving to htslib support and this functionality will be part of the change.

Wonderful. Thanks for the prompt answer.

arq5x commented 5 years ago

@jmarshall @brentp @38 is there an API in htslib for creating indexes of FASTA files (uncompressed or otherwise) for reading querying indexed (and possibly BGZF'ed) files in htslib? If so, it could be leveraged to address this issue.

brentp commented 5 years ago

yes, see: https://github.com/samtools/htslib/blob/develop/htslib/faidx.h I can have a look at this in a couple weeks (unless @jmarshall beats me too it).

sbooeshaghi commented 1 year ago

Thank you all for your work in developing and supporting bedtools. I have a project that would benefit from bedtools getfasta taking gzipped FASTA as input.

arq5x commented 1 year ago

@brentp would you mind tackling this rather useful enhancement?

brentp commented 1 year ago

yes.

bredeson commented 1 year ago

Hi bedtools Devs! Following up on this thread, is bedtools supposed to be able to read bgzip'd files at this time (bedtools v2.30.0-38-g38975ff2)?

I have a bgzip'd and indexed reference:

18:16:15|login28|path2.1>>> ls -1 path2.fasta.gz*
-rw-r----- 1 bredeson 137K Apr  7 18:09 path2.fasta.gz
-rw-r----- 1 bredeson  104 Apr  7 18:10 path2.fasta.gz.gzi
-rw-r----- 1 bredeson   23 Apr  7 18:10 path2.fasta.gz.fai
18:16:43|login28|path2.1>>> zcat path2.fasta.gz | head -1
>path2

but when I attempt to use getfasta from it with a simple test BED file:

18:16:47|login28|path2.1>>> cat test.bed 
path2   0   100

I get corrupted encoded output:

18:18:24|login28|path2.1>>> bedtools getfasta -fi path2.fasta.gz -bed test.bed >test.fa
18:19:08|login28|path2.1>>> cat test.fa
>path2:0-100
ÿBCNm½Ir£KÌe9çrjiÃÀÃÔþ-8ç:õ~«×(Bù5îhnRÿëÿýßÿßÿù>UÛݵÿþéÝù÷ÝÌToÍ÷áíï#µ³ÿ¾|¿Î¿ÿý
18:19:18|login28|path2.1>>> mv test.fa test.fa.gz && gunzip test.fa.gz
gzip: test.fa.gz: not in gzip format

-Jessen

brentp commented 1 year ago

Hi @bredeson , can you pull the latest from github, re-build, and try again? You will need this version:

$ ./bin/bedtools --version
bedtools v2.30.0-66-gb891a0b6
bredeson commented 1 year ago

Hi @brentp, This newer version works! Thanks!

nathanweeks commented 3 months ago

I think this issue can be closed?