arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
942 stars 287 forks source link

circular genomes and getfasta #882

Open mmp3 opened 3 years ago

mmp3 commented 3 years ago

For many circular genomes, gene annotations in NCBI RefSeq overlap the "end" of the linear representation, i.e. the arbitrary nucleotide chosen to be "position 0" for the linear representation of the circular genome happens to be in the middle of a gene. NCBI RefSeq has chosen to represent gene annotations in this case as extending beyond the "end" of the linear representation.

Such annotations cause getfasta to throw errors. For example, consider RefSeq genome NC_015457.1 (Shigella phage Shfl2). It has gene gene-Shfl2p279 that spans the end/beginning of the linear representation of the genome. getfasta throws the following error:

Feature (NC_015457.1:165648-167826) beyond the length of NC_015457.1 size (165919 bp). Skipping.

Workarounds can be cumbersome.

One workaround is: scanning a BED file to identify annotations beyond the genome length, splitting them into two BED entries, and then stitching together the extracted FASTA sequences while respecting strand orientation. This workaround is trickier still when a genome consists of multiple sequences (chromosomes, contigs, etc.).

Another workaround is: for each sequence in the input fasta file, append a large segment (e.g. 20k nt) from the beginning of each input sequence to the end of itself, thus mimicking the circularity of the genome to an extent. Then, annotations that had extended "beyond the end" will not actually extend beyond the end in this augmented, pseudo-circular representation. This requires the generation of a new fasta file prior to getfasta.

Suggestion: a new option called --circular in getfasta would be awfully convenient! :)

arq5x commented 3 years ago

This is a nice idea and I will try to incorporate a solution in the release following the one I am about to drop. I need to think about how best to this.

cjfiscus commented 1 year ago

Was a solution to this problem ever implemented? I'm currently working through this and am planning to pull sequence from the beginning of each genome to the end to bypass the issue but I agree that this solution is cumbersome.