Open mmp3 opened 3 years ago
This is a nice idea and I will try to incorporate a solution in the release following the one I am about to drop. I need to think about how best to this.
Was a solution to this problem ever implemented? I'm currently working through this and am planning to pull sequence from the beginning of each genome to the end to bypass the issue but I agree that this solution is cumbersome.
For many circular genomes, gene annotations in NCBI RefSeq overlap the "end" of the linear representation, i.e. the arbitrary nucleotide chosen to be "position 0" for the linear representation of the circular genome happens to be in the middle of a gene. NCBI RefSeq has chosen to represent gene annotations in this case as extending beyond the "end" of the linear representation.
Such annotations cause
getfasta
to throw errors. For example, consider RefSeq genome NC_015457.1 (Shigella phage Shfl2). It has genegene-Shfl2p279
that spans the end/beginning of the linear representation of the genome.getfasta
throws the following error:Feature (NC_015457.1:165648-167826) beyond the length of NC_015457.1 size (165919 bp). Skipping.
Workarounds can be cumbersome.
One workaround is: scanning a BED file to identify annotations beyond the genome length, splitting them into two BED entries, and then stitching together the extracted FASTA sequences while respecting strand orientation. This workaround is trickier still when a genome consists of multiple sequences (chromosomes, contigs, etc.).
Another workaround is: for each sequence in the input fasta file, append a large segment (e.g. 20k nt) from the beginning of each input sequence to the end of itself, thus mimicking the circularity of the genome to an extent. Then, annotations that had extended "beyond the end" will not actually extend beyond the end in this augmented, pseudo-circular representation. This requires the generation of a new fasta file prior to
getfasta
.Suggestion: a new option called
--circular
ingetfasta
would be awfully convenient! :)