Open sjackman opened 9 years ago
Hmm, is this standard/expected? I have frankly never seen that before.
I believe that ##FASTA
is standard. That's the format output by Prokka. @tseemann
You learn something new every day. Is it one entire section of ##FASTA or are such sections interleaved throughout the file?
I think it's just one ##FASTA
section at the end of the file, but I don't know the for-certain answer to that question. Perhaps @tseemann knows.
Hi, It's still odd that it failed, as the ##FASTA should have been recognized as a header line, added to the header, and skipped, even if encountered after lines of actual data. I'll look into this. Thanks for reporting it.
Hi Neil, The issue is the actual FASTA sequence line thereafter.
Oooooh. Sorry, didn't read carefully enough. Well I guess we could rig it to skip that for GFF files, now that we know this is possible.
Here's the reference. Search for ##FASTA
.
http://www.sequenceontology.org/gff3.shtml
##FASTA
This notation indicates that the annotation portion of the file is at an end and that the remainder of the file contains one or more sequences (nucleotide or protein) in FASTA format. This allows features and sequences to be bundled together. All FASTA sequences included in the file must be included together at the end of the file and may not be interspersed with the features lines. Once a ##FASTA section is encountered no other content beyond valid FASTA sequence is allowed.
Thanks Shaun. We won't be able to get a fix for this into the forthcoming release, but we will try to tackle it for the subsequent release. Thanks for pointing this out!
No worries. It's easy enough to remove the ##FASTA
section with sed '/^##FASTA$/,$d'
It's not 100% clear from the GFF3 spec if you can not have the ##FASTA
section and still put >fasta
sequences at the end anyway. I've seen files like that. And BioPerl allows it when parsing in.
But it is true that have to be at the end only. I actually think it would be nice to allow interleaving. Sort of a "literate annotation" system with the isoforms and genes etc within the annotation. I once had a colleague who hand-annotated the whole malaria genome by inserting markup and spacing etc within the genome FASTA file. it was machine parseable - I converted it to Genbank at the time!
You've probably never seen ##FASTA
because it doesn't make sense for massive genomes with long term references. But for bacteria, the genomes are equally frequent as the annotations!
This problem seems to still exist and the suggested workaround doesn't actually work (you need to remove everything after and including the ##FASTA
line).
Here's my workaround
#!/bin/bash
#
# Script finds first line that begins with `##FASTA` and removes that and all following lines.
#
# Background: Prokka generates GFF files with FASTA sequences at the end.
# The program `bedtools` doesn't like this and throws an error.
#
INFILE=$1;
OUTFILE=${INFILE/.gff/_no-fasta.gff};
SEARCH_TERM="^##FASTA"
# Find line number using grep, remove additional info returned by grep and then reduce by 1
# Line number code taken from: https://stackoverflow.com/a/22020562/5322644
LINE_NUM=$(( $(grep -n -m 1 $SEARCH_TERM $INFILE |sed 's/\([0-9]*\).*/\1/') - 1))
if [ $INFILE != $OUTFILE ]; then
head -n $LINE_NUM "$INFILE" > "$OUTFILE";
fi
echo "Successfully removed fasta info from $INFILE and saved to $OUTFILE."
I'm amazed more folks don't encounter this problem.
a.gff