arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
934 stars 287 forks source link

bedtools intersect fails when GFF file contains a ##FASTA section #235

Open sjackman opened 9 years ago

sjackman commented 9 years ago
❯❯❯ bedtools intersect -v -a a.gff -b b.gff
Error: line number 11752 of file a.gff has 1 fields, but 9 were expected.

a.gff

…
38  prokka  gene    981 1121    .   -   1   ID=OU3MT_05875_gene;locus_tag=OU3MT_05875
##FASTA
>1
TTACAAGCTTGAAAGAATGAATGACATCCTTTTTCTTCCCTCAATGCTATAGAAATAAGG
…
arq5x commented 9 years ago

Hmm, is this standard/expected? I have frankly never seen that before.

sjackman commented 9 years ago

I believe that ##FASTA is standard. That's the format output by Prokka. @tseemann

arq5x commented 9 years ago

You learn something new every day. Is it one entire section of ##FASTA or are such sections interleaved throughout the file?

sjackman commented 9 years ago

I think it's just one ##FASTA section at the end of the file, but I don't know the for-certain answer to that question. Perhaps @tseemann knows.

nkindlon commented 9 years ago

Hi, It's still odd that it failed, as the ##FASTA should have been recognized as a header line, added to the header, and skipped, even if encountered after lines of actual data. I'll look into this. Thanks for reporting it.

arq5x commented 9 years ago

Hi Neil, The issue is the actual FASTA sequence line thereafter.

nkindlon commented 9 years ago

Oooooh. Sorry, didn't read carefully enough. Well I guess we could rig it to skip that for GFF files, now that we know this is possible.

sjackman commented 9 years ago

Here's the reference. Search for ##FASTA. http://www.sequenceontology.org/gff3.shtml

##FASTA

This notation indicates that the annotation portion of the file is at an end and that the remainder of the file contains one or more sequences (nucleotide or protein) in FASTA format. This allows features and sequences to be bundled together. All FASTA sequences included in the file must be included together at the end of the file and may not be interspersed with the features lines. Once a ##FASTA section is encountered no other content beyond valid FASTA sequence is allowed.

arq5x commented 9 years ago

Thanks Shaun. We won't be able to get a fix for this into the forthcoming release, but we will try to tackle it for the subsequent release. Thanks for pointing this out!

sjackman commented 9 years ago

No worries. It's easy enough to remove the ##FASTA section with sed '/^##FASTA$/,$d'

tseemann commented 9 years ago

It's not 100% clear from the GFF3 spec if you can not have the ##FASTA section and still put >fasta sequences at the end anyway. I've seen files like that. And BioPerl allows it when parsing in.

But it is true that have to be at the end only. I actually think it would be nice to allow interleaving. Sort of a "literate annotation" system with the isoforms and genes etc within the annotation. I once had a colleague who hand-annotated the whole malaria genome by inserting markup and spacing etc within the genome FASTA file. it was machine parseable - I converted it to Genbank at the time!

You've probably never seen ##FASTA because it doesn't make sense for massive genomes with long term references. But for bacteria, the genomes are equally frequent as the annotations!

mikegilchrist commented 1 year ago

This problem seems to still exist and the suggested workaround doesn't actually work (you need to remove everything after and including the ##FASTA line).

Here's my workaround

#!/bin/bash
#
# Script finds first line that begins with `##FASTA` and removes that and all following lines.
#
# Background: Prokka generates GFF files with FASTA sequences at the end.
#   The program `bedtools` doesn't like this and throws an error.
#

INFILE=$1;
OUTFILE=${INFILE/.gff/_no-fasta.gff};
SEARCH_TERM="^##FASTA"

# Find line number using grep, remove additional info returned by grep and then reduce by 1
# Line number code taken from: https://stackoverflow.com/a/22020562/5322644
LINE_NUM=$(( $(grep -n -m 1 $SEARCH_TERM $INFILE |sed  's/\([0-9]*\).*/\1/') - 1))
if [ $INFILE != $OUTFILE ]; then
   head -n $LINE_NUM "$INFILE" > "$OUTFILE";
fi

echo "Successfully removed fasta info from $INFILE and saved to $OUTFILE."

I'm amazed more folks don't encounter this problem.