GMOD / GBrowse

the Generic Genome Browser
http://gmod.org/wiki/GBrowse
Other
49 stars 37 forks source link

gtf2gff3.pl doesn't work with recent versions of Ensembl #35

Open keiranmraine opened 10 years ago

keiranmraine commented 10 years ago

Hi,

I've found that some transcripts don't get converted correctly when using GTF files downloaded from Ensembl since they started attaching the correct bio-type to the genes/transcripts.

I have a set of files I can email to an appropriate account, details of them are below.

In the examples e58 versions result in correct protein coding records that display in GBrowse fine. Under e71+ BRAF (plus many others) doesn't display under GBrowse if you choose a source of protein_coding . I think this is because the same ENSG can link to protein_coding and, in this example, nonsense_mediated_decay and retained_intron. In this case the gene record is tagged as retained_intron.

I think the solution is to have a gene record for each but I'm not too sure if this is possible as this will result in 3 gene entries for the same ENSG (but each would be for a different source and potentially a different length).

Hope I've provided sufficient info to get this looked at quickly as it's preventing us from upgrading our Ensembl version in the browser.

Kind regards,

Keiran Raine Principal Bioinformatician Cancer Genome Project Wellcome Trust Sanger Institute

BRAF_e58.gtf/gff3 are the GTF records and corresponding gff3 generated using the default cfg found on sequenceontology.org.

BRAF_e71.gtf/gff3 as above.

grep_for_ENST00000496384.txt is the result of grepping the two gff3 files for what the filename describes... this suggest this bit is fine

grep_for_ID=ENSG00000157764.txt is the result of grepping the two gff3 files for what the filename describes...in e71 the gene has been marked as a retained intron.

scottcain commented 10 years ago

Hi Keiran,

There are actually two different versions of this script--one that is part of the GBrowse distribution and one that is hosted at the Sequence Ontology. I'm working on figuring out which one is "best" and where it will be housed going forward; in that process, I'll address this bug.

Scott

keiranmraine commented 10 years ago

Hi Scott,

I've had a little time to have a look and I think the following will make it work for ensembl data where 'gene_biotype' has been defined (and so work with other incarnations of gtf).

diff --git a/bin/gtf2gff3.pl b/bin/gtf2gff3.pl
index 1058960..663afcf 100755
--- a/bin/gtf2gff3.pl
+++ b/bin/gtf2gff3.pl
@@ -154,6 +154,11 @@ sub parse_gtf {

                $feature_type = $INPUT_FEATURE_MAP->{$feature_type};

+    if(exists $attributes->{'gene_biotype'}) {
+      $source = $attributes->{'gene_biotype'}->[0];
+      delete $attributes->{'gene_biotype'};
+    }
+
                #Note here that we're mapping between GTF/GFF and GFF3
                #nomeclature for the hash keys
                my $feature = {seq_id     => $seqname,

It would be ideal if the 'transcript biotype' could be appended to the attributes output in the gff3 but that's a nicety.

Regards, Keiran

keiranmraine commented 10 years ago

There's a typo too:

@@ -1038,7 +1043,7 @@ sub validate_and_build_gene {

                print STDERR "ERROR: seq_id conflict: " .
                  "validate_and_build_gene\n" if $seq_id ne $trnsc->{seq_id};
-               print STDERR "ERROR: sourc`e conflict: " .
+               print STDERR "ERROR: source conflict: " .
                  "validate_and_build_gene\n" if $source ne $trnsc->{source};
                print STDERR "ERROR: strand conflict: " .
                  "validate_and_build_gene\n" if $strand ne $trnsc->{strand};