Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
449 stars 151 forks source link

Question regarding GTF or GFF usage. #1554

Open eprdz opened 10 months ago

eprdz commented 10 months ago

In order to use VEP, I made an AUGUSTUS GFF annotation file. Here you can see the GFF structure for the first gene:

SuperScaffold   AUGUSTUS        gene    1       5061    0.02    -       .       ID=g1
SuperScaffold   AUGUSTUS        transcript      1       5061    0.02    -       .       ID=g1.t1;Parent=g1
SuperScaffold   AUGUSTUS        transcription_end_site  1       1       .       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        exon    1       1892    .       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        stop_codon      14      16      .       -       0       Parent=g1.t1
SuperScaffold   AUGUSTUS        intron  1893    1965    1       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        intron  2100    2632    0.95    -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        intron  2756    2885    0.99    -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        intron  3483    3737    0.95    -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        CDS     14      1892    1       -       1       ID=g1.t1.cds;Parent=g1.t1
SuperScaffold   AUGUSTUS        CDS     1966    2099    0.99    -       0       ID=g1.t1.cds;Parent=g1.t1
SuperScaffold   AUGUSTUS        exon    1966    2099    .       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        CDS     2633    2755    0.99    -       0       ID=g1.t1.cds;Parent=g1.t1
SuperScaffold   AUGUSTUS        exon    2633    2755    .       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        CDS     2886    3482    0.97    -       0       ID=g1.t1.cds;Parent=g1.t1
SuperScaffold   AUGUSTUS        exon    2886    3482    .       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        CDS     3738    3794    0.42    -       0       ID=g1.t1.cds;Parent=g1.t1
SuperScaffold   AUGUSTUS        exon    3738    3794    .       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        start_codon     3792    3794    .       -       0       Parent=g1.t1
SuperScaffold   AUGUSTUS        exon    4993    5061    .       -       .       Parent=g1.t1
SuperScaffold   AUGUSTUS        transcription_start_site        5061    5061    .       -       .       Parent=g1.t1

Nevertheless, all variants from my VCF file appeared as non-coding, although I know that some of them are coding variants.

I read in the GTF and GFF format expectations that if the annotation file has GFF format, it is necessary that in the 9th column appears a biotype parameter (indeed, the AUGUSTUS GFF file did not have this parameter in the 9th column) Nevertheless, for the GTF file, if the biotype parameter does not appear in the 9th column, the biotype will be inferred from the source (2nd column).

In checked your code and indeed you do this. This code is from ensembl‑vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File/GTF.pm:

=head2 _record_get_biotype

  Arg 1      : hashref $transcript_record_hash
  Example    : $biotype = $as->_record_get_biotype($tr_record);
  Description: Get sequence ontology (SO) biotype of this record. Attempts to
               find it in the "biotype", "transcript_type" or "transcript_biotype"
               attribute fields, and if that fails default to the source field.
  Returntype : string
  Exceptions : none
  Caller     : general
  Status     : Stable

=cut

sub _record_get_biotype {
  return
    $_[1]->{attributes}->{transcript_biotype} ||
    $_[1]->{attributes}->{transcript_type} ||
    $_[1]->{attributes}->{biotype} ||
    $_[1]->{source};
}

1;

And this is from ensembl‑vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File/GFF.pm:

=head2 _record_get_biotype

  Arg 1      : hashref $transcript_record_hash
  Arg 2      : hashref $gene_record_hash
  Example    : $biotype = $as->_record_get_biotype($tr_record, $gene_record);
  Description: Get sequence ontology (SO) biotype of this record. Attempts to
               find it in the "biotype", "transcript_type" or
               "transcript_biotype" attribute fields,
               and if that fails (as it will for RefSeq GFFs), make an
               educated guess looking at the record type and various other
               attributes.
  Returntype : string
  Exceptions : none
  Caller     : general
  Status     : Stable

=cut

sub _record_get_biotype {
  my ($self, $record, $gene_record) = @_;

  if(!exists($record->{_biotype})) {

    # Ensembl-y GFFs have biotype as an attribute
    my $biotype = $record->{attributes}->{biotype} ||
                  $record->{attributes}->{transcript_type} ||
                  $record->{attributes}->{transcript_biotype};

    # others we need to (guess) work it out
    if(!$biotype) {
      my $type = lc($record->{type});

      if($type eq 'mrna') {
        $biotype = 'protein_coding';
      }
      elsif($type eq 'ncrna') {
        $biotype = $record->{attributes}->{ncrna_class};
      }
      elsif($type =~ /^([a-z]+)_gene_segment$/) {
        $biotype = 'IG_'.uc($1).'_gene';
      }
      elsif($gene_record && ($gene_record->{attributes}->{description} || '') =~ /^microRNA/) {
        $biotype = 'miRNA';
      }
      elsif($record->{attributes}->{gbkey}) {
        $biotype = $record->{attributes}->{gbkey};
      }
    }

    $record->{_biotype} = $biotype;
  }

  return $record->{_biotype};
}

Can the biotype be inferred from the second column in the GFF file too? If not, why? Thank you in advance.

likhitha-surapaneni commented 10 months ago

Hi @eprdz ,

Thank you for reaching out. The logic for fetching biotype makes sense to be consistent across GFF and GTF file. We are going to work on this and keep you updated.

Kind regards, Likhitha

eprdz commented 10 months ago

Thank you for answering me. I was wondering if you know which of the 2 is the correct approach or if there is a third option. Can you point me in the right direction (documentation, a colleague) to find out more about this so I can get a timely solution for this? By now, what I do is execute AGAT to transform GFF files to GTF v.2.5 files in order to overcome the need of having the biotype in the 9th column. Do you think this is feasible? Thanks in advance for your help.

likhitha-surapaneni commented 10 months ago

Hi @eprdz , Thank you for your patience.

In GFF files, biotype is fetched from "biotype", "transcript_type", "transcript_biotype" of the attribute fields, there are also some additional checks for biotype. In the future update, if the biotype is still empty in spite of all the checks, then it will be fetched from the source column (similar to GTF). Converting to GTF file should be feasible in the meantime.

Kind regards, Likhitha