galaxy-genome-annotation / python-apollo

Python library for talking to Apollo API
MIT License

Can't load_gff3 #63

Open RenanFerreira0412 opened 2 months ago

RenanFerreira0412 commented 2 months ago

Hi everyone!

I’m using the arrow command arrow annotations load_gff3 to load a full GFF3 into an annotation track, but nothing's happening.

The version of my plugin: apollo 4.2.13

Command: arrow annotations load_gff3 [OPTIONS] ORGANISM GFF3

My command: arrow annotations load_gff3 Leishmania /home/renanigor/Downloads/TriTrypDB-67_LdonovaniBPK282A1.gff

Note: I'm running Apollo in Docker.

My organism:

arrow organisms show_organism Leishmania
{
  "commonName": "Leishmania",
  "blatdb": "/data/temporary/apollo_data/34-Leishmania/seq/Leishmania.fa.2bit",
  "metadata": "{\"creator\":\"32\"}",
  "annotationCount": 2,
  "currentOrganism": true,
  "obsolete": false,
  "sequences": 36,
  "directory": "/data/temporary/apollo_data/34-Leishmania",
  "publicMode": false,
  "valid": true,
  "genomeFastaIndex": "seq/Leishmania.fa.fai",
  "genus": null,
  "species": "donovani",
  "id": 34,
  "nonDefaultTranslationTable": null,
  "genomeFasta": "seq/Leishmania.fa"
}

I really don’t know what the actual problem is because there are no error log messages.

When I run the command, the output is just empty braces.

(apollo_env) renanigor@pop-os:~/VirtualEnvs$ arrow annotations load_gff3 Leishmania /home/renanigor/Downloads/TriTrypDB-67_LdonovaniBPK282A1.gff
{}

Does anyone know how I can fix this?

hexylena commented 2 months ago

Could you try with increased logging: arrow --verbose -l debug annotations load_gff3? That'll give us more information as to why it's failing.

RenanFerreira0412 commented 2 months ago

Now it's processing all the sequences from my GFF file, but when I refresh the Apollo page, the features are still not loaded into the annotation track.

[Screenshot: apollo]

The GFF file and the FASTA file with the sequence that I'm using can be found here: https://tritrypdb.org/tritrypdb/app/downloads/Current_Release/LdonovaniBPK282A1/

Note: My GFF file has 36 sequences.

The output was too long to post in full, so this is just the final part of it.

. . .
DEBUG:root:unknown type protein_coding_gene
INFO:root:Processing Ld36_v01s1 with features: [SeqFeature(SimpleLocation(ExactPosition(1019), ExactPosition(1163), strand=-1), type='protein_coding_gene', id='LdBPK_360010.1', qualifiers=...), SeqFeature(SimpleLocation(ExactPosition(3957), ExactPosition(4260), strand=-1), type='protein_coding_gene', id='LdBPK_360020.1', qualifiers=...), SeqFeature(SimpleLocation(ExactPosition(6202), ExactPosition(6661), strand=-1), type='protein_coding_gene', id='LdBPK_360030.1', qualifiers=...), ...
. . .
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:unknown type protein_coding_gene
DEBUG:root:writing out: []
DEBUG:root:empty list, no more features to write
DEBUG:root:writing out: []
DEBUG:root:empty list, no more features to write
INFO:root:Finished loading
{}

hexylena commented 2 months ago

What does your GFF look like? I'm guessing it doesn't match our expected structure, hence this result.

Edit: ah, you linked to it. OK, I'll take a look when I can (apologies, not much spare time currently).

hexylena commented 2 months ago

Looking at the GFF, it does roughly follow the expected model, except that the top-level feature type is protein_coding_gene rather than just gene.

Ld01_v01s1      VEuPathDB       protein_coding_gene     3662    4663    .       -       .       ID=LdBPK_010010.1;description=Protein of unknown function (DUF2946)%2C putative;ebi_biotype=protein_coding
Ld01_v01s1      VEuPathDB       mRNA    3662    4663    .       -       .       ID=LdBPK_010010.1.1;Parent=LdBPK_010010.1;description=Protein of unknown function (DUF2946)%2C putative;gene_ebi_biotype=protein_coding
Ld01_v01s1      VEuPathDB       exon    3662    4663    .       -       .       ID=exon_LdBPK_010010.1.1-E1;Parent=LdBPK_010010.1.1;gene_id=LdBPK_010010.1
Ld01_v01s1      VEuPathDB       CDS     3662    4663    .       -       0       ID=LdBPK_010010.1.1-p1-CDS1;Parent=LdBPK_010010.1.1;gene_id=LdBPK_010010.1;protein_source_id=LdBPK_010010.1.1-p1

It could be fixed either by changing protein_coding_gene to gene in your GFF file, or by updating python-apollo.

https://github.com/GMOD/Apollo/blob/develop/client/apollo/js/SequenceOntologyUtils.js#L55 suggests that it's a valid feature type as far as Apollo is concerned, so we should probably expand python-apollo to include some of these other terms (@abretaud, what do you think?). Until now we've been cautious and only supported structures we've seen before, lest this library cause any issues. It looks like ncRNA_gene is also used, so there are clearly multiple top-level feature types we've never seen before.
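To see which top-level feature types a GFF3 file actually uses, something like the following could help. This is only a rough sketch: the function name `top_level_types` is made up for illustration, and it assumes that "top level" simply means a feature line with no Parent attribute. It does not use python-apollo at all.

```python
from collections import Counter


def top_level_types(gff_lines):
    """Count the type (column 3) of every feature that has no Parent attribute.

    Hypothetical helper for illustration; not part of python-apollo.
    """
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        attributes = cols[8].split(";")
        if not any(a.strip().startswith("Parent=") for a in attributes):
            counts[cols[2]] += 1
    return counts


# Shortened lines modelled on the excerpt above.
sample = [
    "##gff-version 3",
    "Ld01_v01s1\tVEuPathDB\tprotein_coding_gene\t3662\t4663\t.\t-\t.\tID=LdBPK_010010.1",
    "Ld01_v01s1\tVEuPathDB\tmRNA\t3662\t4663\t.\t-\t.\tID=LdBPK_010010.1.1;Parent=LdBPK_010010.1",
    "Ld01_v01s1\tVEuPathDB\texon\t3662\t4663\t.\t-\t.\tID=exon_LdBPK_010010.1.1-E1;Parent=LdBPK_010010.1.1",
]
print(top_level_types(sample))
```

For a real file, pass an open file handle instead of the sample list. Any type in the result other than gene is a candidate for the kind of mismatch described here.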

You can patch this yourself quickly by editing apollo/util.py to add your types to the gene_types list, which may be faster than waiting for a new release of this library.

abretaud commented 2 months ago

Yeah, we could support other top-level feature types. I have no time to change the code for now, but feel free to propose a PR (or just modify the input GFF to use the expected gene type).
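The "modify the input GFF" route could look roughly like this. It is a minimal sketch, not something shipped with python-apollo: `rewrite_gene_types` and the file names in the usage note are hypothetical, and the default mapping only covers protein_coding_gene. Whether other types such as ncRNA_gene should also be mapped to gene is a judgment call for your data.

```python
def rewrite_gene_types(lines, mapping=None):
    """Yield GFF3 lines with the type column (column 3) renamed per `mapping`.

    Hypothetical helper for illustration; not part of python-apollo.
    Comments, directives and any embedded ##FASTA section pass through untouched.
    """
    if mapping is None:
        mapping = {"protein_coding_gene": "gene"}
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        if in_fasta or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        # Only touch well-formed feature lines whose type is in the mapping.
        if len(cols) == 9 and cols[2] in mapping:
            cols[2] = mapping[cols[2]]
            line = "\t".join(cols)
        yield line


# Small demonstration on two shortened lines from the excerpt.
demo = [
    "Ld01_v01s1\tVEuPathDB\tprotein_coding_gene\t3662\t4663\t.\t-\t.\tID=LdBPK_010010.1\n",
    "Ld01_v01s1\tVEuPathDB\tmRNA\t3662\t4663\t.\t-\t.\tID=LdBPK_010010.1.1;Parent=LdBPK_010010.1\n",
]
for converted in rewrite_gene_types(demo):
    print(converted, end="")
```

To convert a whole file, open the original for reading and a new file for writing (for example `input.gff` and `output.gff`, both placeholder names) and pass the result of `rewrite_gene_types(src)` to `dst.writelines(...)`. Writing to a new file keeps the original download intact in case the mapping needs adjusting.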

RenanFerreira0412 commented 2 months ago

Oh, I see. I tried adding the types in the apollo/util.py file as you suggested, and it worked.

It loaded all the features into the annotation track, but some of them are displayed with an exclamation mark.

[Screenshot: tela1]

I'm not sure why this happened.

These are the modifications I made in the apollo/util.py file.

[Screenshot: tela2]

[Screenshot: tela3]

Thanks for the help.

abretaud commented 2 months ago

The exclamation marks only indicate non-canonical splice sites: they are just a visual warning for curators in case they want to carefully check the splice site position.