galaxy-genome-annotation / python-apollo

Python library for talking to Apollo API
MIT License

Bug in load_gff3 #31

Closed abretaud closed 4 years ago

abretaud commented 4 years ago

@nathandunn it looks like there is a problem with the new load_gff3 method, could you look into it?

I get this error when trying to export CDS or peptide fasta:

java.lang.NullPointerException: Cannot get property 'fmin' on null object
    at org.codehaus.groovy.runtime.NullObject.getProperty(NullObject.java:60)
    at org.codehaus.groovy.runtime.InvokerHelper.getProperty(InvokerHelper.java:172)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callGroovyObjectGetProperty(AbstractCallSite.java:302)
    at org.bbop.apollo.FeatureService.$tt__getSequenceAlterationsForFeature(FeatureService.groovy:2933)
    at org.bbop.apollo.FeatureService$_getSequenceAlterationsForFeature_closure73.doCall(FeatureService.groovy)

And when I export the GFF, CDS features are missing (which explains the error above, I guess). After loading Merlin.gff, the exported GFF should include the CDS features, but instead it looks like this:

##gff-version 3
##sequence-region Merlin 1 172788
Merlin  .   gene    2   691 .   +   .   owner=admin@local.host;gene_product=;ID=d5bc7673-ba97-4e27-a173-30a84f8cab32;date_last_modified=2020-06-08;Name=Merlin_1;date_creation=2020-06-08
Merlin  .   mRNA    2   691 .   +   .   owner=admin@local.host;Parent=d5bc7673-ba97-4e27-a173-30a84f8cab32;gene_product=;ID=353b6386-45c9-47ac-bb1b-6ea3ff2c9a80;date_last_modified=2020-06-08;Name=Merlin_1-00001;date_creation=2020-06-08
Merlin  .   exon    2   691 .   +   .   Parent=353b6386-45c9-47ac-bb1b-6ea3ff2c9a80;ID=12ca2f58-6beb-4a24-9016-793ea6e842fb;Name=12ca2f58-6beb-4a24-9016-793ea6e842fb
###
Merlin  .   gene    752 1039    .   +   .   owner=admin@local.host;gene_product=;ID=70598027-a4c6-44d2-97d9-543a38430ae6;date_last_modified=2020-06-08;Name=Merlin_2;date_creation=2020-06-08
Merlin  .   mRNA    752 1039    .   +   .   owner=admin@local.host;Parent=70598027-a4c6-44d2-97d9-543a38430ae6;gene_product=;ID=9404c8e3-5b91-4a4d-9a07-cfdfab208b41;date_last_modified=2020-06-08;Name=Merlin_2-00001;date_creation=2020-06-08
Merlin  .   exon    752 1039    .   +   .   Parent=9404c8e3-5b91-4a4d-9a07-cfdfab208b41;ID=c6e83e82-29f8-44d2-bccb-ad7984cfa27e;Name=c6e83e82-29f8-44d2-bccb-ad7984cfa27e
###
Merlin  .   gene    1067    2011    .   -   .   owner=admin@local.host;gene_product=;ID=3a66c7d4-f338-4e89-b531-6e829543b234;date_last_modified=2020-06-08;Name=Merlin_3;date_creation=2020-06-08
Merlin  .   mRNA    1067    2011    .   -   .   owner=admin@local.host;Parent=3a66c7d4-f338-4e89-b531-6e829543b234;gene_product=;ID=d3738213-2fde-4319-938a-65a3fcaa84f3;date_last_modified=2020-06-08;Name=Merlin_3-00001;date_creation=2020-06-08
Merlin  .   exon    1067    2011    .   -   .   Parent=d3738213-2fde-4319-938a-65a3fcaa84f3;ID=dab8e05c-aad6-4f36-a300-c035da9e4d4f;Name=dab8e05c-aad6-4f36-a300-c035da9e4d4f
###
Merlin  .   gene    2011    3066    .   -   .   owner=admin@local.host;gene_product=;ID=223a58ae-2377-4169-b41d-7f7496c7ee09;date_last_modified=2020-06-08;Name=Merlin_4;date_creation=2020-06-08
Merlin  .   mRNA    2011    3066    .   -   .   owner=admin@local.host;Parent=223a58ae-2377-4169-b41d-7f7496c7ee09;gene_product=;ID=1b48f7a9-8a94-47df-8ff4-3424d92fb528;date_last_modified=2020-06-08;Name=Merlin_4-00001;date_creation=2020-06-08
Merlin  .   exon    2011    3066    .   -   .   Parent=1b48f7a9-8a94-47df-8ff4-3424d92fb528;ID=8d272442-cdb5-4961-8607-3a920943c64b;Name=8d272442-cdb5-4961-8607-3a920943c64b
###
Merlin  .   gene    3066    4796    .   -   .   owner=admin@local.host;gene_product=;ID=543eb53b-f898-432b-a114-bf89b1c2be83;date_last_modified=2020-06-08;Name=multiexongene;date_creation=2020-06-08
Merlin  .   mRNA    3066    4796    .   -   .   owner=admin@local.host;Parent=543eb53b-f898-432b-a114-bf89b1c2be83;gene_product=;ID=a5ae1888-1a81-4420-b7b1-daac63caf653;date_last_modified=2020-06-08;Name=multiexongene-00001;date_creation=2020-06-08
Merlin  .   exon    3066    4296    .   -   .   Parent=a5ae1888-1a81-4420-b7b1-daac63caf653;ID=858cf3c7-2444-4dfe-8661-0fb8a9c1a933;Name=858cf3c7-2444-4dfe-8661-0fb8a9c1a933
Merlin  .   non_canonical_five_prime_splice_site    4364    4364    .   -   .   Parent=a5ae1888-1a81-4420-b7b1-daac63caf653;ID=a5ae1888-1a81-4420-b7b1-daac63caf653-non_canonical_five_prime_splice_site-4363;Name=a5ae1888-1a81-4420-b7b1-daac63caf653-non_canonical_five_prime_splice_site-4363
Merlin  .   exon    4366    4796    .   -   .   Parent=a5ae1888-1a81-4420-b7b1-daac63caf653;ID=e3a25898-5be0-4925-af72-c6eb7cd2eeb4;Name=e3a25898-5be0-4925-af72-c6eb7cd2eeb4
Merlin  .   non_canonical_three_prime_splice_site   4297    4297    .   -   .   Parent=a5ae1888-1a81-4420-b7b1-daac63caf653;ID=a5ae1888-1a81-4420-b7b1-daac63caf653-non_canonical_three_prime_splice_site-4296;Name=a5ae1888-1a81-4420-b7b1-daac63caf653-non_canonical_three_prime_splice_site-4296
###
Merlin  .   gene    5011    6066    .   -   .   owner=admin@local.host;gene_product=;ID=5b5f8d69-06bd-4f5a-94da-61bf43263c05;date_last_modified=2020-06-08;Name=cds-not-under-exon;date_creation=2020-06-08
Merlin  .   mRNA    5011    6066    .   -   .   owner=admin@local.host;Parent=5b5f8d69-06bd-4f5a-94da-61bf43263c05;gene_product=;ID=22e127db-94ed-4b4a-9d4a-6b8adfb284e3;date_last_modified=2020-06-08;Name=cds-not-under-exon-00001;date_creation=2020-06-08
Merlin  .   exon    5011    6066    .   -   .   Parent=22e127db-94ed-4b4a-9d4a-6b8adfb284e3;ID=25e7de9e-850d-4194-8972-33b86d79e4b1;Name=25e7de9e-850d-4194-8972-33b86d79e4b1
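A quick way to confirm the symptom on an export like the one above is to list the mRNAs that have no CDS child. This is only a minimal sketch using the standard library: the helper names `parse_attributes` and `mrnas_without_cds` are hypothetical and not part of python-apollo, and the parser splits on any whitespace because the paste above is not tab-separated.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def mrnas_without_cds(gff_text):
    """Return the IDs of mRNA features that have no CDS child (hypothetical helper)."""
    mrna_ids = []
    child_types = defaultdict(set)  # parent ID -> feature types of its children
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and "###" separators
        cols = line.split(None, 8)  # 8 splits keep the attribute column whole
        if len(cols) < 9:
            continue  # not a complete feature line
        feature_type = cols[2]
        attrs = parse_attributes(cols[8])
        if feature_type == "mRNA":
            mrna_ids.append(attrs.get("ID"))
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                child_types[parent].add(feature_type)
    return [mrna for mrna in mrna_ids if "CDS" not in child_types[mrna]]
```

Run against the export above, every mRNA ID would be reported, since no CDS line is present.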

I've added a few tests for data export in GFF/VCF/FASTA formats; some of them are failing due to this bug.

nathandunn commented 4 years ago

@abretaud I'm taking a look now.

nathandunn commented 4 years ago

I see what I did wrong. If a gene has an mRNA subtype, then it should be processed as an mRNA:

https://github.com/GMOD/Apollo/blob/develop/tools/data/add_features_from_gff3_to_annotations.pl#L629-L655

I was not doing that check. I think that will fix it, but we'll see.
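Roughly, the missing check amounts to something like the following. This is a sketch of the idea only, not the actual python-apollo or Perl code: `Feature` is a hypothetical stand-in for a parsed GFF3 feature, and `process_as_mrna` is an illustrative name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """Hypothetical stand-in for a parsed GFF3 feature and its children."""
    type: str
    sub_features: List["Feature"] = field(default_factory=list)


def process_as_mrna(feature):
    """Return True if the feature should go through the mRNA (transcript) path.

    That is the case for an mRNA itself, and for a gene that has at
    least one mRNA among its direct sub-features.
    """
    if feature.type == "mRNA":
        return True
    if feature.type == "gene":
        return any(sub.type == "mRNA" for sub in feature.sub_features)
    return False
```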

nathandunn commented 4 years ago

It also isn't sending the exon, which is a larger problem. Compare the top request (no children) with the bottom one (children included):

addTranscript {"suppressEvents":false,"features":[{"name":"Merlin_1_mRNA","location":{"strand":1,"fmin":1,"fmax":691},"type":{"cv":{"name":"sequence"},"name":"mRNA"}}],"sequence":"Merlin","password":"password","organism":"test_cds","clientToken":"ignore","suppressHistory":false,"username":"admin@local.host"}
addTranscript {"features":[{"children":[{"children":[{"location":{"strand":1,"fmin":1,"fmax":691},"type":{"cv":{"name":"sequence"},"name":"CDS"}},{"location":{"strand":1,"fmin":1,"fmax":691},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"Merlin_1_CDS"}],"location":{"strand":1,"fmin":1,"fmax":691},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"Merlin_1_exon"}],"name":"Merlin_1_mRNA","location":{"strand":1,"fmin":1,"fmax":691},"type":{"cv":{"name":"sequence"},"name":"mRNA"},"orig_id":"Merlin_1_mRNA"}],"clientToken":"14540178701615202211144956811","track":"Merlin","operation":"add_transcript","username":"admin@local.host"}

nathandunn commented 4 years ago

Working: [image]

Current: [image]

It is not adding children, so I will have to fix this.
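For illustration, the nested shape of the second addTranscript request above (location, type, and children at each level) could be built recursively like this. It is a sketch under assumptions: the input dict layout and the name `to_apollo_feature` are hypothetical, `fmin = start - 1` is inferred from the logged request (GFF start 2 becomes fmin 1), and the strand values for "-" and unstranded features are assumed rather than taken from the log.

```python
# Assumed mapping; only "+" -> 1 is visible in the logged requests.
STRAND = {"+": 1, "-": -1}


def to_apollo_feature(node):
    """Convert a nested feature dict into an addTranscript-style feature.

    node is a hypothetical dict with keys "type", "start", "end", "strand"
    and optionally "name" and "children" (a list of dicts of the same shape).
    """
    payload = {
        "location": {
            "strand": STRAND.get(node["strand"], 0),
            "fmin": node["start"] - 1,  # GFF3 start is 1-based, fmin is 0-based
            "fmax": node["end"],
        },
        "type": {"cv": {"name": "sequence"}, "name": node["type"]},
    }
    if node.get("name"):
        payload["name"] = node["name"]
    # Recurse so exons and CDS travel with their transcript.
    children = [to_apollo_feature(child) for child in node.get("children", [])]
    if children:
        payload["children"] = children
    return payload
```

The point of the recursion is that a transcript is never serialized without the exon and CDS features beneath it.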

nathandunn commented 4 years ago

The problem is that it is checking feature by feature instead of the whole thing as a group (i.e., it should be using rec instead of rec.features). This should also be more efficient.
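To picture the difference: handled one at a time, an mRNA line knows nothing about its exon or CDS, whereas grouping by Parent first yields one tree per gene. The sketch below uses plain dicts rather than the real parser objects; `build_feature_trees` and the row layout are hypothetical.

```python
def build_feature_trees(rows):
    """Group flat GFF3-like rows into parent/child trees.

    rows is a hypothetical list of dicts with an "id", a "type", and an
    optional "parent" (the ID of the parent row). Returns the root nodes,
    each carrying its descendants under "children".
    """
    # Copy each row so the input is not mutated, and give it a children list.
    nodes = {row["id"]: {**row, "children": []} for row in rows}
    roots = []
    for row in rows:
        node = nodes[row["id"]]
        parent = nodes.get(row.get("parent"))
        if parent is None:
            roots.append(node)  # no known parent: top-level feature
        else:
            parent["children"].append(node)
    return roots
```

With the whole group in hand, one request per transcript can carry all of its children, instead of one request per line.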

nathandunn commented 4 years ago

So, the problem is that it was trying to write out only a single line of GFF3 at a time. I'll look back through the code, but it might have been serendipity that we didn't run into this sooner.

Anyway, if you look at #33 you can see the start of my fixes.

abretaud commented 4 years ago

Fixed in #33