load_gff3 miscalculates CDS

mpoelchau commented 1 year ago

We've been having trouble with load_gff3 - I initially posted this on the Apollo repo (https://github.com/GMOD/Apollo/issues/2662), reposting it here now. Thanks for considering this, I'd appreciate any pointers!

We are trying to use the python-apollo arrow annotations load_gff3 command to load annotations to the user-created annotations track. It is changing the CDS locations of the model, both with and without the --disable_cds_recalculation option.

Here is what a load without --disable_cds_recalculation looks like; the correct frame can be seen in the track below.

The gff3 that was used to load the annotation has 6 CDS lines; the gff3 for the uploaded annotation has 12 CDS lines (even though the view shows only one CDS segment). Apollo also won't calculate a protein or CDS sequence on the uploaded annotation.

Here is what a load with --disable_cds_recalculation looks like (command: arrow annotations load_gff3 --source https://apollo2-stage-node1-cbo.nal.usda.gov/apollo Anoplophora_glabripennis ~/Downloads/NW_019416298.gff3 --disable_cds_recalculation)

Again, the gff3 for the uploaded annotation in Apollo has 12 CDS lines instead of 6. Apollo also won't calculate a protein or CDS sequence on the uploaded annotation. I'll note that if you run the same command multiple times, the single CDS will display in a different spot each time.

If you load the annotation by dragging it up, it loads correctly:

This is happening for many (but not all) annotations in multiple assemblies/organisms.

Some other observations:

We haven't observed the problem for models with a single CDS/exon segment
The underlying genomic sequence has lowercase nucleotides
I tried using load_legacy_gff3, and that calculated the CDS correctly, but I'm unable to delete features when I load them with that method (Hibernate operation: could not execute statement; SQL [n/a]; ERROR: update or delete on table "feature" violates foreign key constraint "fk_8jm56covt0m7m0m191bc5jseh" on table "feature_relationship" Detail: Key (id)=(4858111) is still referenced from table "feature_relationship".; nested exception is org.postgresql.util.PSQLException: ERROR: update or delete on table "feature" violates foreign key constraint "fk_8jm56covt0m7m0m191bc5jseh" on table "feature_relationship" Detail: Key (id)=(4858111) is still referenced from table "feature_relationship".)

I've attached "before" and "after" gff3s. (Used .txt extension because GitHub wouldn't let me upload otherswise) before.txt after-nocdsrecalc.txt after.txt

Provide the javascript console log output generated from the action. None.
Provide the server log output generated from the action (typically catalina.out). nothing is added to Catalina.out when I add the annotations.

mpoelchau commented 12 months ago

I have a few more observations on this, I have no idea if they're expected/normal. I looked into the json that load_gff3 generates as output:

an additional CDS stanza is generated with start+end coordinates that encompass all CDS subfeatures
even when disable_CDS_recalculation is used, no phase information is included in the json output

json output attached from following command: arrow annotations load_gff3 --disable_cds_recalculation Anoplophora_glabripennis before.txt after.json

Happy to keep on sleuthing in another direction if needed... Thanks for your help with this!

MonicaPoelchau-USDA commented 12 months ago

Update - after looking at the gffs of a number of models after uploading, it looks like duplicate CDS features are being created. There are always 2x the number of CDS lines in the gff after loading, whether using --disable_CDS_recalculation or not.

I'm 99% certain it's not a problem with the gffs that I am trying to load - I've tried gffs from multiple sources, and our gff3_QC program doesn't detect any problems.

MonicaPoelchau-USDA commented 12 months ago

And another update - I was able to reproduce this on the Apollo demo site (human, region chr4:111819946..111848259 (28.31 Kb)). Input gff attached. arrow annotations load_gff3 --disable_cds_recalculation Human-Hg38 testhumandemo2.gff testhumandemo2.txt

hexylena commented 12 months ago

Hi @MonicaPoelchau-USDA , I'll try and find some time this week to look into it.

~~Do you have a single gene model you could share, that I could use as a test case?~~ Edit: yes you already attached that. Thank you!

MonicaPoelchau-USDA commented 12 months ago

Thanks for looking into this, @hexylena ! Let me know if there's anything else I can test/provide.

hexylena commented 10 months ago

Ok, investigating today (sorry again @MonicaPoelchau-USDA) I can reproduce the issue entirely.

I'm seeing zero functional differences with and without the no-cds-recalc, the files are functionally identical (module UUIDs), so that code path can at least be ignored for now: arrow sends the exact same data to apollo's addTranscript API.

As for the generated transcripts I'm getting back duplicated CDSs as well, only the addition of a number of non-canonical start sites (expected, i'm testing on an ecoli copy.)

I can likewise confirm the inability to delete when loading via legacy, that's really ... not great. It would require apollo changes to correct (or in your case maybe manually deleting it from the DB is an option.) I've tried deleting the CDSs one by one and got almost to the last one which now is just stuck there permanently lol.

I can confirm fetching the sequence does nothing, this is due to a 500 internal server error on the apollo side. Genomic and cdna both return sequences, but cds/peptide fail.

Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: 2024-01-26 14:29:41,353 [http-bio-8080-exec-45] ERROR apollo.FeatureRelationshipService  - More than one child feature relationships found for Feature{id=5502660, symbol='null', description='null', name='Mitotic spindle assembly checkpoint protein MAD1-00001', uniqueName='a45e77e6-673d-4ee4-a477-8126637a9426', sequenceLength=null, status=null, dateCreated=2024-01-26 14:21:33.563, lastUpdated=2024-01-26 14:21:33.992} and ID SO:0000316
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: 2024-01-26 14:29:41,360 [http-bio-8080-exec-45] ERROR errors.GrailsExceptionResolver  - NullPointerException occurred when processing request: [POST] /apollo/778475583185271421178966132/AnnotationEditorService
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: Cannot get property 'fmin' on null object. Stacktrace follows:
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: java.lang.NullPointerException: Cannot get property 'fmin' on null object
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.FeatureService.$tt__getSequenceAlterationsForFeature(FeatureService.groovy:2938)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.FeatureService.$tt__getResiduesWithAlterationsAndFrameshifts(FeatureService.groovy:1063)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.SequenceService.$tt__getSequenceForFeature(SequenceService.groovy:524)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.SequenceService.$tt__getSequenceForFeatures(SequenceService.groovy:654)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.AnnotationEditorController.getSequence(AnnotationEditorController.groovy:1054)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:198)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.grails.plugins.web.rest.api.ControllersRestApi.forward(ControllersRestApi.groovy:53)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.bbop.apollo.AnnotationEditorController.handleOperation(AnnotationEditorController.groovy:72)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:198)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:449)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:383)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at com.brandseye.cors.CorsFilter.doFilter(CorsFilter.java:82)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Jan 26 14:29:41 apollo.internal.galaxyproject.eu server[898]: at java.lang.Thread.run(Thread.java:750)

Which is seen in the apollo codebase here. https://github.com/GMOD/Apollo/blob/5985ab8d98cc3b98625101f649c9f300efb5b555/grails-app/services/org/bbop/apollo/FeatureRelationshipService.groovy#L30

I, don't understand why this would be an error, having more than 1 child feature. Your gene model is gff3 spec compliant, we're sending the correct data to apollo (as evidenced by what it'll show), but there's still something so unusual about this feature for Apollo.

Uploading the GFF3 and manually creating the annotation works perfectly, but that's not what we want. Looking at the uploaded feature in JBrowse's view, it looks right, 6 exons and 6 CDSs.

Investigating the websocket call that's made when I drag and drop this feature onto the annotation track is fascinating, it's very different!

At the top level an mRNA is sent, along with the following:

14:43:33 w-galaxy:[~/arbeit/genomics/python-apollo @master‽]$ cat tmp2 | jq .features[0].children[] -c
{"location":{"fmin":1178316,"fmax":1179335,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"8f46523a-680c-4f42-ae8c-45b61dc37928"}
{"location":{"fmin":1179393,"fmax":1179530,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"4d5bc38f-c85b-4ecd-bd42-1f2a036c90d6"}
{"location":{"fmin":1179585,"fmax":1179768,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"834ee674-1dd2-405a-948f-96140f2c2502"}
{"location":{"fmin":1179824,"fmax":1180016,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"046e9b0f-0b2f-45b4-98ff-4b11482e073e"}
{"location":{"fmin":1180209,"fmax":1180546,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"888dc728-2f2a-42e8-8a12-ad298b5189fe"}
{"location":{"fmin":1183892,"fmax":1184805,"strand":1},"type":{"cv":{"name":"sequence"},"name":"exon"},"orig_id":"faf5e3e9-f1de-4675-8c1c-b73e8f8a94bf"}
{"location":{"fmin":1178671,"fmax":1184578,"strand":1},"type":{"cv":{"name":"sequence"},"name":"CDS"}}

Notably it's only exons except for where the CDS doesn't match the parent exon.

So that looks like we're calling the addTranscript API incorrectly? That doesn't seem right, but, perhaps.

ok, so, deleting any CDSs which match their parent exons completely, before sending produces the thing that looks the closest to what was copied from the native jbrowse track (MAD1f, the last track)

But that still fails to get the correct sequence from Apollo.

The thing that finally worked was... dropping ALL cds features from the source.

I think in this case we're getting "lucky" that there weren't any manual modifications made to the CDS regions and apollo's algorithm works, but, yikes that's not good. Deleting all CDSs from the source gff3 file does not work (weirdly), instead I need this patch to get the loading to work at all:

diff --git a/apollo/util.py b/apollo/util.py
index 802625b..11529cd 100644
--- a/apollo/util.py
+++ b/apollo/util.py
@@ -121,6 +121,28 @@ def _yieldGeneData(gene, disable_cds_recalculation=False, use_name=False):
     # # TODO: handle description
     # # TODO: handle GO, Gene Product, Provenance

+    def __floc(location):
+        return f"{location['fmin']}-{location['fmax']}-{location['strand']}"
+
+    for child1 in current['children']:
+        exon_regions = []
+        for child1 in current['children']:
+            for child in child1['children']:
+                print(child)
+                if child['type']['name'] == 'exon': 
+                    exon_regions.append(__floc(child['location']))
+        new_current_children = []
+        for child in child1['children']:
+            if child['type']['name'] == 'CDS':
+                continue
+                nnn = __floc(child['location'])
+                if nnn not in exon_regions:
+                    new_current_children.append(child)
+            else:
+                new_current_children.append(child)
+        child1['children'] = new_current_children
+        print(exon_regions)
+
     if 'children' in current and gene.type == 'gene':
         # Only sending mRNA level as apollo is more comfortable with orphan mRNAs
         return current['children']

I don't love any of this :upside_down_face: I'm not sure what we could change to make Apollo happier here, short of just removing the CDSs blindly. For whatever reason sending CDSs causes problems

mpoelchau commented 10 months ago

Thanks @hexylena ! I pulled your branch, was able to add the annotations, and they loaded correctly - I was able to retrieve the protein and CDS sequence. I was also able to load them correctly using the main branch and deleting all CDS features from the gff3 file beforehand, as you mentioned. I think this will work for us in most cases since our CDS's should be pretty vanilla (starting in phase 0).

garrettjstevens commented 10 months ago

ok, so, deleting any CDSs which match their parent exons completely, before sending produces the thing that looks the closest to what was copied from the native jbrowse track (MAD1f, the last track)

I wasn't around when this was written, but this is my best guess of what's going on here:

What Apollo is expecting here for CDS is a single feature that spans the whole coding region. In the sample GFF3 file, all the CDSs have the same ID in column 9, and I think Apollo similarly considers a CDS a single feature, but just keeps track of the start of the first CDS section and the end of the last CDS section since the rest can be inferred from the exon locations.

With this patch, I'm able to load the GFF3 correctly, and it contains the CDS info from the file:

diff --git a/apollo/util.py b/apollo/util.py
--- a/apollo/util.py
+++ b/apollo/util.py
@@ -122,6 +122,25 @@ def _yieldGeneData(gene, disable_cds_recalculation=False, use_name=False):
     # # TODO: handle GO, Gene Product, Provenance

     if 'children' in current and gene.type == 'gene':
+        for mRNA in current['children']:
+            new_mRNA_children = []
+            new_cds = None
+            for feature in mRNA['children']:
+                if feature['type']['name'] == 'CDS':
+                    if new_cds:
+                        new_cds_start = new_cds['location']['fmin']
+                        new_cds_end = new_cds['location']['fmax']
+                        this_cds_start = feature['location']['fmin']
+                        this_cds_end = feature['location']['fmax']
+                        new_cds['location']['fmin'] = min(new_cds_start, this_cds_start)
+                        new_cds['location']['fmax'] = max(new_cds_end, this_cds_end)
+                    else:
+                        new_cds = feature
+                else:
+                    new_mRNA_children.append(feature)
+            if new_cds:
+                mRNA['children'] = new_mRNA_children
+                mRNA['children'].append(new_cds)
         # Only sending mRNA level as apollo is more comfortable with orphan mRNAs
         return current['children']
     else:

hexylena commented 10 months ago

Oh wow @garrettjstevens thanks for looking into it as well!

What Apollo is expecting here for CDS is a single feature that spans the whole coding region.

that is a fascinating interpretation of the spec.

With this patch

Ok, we'll merge that in behind a flag then!

aathbt commented 9 months ago

I just encountered the same problem. The fixes you wrote allowed me to properly import the gff, but I now have several exons/transcripts I cannot delete (I get the same error than @mpoelchau). I'd appreciate any pointers on how to solve this!

mpoelchau commented 9 months ago

As a temporary hack, we have been deleting CDS features from the gff3 before loading them. This is not ideal though (esp if your CDS start and stop don't follow Apollo's assumptions when it recalculates them).

aathbt commented 9 months ago

Alright, thanks. Related question: were you able to delete the user-created annotations that you uploaded via load_legacy_gff3? If so, how? I've tried this using both arrow organisms delete_features and the GUI, but I always end up with the update or delete on table "feature" violates foreign key constraint' error (same than in your first post).

hexylena commented 9 months ago

@aathbt I was not able to remove the undeletable exons. The ebst result I had was by removing the CDSs, then one could get the exon almost deleted, but something was wrong in the database still. I'm not sure it's something we can fix via the API either

galaxy-genome-annotation / python-apollo

load_gff3 miscalculates CDS #60