geneontology / noctua

Graph-based modeling environment for biology, including prototype editor and services
http://noctua.geneontology.org/
BSD 3-Clause "New" or "Revised" License

Question: can't locate complexes specified in our GPI #910

Open ValWood opened 3 months ago

ValWood commented 3 months ago

I can't find this complex in Noctua:

ComplexAc: CPX-566

Even though it has been in our GPI since 2024-05-04

Can you let us know if we are doing anything wrong in the file, or whether this is a Noctua loading issue? I tried both the "activity unit" and the "protein complex" entity annotons.

ValWood commented 3 months ago

cc @kimrutherford

kltm commented 3 months ago

Noting that this should be ComplexPortal:CPX-566 (as http://noctua-amigo.berkeleybop.org/amigo/term/ComplexPortal:CPX-566 in our side instance)

I am able to find this term in at least some locations in the Noctua interface.

ValWood commented 3 months ago

So I should always put ComplexPortal:CPX-566? I find other complex IDs without the prefix, and we don't need the prefix to locate other entities?

cc @PCarme

kltm commented 3 months ago

@ValWood We can dig into this a little when @vanaukenk is back, but it may be that the difference is what is supplied in the synonyms, etc.

ValWood commented 3 months ago

Are there any docs for how to specify complexes in GPI? @kimrutherford can check that we are doing it correctly. No hurry until @vanaukenk is back.

ValWood commented 2 months ago

Hi @vanaukenk can you let us know how complexes should be specified in the GPAD so we can check that we are doing it correctly? Thanks, val

hattrill commented 2 months ago

@vanaukenk I have just got our devs to add complexes to our gpi (not in production yet) based on SGD's gpi and would like to check that the file is spec'd correctly as well.

vanaukenk commented 2 months ago

@ValWood @hattrill

There are some issues surrounding use of ComplexPortal ids in GO-CAMs that need to be definitively resolved. I propose that we use next week's Pathways2GO and GO-CAM call time slots to focus on that and then we will know better what to do wrt the gpi file.

Are you both available next Thursday?

hattrill commented 2 months ago

@vanaukenk that is good for me. Thanks

pgaudet commented 2 months ago

> Hi @vanaukenk can you let us know how complexes should be specified in the GPAD so we can check that we are doing it correctly?

Is this helpful?

https://geneontology.org/docs/gene-product-information-gpi-format/

We can add a protein complex example.

kimrutherford commented 2 months ago

> Is this helpful? https://geneontology.org/docs/gene-product-information-gpi-format/ We can add a protein complex example.

Thanks Pascale.

We've been using the GPI 2.0 spec: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

Perhaps that's a problem?

kimrutherford commented 2 months ago

> We've been using the GPI 2.0 spec:

We're putting the complex members in column 9 ("Protein_Containing_Complex_Members"), following the spec. The spec says Prefix ':' Local_ID separated by pipes. Prefix in our case is "PomBase". The example in the spec has UniProt IDs. Maybe that causes a problem?

This is an example of column 9 from our GPI file:

PomBase:SPAC29E6.08|PomBase:SPBC13E7.10c|PomBase:SPCC1919.14c
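
For anyone checking their own file, here is a minimal sketch of how a column 9 value could be tested against the "Prefix ':' Local_ID separated by pipes" pattern quoted above. The function name `parse_complex_members` and the regular expression are illustrative assumptions; they are not taken from the spec or from any GO tooling, and the real grammar may allow a different character set.

```python
import re

# Assumed, simplified pattern for "Prefix ':' Local_ID". The actual spec may
# be stricter or looser about which characters are allowed.
CURIE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:\S+$")


def parse_complex_members(column9: str) -> list[tuple[str, str]]:
    """Split a pipe-separated column 9 value into (prefix, local_id) pairs.

    Raises ValueError for any item that does not look like Prefix:Local_ID.
    """
    members = []
    for item in column9.split("|"):
        if not CURIE_RE.match(item):
            raise ValueError(f"not a Prefix:Local_ID value: {item!r}")
        prefix, local_id = item.split(":", 1)
        members.append((prefix, local_id))
    return members


# The column 9 example from our GPI file.
members = parse_complex_members(
    "PomBase:SPAC29E6.08|PomBase:SPBC13E7.10c|PomBase:SPCC1919.14c"
)
```

Under these assumptions our column 9 value parses into three `PomBase` members, while a bare ID without a prefix would be rejected.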

suzialeksander commented 2 months ago

Noting here there are still some issues with SGD's complexes:

https://release.geneontology.org/2024-09-08/annotations/sgd.gpi.gz has

SGD:S000218003 CPX-1852 RPD3L histone deacetylase complex|RPD3L complex|RPD3(L)|RPD3/SIN3 large histone deacetylase complex|3.5.1.98|8GA8|EMD-29892|8HPO|EMD-34935 protein_complex taxon:559292

https://release.geneontology.org/2024-09-08/products/upstream_and_raw_data/sgd-src.gpi.gz has:

SGD:S000218003 ASH1:CTI6:DEP1:PHO23:RPD3:RXT2:RXT3:SAP30:SDS3:SIN3:UME1:UME6 RPD3L histone deacetylase complex GO:0032991 taxon:559292 S000005274|S000005274|S000001346|S000001346|S000001346|S000005364|S000005364|S000005364|S000005364|S000000299|S000000299|S000000299|S000000299|S000000299|S000006060|S000006102|S000004876|S000004876|S000005041|S000005041|S000005041|S000002234|S000002234|S000000011|S000000011|S000000011 ComplexPortal:CPX-1852

To find this complex in Noctua, the only current way is to enter S000218003 or SGD:S000218003 in the Term box, where the entity pops up as ASH1CTI6DEP1PHO23RPD3RXT2RXT3SAP30SDS3SIN3UME1UME6 Scer. Curators expect CPX-1852 to work, but it doesn't. I've found that searching for ASH in the Term box also works, but that's not an obvious name and isn't in the GPI provided by SGD.

SGD is modifying the supplied GPI and the next available GPI from SGD will look more like the /annotations/sgd.gpi :

SGD:S000218003 CPX-1852 RPD3L histone deacetylase complex GO:0032991 taxon:559292 S000005274|S000005274|S000001346|S000001346|S000001346|S000005364|S000005364|S000005364|S000005364|S000000299|S000000299|S000000299|S000000299|S000000299|S000006060|S000006102|S000004876|S000004876|S000005041|S000005041|S000005041|S000002234|S000002234|S000000011|S000000011|S000000011 ComplexPortal:ASH1:CTI6:DEP1:PHO23:RPD3:RXT2:RXT3:SAP30:SDS3:SIN3:UME1:UME6

Strongly related ticket https://github.com/geneontology/noctua/issues/914

ValWood commented 1 month ago

Update: I have been able to locate complexes only if I omit the hyphen from the identifier. So, if I search for "CPX 566" instead of the actual ID "CPX-566" I find it???

ValWood commented 1 month ago

...but the has_parts are not automatically imported

vanaukenk commented 1 month ago

I'm looking into this some more today.

So far, what I find when searching in the gene product field is:

- CPX-566 does return the right complex, but it is very far down on the autocomplete selection list, i.e. the 40th entity listed
- CPX 566 floats the entry to the top of the list
- ComplexPortal:CPX-566 floats the entry to the top of the list
- CPX-566 SPom has the entry second in the list

The search behavior is the same in the VPE as well as the standard annotation editor. I'll ask @tmushayahama about the search criteria to see if there's anything we can do to bump the right entry to the top of the search list when using CPX-566, as I am assuming that's the entry you'd most likely make? @ValWood

@suzialeksander I'm still looking into the SGD issues, as I can't find the SGD complexes in noctua-amigo, suggesting that this is a different problem.

vanaukenk commented 1 month ago

@suzialeksander

I've been looking into the SGD gpi and protein complexes and honestly don't understand what's happening here. I see the exact same behavior you see.

I'll need some help troubleshooting from @kltm and @tmushayahama

vanaukenk commented 1 month ago

> ...but the has_parts are not automatically imported

@ValWood - we haven't done any work yet to implement this functionality, but are aware it would be very helpful.

ValWood commented 1 month ago

> if there's anything we can do to bump the right entry to the top of the search list when using CPX-566, as I am assuming that's the entry you'd most likely make?

Yes. It's strange that IDs with spaces take priority over the correct identifier. As far as I'm aware, identifiers never have spaces?

kltm commented 1 month ago

I wanted to clarify a little about what is going on here wrt CPX-566. I'm not justifying it or saying it's good--issues with the autocomplete are well known and numerous (https://github.com/geneontology/amigo/issues/131, https://github.com/geneontology/amigo/issues/120, https://github.com/geneontology/amigo/issues/102)--but I wanted to give context for the mechanisms here.

I'd have to look into the exact math to be sure, but essentially, when looking for http://noctua-amigo.berkeleybop.org/amigo/term/ComplexPortal:CPX-566 , there are a few ways to get at it.

If we look at the general index search on the noctua autocomplete AmiGO instance (http://noctua-amigo.berkeleybop.org, upper-right):

- ComplexPortal:CPX-566: first result
- CPX-566: first result
- CPX 566: second result, with first result being "understandable"
- "CPX-566": first and only result
- CPX-566 Spom: first result

If we look at the "Filter by Term" "ontology" search on the Noctua landing page:

- ComplexPortal:CPX-566: first result
- CPX-566: hard to find
- CPX 566: first result
- "CPX-566": first and only result
- CPX-566 Spom: first result

First, to reiterate, this should not be an issue and we would like to prioritize fully fixing our search so that we don't need to have these conversations. That aside, for context for what we're seeing here today:

The two indexes here treat a couple of things a little differently, which is why we get the different results. What is likely happening in the second case (the one used by the Noctua interface) is that when the CPX-566 string is read in, the dash is removed and the index looks for things that have "CPX"-ness or "566"-ness. Having a lot of "CPX"-ness outweighs having a little "566"-ness, so the desired result here gets lost. When the string CPX 566 is used, the string is read in and it understands that there needs to be "CPX"-ness and "566"-ness; there is only one thing that fits both of those criteria best and the desired result gets returned.
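
To make that concrete, here is a toy model of the two behaviors. It is only a sketch of the general idea, not the actual index or scoring code: the `DOCS` entries other than ComplexPortal:CPX-566 are made up, and the scoring (counting token occurrences) is a deliberate oversimplification.

```python
import re

# Made-up mini index. Only the first key is a real identifier from this
# thread; the EXAMPLE entries are hypothetical and exist to show the effect.
DOCS = {
    "ComplexPortal:CPX-566": "CPX-566 complex",
    "EXAMPLE:1": "CPX CPX CPX related entry",  # a lot of "CPX"-ness
    "EXAMPLE:2": "entry 566",                  # a little "566"-ness
}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_any(query):
    """OR-style search: score is the total count of any query token."""
    query_tokens = tokenize(query)
    scores = {}
    for doc_id, text in DOCS.items():
        doc_tokens = tokenize(text)
        score = sum(doc_tokens.count(token) for token in query_tokens)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: -scores[doc_id])


def rank_all(query):
    """AND-style search: keep only entries containing every query token."""
    query_tokens = set(tokenize(query))
    return [
        doc_id
        for doc_id, text in DOCS.items()
        if query_tokens <= set(tokenize(text))
    ]


or_results = rank_any("CPX-566")
and_results = rank_all("CPX 566")
```

In this toy index the OR-style ranking puts the "CPX"-heavy entry ahead of ComplexPortal:CPX-566, while the AND-style search returns only ComplexPortal:CPX-566.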

Technically speaking, there are things one can do in a case like this to ensure better results (e.g. when there is a dash also search for the quoted string or something), but we will need to weigh the effort needed to make and tune that versus the effort to just "start over" on the autocomplete with a newer and more robust system.
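
As a rough illustration of the "also search for the quoted string" idea, something like the following could run before the query is sent. This is a hypothetical sketch: `expand_query` is not an existing function in Noctua or AmiGO, and how the variants would be combined or weighted by the index is left open.

```python
def expand_query(user_input: str) -> list[str]:
    """Return query variants to try, most specific first.

    If the input contains a dash and is not already quoted, the quoted
    (exact phrase) form is tried before the raw input.
    """
    text = user_input.strip()
    variants = [text]
    already_quoted = (
        len(text) >= 2 and text.startswith('"') and text.endswith('"')
    )
    if "-" in text and not already_quoted:
        variants.insert(0, f'"{text}"')
    return variants


dashed = expand_query("CPX-566")
spaced = expand_query("CPX 566")
```

Inputs without a dash pass through unchanged, so a special case like this would only affect identifiers such as CPX-566.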

EDIT:

Noting that we have a redo NEO pipeline (https://github.com/geneontology/project-management/issues/52) and some notes on redoing AmiGO, it might be worth it to spec out redoing NEO and Noctua autocomplete as a separate standalone project that could be almost a drop-in replacement, then use that to inform future AmiGO and GO API work (or feed it into the GO API first).

ValWood commented 4 weeks ago

Just to say, shouldn't gene product searches always be exact matches, exactly as the user typed them (i.e. no 'fuzziness')? v

kltm commented 4 weeks ago

@ValWood Again, I'm not talking about what should be--I think we all agree on that--just clarifying the mechanics of what is happening now for anybody diving into this. A fix can be applied either in the backend or frontend, with the immediate issue being around the mishandling of the dash in the identifier (which is essentially being treated like whitespace in this case). Special-case coding could likely be added to fix this edge case, but it might be worth weighing that against longer-term fixes and other fixes that are being queued up for Noctua.