Open ValWood opened 3 months ago
cc @kimrutherford
Noting that this should be ComplexPortal:CPX-566
(as http://noctua-amigo.berkeleybop.org/amigo/term/ComplexPortal:CPX-566 in our side instance)
I am able to find this term in at least some locations in the Noctua interface.
So I should always put ComplexPortal:CPX-566? I find other complex IDs without the prefix, and we don't need the prefix to locate other entities?
cc @PCarme
@ValWood We can dig into this a little when @vanaukenk is back, but it may be that the difference is what is supplied in the synonyms, etc.
Are there any docs for how to specify complexes in GPI? @kimrutherford can check that we are doing it correctly. No hurry until @vanaukenk is back.
Hi @vanaukenk can you let us know how complexes should be specified in the GPAD so we can check that we are doing it correctly? Thanks, val
@vanaukenk I have just got our devs to add complexes to our gpi (not in production yet) based on SGD's gpi and would like to check that the file is spec'd correctly as well.
@ValWood @hattrill
There are some issues surrounding use of ComplexPortal ids in GO-CAMs that need to be definitively resolved. I propose that we use next week's Pathways2GO and GO-CAM call time slots to focus on that and then we will know better what to do wrt the gpi file.
Are you both available next Thursday?
@vanaukenk that is good for me. Thanks
Hi @vanaukenk can you let us know how complexes should be specified in the GPAD so we can check that we are doing it correctly?
Is this helpful?
https://geneontology.org/docs/gene-product-information-gpi-format/
We can add a protein complex example.
Thanks Pascale.
We've been using the GPI 2.0 spec: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md
Perhaps that's a problem?
We're putting the complex members in column 9 ("Protein_Containing_Complex_Members"), following the spec. The spec says Prefix ':' Local_ID separated by pipes. Prefix in our case is "PomBase". The example in the spec has UniProt IDs. Maybe that causes a problem?
This is an example of column 9 from our GPI file:
PomBase:SPAC29E6.08|PomBase:SPBC13E7.10c|PomBase:SPCC1919.14c
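A quick way to sanity-check column 9 before submitting the file is sketched below. It assumes only what the spec text quoted above says (pipe-separated Prefix ':' Local_ID entries); the regular expression and the function name `check_complex_members` are illustrative and not part of the spec.

```python
import re

# Assumed shape of one entry: Prefix ':' Local_ID, with no whitespace or pipes.
# The exact characters the spec allows in each part are not reproduced here.
CURIE = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:[^\s|]+$")

def check_complex_members(column9: str) -> list[str]:
    """Return the entries of a pipe-separated column 9 value that do not look like Prefix:Local_ID."""
    if not column9:
        return []  # an empty column has nothing to check
    return [entry for entry in column9.split("|") if not CURIE.match(entry)]

# The PomBase example above passes; an entry without a prefix is reported.
print(check_complex_members("PomBase:SPAC29E6.08|PomBase:SPBC13E7.10c|PomBase:SPCC1919.14c"))
print(check_complex_members("PomBase:SPAC29E6.08|SPBC13E7.10c"))
```

This only checks the shape of the entries, not whether the prefix is one the loader recognizes.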
Noting here there are still some issues with SGD's complexes:
https://release.geneontology.org/2024-09-08/annotations/sgd.gpi.gz has
SGD:S000218003 CPX-1852 RPD3L histone deacetylase complex|RPD3L complex|RPD3(L)|RPD3/SIN3 large histone deacetylase complex|3.5.1.98|8GA8|EMD-29892|8HPO|EMD-34935 protein_complex taxon:559292
https://release.geneontology.org/2024-09-08/products/upstream_and_raw_data/sgd-src.gpi.gz has:
SGD:S000218003 ASH1:CTI6:DEP1:PHO23:RPD3:RXT2:RXT3:SAP30:SDS3:SIN3:UME1:UME6 RPD3L histone deacetylase complex GO:0032991 taxon:559292 S000005274|S000005274|S000001346|S000001346|S000001346|S000005364|S000005364|S000005364|S000005364|S000000299|S000000299|S000000299|S000000299|S000000299|S000006060|S000006102|S000004876|S000004876|S000005041|S000005041|S000005041|S000002234|S000002234|S000000011|S000000011|S000000011 ComplexPortal:CPX-1852
To find this complex in Noctua, the only current way is to enter S000218003 or SGD:S000218003 in the Term box, where the entity pops up as ASH1CTI6DEP1PHO23RPD3RXT2RXT3SAP30SDS3SIN3UME1UME6 Scer. Curators expect CPX-1852 to work, but it doesn't, although I've found that searching for ASH in the Term box also works. That's not an obvious name, though, and it isn't in the GPI provided by SGD.
SGD is modifying the supplied GPI and the next available GPI from SGD will look more like the /annotations/sgd.gpi :
SGD:S000218003 CPX-1852 RPD3L histone deacetylase complex GO:0032991 taxon:559292 S000005274|S000005274|S000001346|S000001346|S000001346|S000005364|S000005364|S000005364|S000005364|S000000299|S000000299|S000000299|S000000299|S000000299|S000006060|S000006102|S000004876|S000004876|S000005041|S000005041|S000005041|S000002234|S000002234|S000000011|S000000011|S000000011 ComplexPortal:ASH1:CTI6:DEP1:PHO23:RPD3:RXT2:RXT3:SAP30:SDS3:SIN3:UME1:UME6
Strongly related ticket https://github.com/geneontology/noctua/issues/914
Update: I have been able to locate complexes only if I omit the hyphen from the identifier. So, if I search for "CPX 566" instead of the actual ID "CPX-566" I find it???
...but the has_parts are not automatically imported
I'm looking into this some more today.
So far, what I find when searching in the gene product field is:
CPX-566 does return the right complex, but it is very far down on the autocomplete selection list, i.e. the 40th entity listed
CPX 566 floats the entry to the top of the list
ComplexPortal:CPX-566 floats the entry to the top of the list
CPX-566 SPom has the entry second in the list
The search behavior is the same in the VPE as well as the standard annotation editor. I'll ask @tmushayahama about the search criteria to see if there's anything we can do to bump the right entry to the top of the search list when using CPX-566, as I am assuming that's the entry you'd most likely make? @ValWood
@suzialeksander I'm still looking into the SGD issues, as I can't find the SGD complexes in noctua-amigo, suggesting that this is a different problem.
@suzialeksander
I've been looking into the SGD gpi and protein complexes and honestly don't understand what's happening here. I see the exact same behavior you see.
I'll need some help troubleshooting from @kltm and @tmushayahama
...but the has_parts are not automatically imported
@ValWood - we haven't done any work yet to implement this functionality, but are aware it would be very helpful.
if there's anything we can do to bump the right entry to the top of the search list when using CPX-566, as I am assuming that's the entry you'd most likely make?
Yes. It's strange that IDs with spaces take priority over the correct identifier. As far as I'm aware, identifiers never have spaces?
I wanted to clarify a little about what is going on here wrt CPX-566. I'm not justifying it or saying it's good--issues with the autocomplete are well known and numerous (https://github.com/geneontology/amigo/issues/131 https://github.com/geneontology/amigo/issues/120 https://github.com/geneontology/amigo/issues/102), but I wanted to give context for the mechanisms here.
I'd have to look into the exact math to be sure, but essentially, when looking for http://noctua-amigo.berkeleybop.org/amigo/term/ComplexPortal:CPX-566 , there are a few ways to get at it.
If we look at the general index search on the noctua autocomplete AmiGO instance (http://noctua-amigo.berkeleybop.org, upper-right):
ComplexPortal:CPX-566: first result
CPX-566: first result
CPX 566: second result, with first result being "understandable"
"CPX-566": first and only result
CPX-566 Spom: first result
If we look at the "Filter by Term" "ontology" search on the Noctua landing page:
ComplexPortal:CPX-566: first result
CPX-566: hard to find
CPX 566: first result
"CPX-566": first and only result
CPX-566 Spom: first result
First, to reiterate, this should not be an issue and we would like to prioritize fully fixing our search so that we don't need to have these conversations. That aside, for context for what we're seeing here today:
The two indexes here treat a couple of things a little differently, which is why we get the different results. What is likely happening in the second case (the one used by the Noctua interface) is that when the CPX-566 string is read in, the dash is removed and the index looks for things that have "CPX"-ness or "566"-ness. Having a lot of "CPX"-ness outweighs having a little "566"-ness, so the desired result here gets lost. When the string CPX 566 is used, the string is read in and it understands that there needs to be "CPX"-ness and "566"-ness; there is only one thing that fits both of those criteria best and the desired result gets returned.
Technically speaking, there are things one can do in a case like this to ensure better results (e.g. when there is a dash also search for the quoted string or something), but we will need to weigh the effort needed to make and tune that versus the effort to just "start over" on the autocomplete with a newer and more robust system.
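To make the "also search for the quoted string" idea concrete, here is a minimal sketch of a query-rewriting step. It is not how Noctua or AmiGO currently behave; the function name `expand_query`, the identifier pattern, and the idea of returning several query variants for the caller to try in order are all assumptions for illustration.

```python
import re

# Hypothetical pattern for "looks like a dashed identifier", e.g. CPX-566,
# optionally with a prefix such as ComplexPortal:. Real ID shapes may differ.
DASHED_ID = re.compile(r"^(?:[A-Za-z][\w.]*:)?[A-Za-z]+-\d+$")

def expand_query(user_input: str) -> list[str]:
    """Return query variants to try, most specific first."""
    text = user_input.strip()
    if DASHED_ID.match(text):
        # Quote the string so the dash is not treated like whitespace,
        # then fall back to the raw input.
        return [f'"{text}"', text]
    # Anything else (free text, names, IDs plus a species tag) is passed through.
    return [text]

print(expand_query("CPX-566"))
print(expand_query("CPX 566"))
```

In the searches listed above, the quoted form returned the complex as the first and only result in both indexes, which is why it is tried first here.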
EDIT:
Noting that we have a redo NEO pipeline (https://github.com/geneontology/project-management/issues/52) and some notes on redoing AmiGO, it might be worth it to spec out redoing NEO and Noctua autocomplete as a separate standalone project that could be almost a drop-in replacement, then use that to inform future AmiGO and GO API work (or feed it into the GO API first).
Just to say, shouldn't gene product searches always be exact matches, exactly as the user typed them (i.e. no 'fuzziness')? v
@ValWood Again, I'm not talking about what should be--I think we all agree on that--just clarifying the mechanics of what is now for anybody diving into this. A fix can be applied either in the backend or frontend, with the immediate issue being around the mishandling of the dash in the identifier (which is essentially being treated like whitespace in this case). Special-case coding could likely be added to fix this edge case, but it might be worth weighing that against longer-term fixes and other fixes that are being queued up for Noctua.
I can't find this complex in Noctua:
ComplexAc: CPX-566
Even though it has been in our GPI since 2024-05-04
Can you let us know if we are doing anything wrong in the file, or if it is a Noctua loading issue? I tried both the "activity unit" and the "protein complex" entity annotons.