Closed vanaukenk closed 3 years ago
Which PRO ID is it? I can take a look. Some of the PRO IDs apply to any histone of a specific type (e.g. H1, because the copies are exact duplicates), and some are specific to the actual entity.
I thought I had only used the specific ones, but it is possible I used a generic one.
PR:000027593 hht2 hht2 h3.2 SO:0001217 NCBITaxon:4896 PomBase:SPBC8D2.04 PomBase:SPBC8D2.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.2
PR:000027593 hht1 hht1 SO:0001217 NCBITaxon:4896 PomBase:SPAC1834.04 PomBase:SPAC1834.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.1
Well, I can fix it to be more specific (I will try to get to it later today if the PRO terms already exist for the modified form). But the generic PRO IDs should really work for this purpose, so we should be able to use them in this way. It seems to be a GO issue rather than an annotation issue.
It's odd that this is the first time the issue has arisen. Other species have many more copies of histones, and it wouldn't really be practical for them to request a PRO ID for each copy every time they have a modified form... do they never use the generic form?
@ValWood I'm not sure how many groups are annotating to modified proteins as their primary annotation objects in Noctua, so that may be why it hasn't arisen yet in this context.
Do you currently use these ids in the Annotation Isoform column of a GAF or as the primary annotation object? Or something else, i.e. annotation extension object?
The primary object is the gene. The isoform is in column 17.
I don't know where this ends up in the GPAD as I haven't been involved in that.
@mah11 or @kimrutherford can let you know if I overlooked anything.
In fact I may not be able to fix this; at least, I don't have enough information above. @vanaukenk can you supply the problematic annotation?
I'm still not sure why it isn't a valid object in column 17, and that's the only place we would use these IDs.
I don't know where these are coming from for GO. In annotation we always used the specific form https://www.pombase.org/gene/SPAC1834.04 https://www.pombase.org/gene/SPBC8D2.04 at least for hht annotation. It is possible this is in a has_input annotation on one of the binding partners...
The examples we ran into in this case seem to be:
PR:000027593 hht2 hht2 h3.2 SO:0001217 NCBITaxon:4896 PomBase:SPBC8D2.04 PomBase:SPBC8D2.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.2
PR:000027593 hht1 hht1 SO:0001217 NCBITaxon:4896 PomBase:SPAC1834.04 PomBase:SPAC1834.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.1
What's choking the system is the same identifier with two different labels.
The problem comes from these two annotations:
PomBase SPAC1834.04 hht1 GO:0005515 PMID:22727667 IPI PomBase:SPBC428.08c F histone H3 h3.1 protein taxon:4896 20210913 PomBase PR:000027593
PomBase SPAC1834.04 hht1 GO:0005515 PMID:22727667 IPI PomBase:SPAC664.01c F histone H3 h3.1 protein taxon:4896 20210913 PomBase PR:000027593
They should use PR:000027578 instead.
PR:000027593 is specific for hht2, and PR:000027578 for hht1.
Actually, looking at the full set of annotations from that paper (in Canto), it seems more likely that we should switch the annotation to hht1. I've got it open now anyway, so I'll fix it.
OK, corrected annotations will be in our next GAF update (Monday 2021-09-27, unless you need a special edition in a hurry).
OK, I misinterpreted the problem. I didn't consider that I had used the wrong PRO ID. That's a useful sanity check!
Thanks @mah11 !
From the GO perspective, I'd leave the decision on "special edition" to @vanaukenk ; I'd note that this would hold up a NEO update until late next week given our current schedule, but we can always bump the Friday update pretty easily.
@kltm MGI is waiting on four PRO ids. What are our options here? Is there a previous PomBase gpi file we can use?
@vanaukenk Hm...not on hand, I think. I suspect it would be easiest for somebody to hand-edit it and make it public, then update the metadata for the PomBase GPI to point to the location of that file. Or wait for Monday (or after) / have PomBase produce the special edition.
@mah11 - how much trouble would it be to create a special edition file before Monday?
@vanaukenk Couldn't you just delete the offending IDs from the file?
The GPI file is updated every night (UK time) with the latest changes so I think it should be OK in 5 or 6 hours (3am or so UTC). I'll double check when our update is done.
Thank you @kimrutherford
> The GPI file is updated every night (UK time) with the latest changes
It's updated now.
PR:000027593 hht1 hht1 SO:0001217 NCBITaxon:4896 PomBase:SPAC1834.04 PomBase:SPAC1834.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.1
This line has gone so I think that will solve the duplicate PRO ID problem.
@vanaukenk could GO add a pre-submission check that a PRO ID is only assigned to a single protein entity? We could do this ourselves, but other resources might also make this error; it isn't high priority for us, but it breaks GO.
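A minimal sketch of what such a pre-submission check could look like, in Python. It assumes the GPI lines are tab-separated with the identifier in column 1 and the label (symbol) in column 2, as in the lines pasted in this thread; the function name `find_multi_label_ids` and the truncated sample lines are hypothetical, not part of any GO or PomBase tool.

```python
from collections import defaultdict


def find_multi_label_ids(gpi_lines):
    """Return {id: sorted labels} for every ID paired with more than one label.

    Assumption: tab-separated lines, ID in column 1, label in column 2.
    """
    labels_by_id = defaultdict(set)
    for line in gpi_lines:
        # Skip blank lines and header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        labels_by_id[cols[0]].add(cols[1])
    return {
        obj_id: sorted(labels)
        for obj_id, labels in labels_by_id.items()
        if len(labels) > 1
    }


# Illustrative lines, truncated to the first few columns.
sample = [
    "! sample header line",
    "PR:000027593\thht2\thht2\th3.2\tSO:0001217\tNCBITaxon:4896",
    "PR:000027593\thht1\thht1\t\tSO:0001217\tNCBITaxon:4896",
    "PR:000027578\thht1\thht1\t\tSO:0001217\tNCBITaxon:4896",
]
print(find_multi_label_ids(sample))
```

An empty result would mean every identifier maps to exactly one label; anything else lists the identifiers that need a curator's attention before the file is submitted.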
@kltm @balhoff Can we revise the neo build such that, in the future, if there are issues like this, gpi lines are skipped, but the neo build nevertheless continues? Groups would then be notified in a report what lines in their file failed to be incorporated into neo. I'm sure you've thought of this :-)
@vanaukenk It could be done, preferably as an overall refactor of the NEO build. That's in the list of future projects to consider.
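For illustration only, the skip-and-report idea could be sketched roughly like this in Python. This is not how the NEO build works today; it assumes tab-separated GPI lines with the identifier in column 1 and the label in column 2, and `split_conflicting_lines` is a hypothetical name.

```python
def split_conflicting_lines(gpi_lines):
    """Split GPI lines into (kept, skipped).

    `skipped` holds every line whose ID appears with more than one label;
    everything else, including header lines, stays in `kept`.
    """
    def id_and_label(line):
        cols = line.rstrip("\n").split("\t")
        return cols[0], (cols[1] if len(cols) > 1 else "")

    def is_data(line):
        return bool(line.strip()) and not line.startswith("!")

    # First pass: collect the set of labels seen for each ID.
    labels = {}
    for line in gpi_lines:
        if is_data(line):
            obj_id, label = id_and_label(line)
            labels.setdefault(obj_id, set()).add(label)

    # Second pass: keep unambiguous lines, set the rest aside for a report.
    kept, skipped = [], []
    for line in gpi_lines:
        if is_data(line) and len(labels[id_and_label(line)[0]]) > 1:
            skipped.append(line)
        else:
            kept.append(line)
    return kept, skipped


# Illustrative lines, truncated to the first few columns.
sample = [
    "PR:000044737\trpb1\trpb1",
    "PR:000044737\tctt1\tctt1\tcta1",
    "PomBase:SPBC557.03c\tpim1\tpim1\tdcd1|ptr2",
]
kept, skipped = split_conflicting_lines(sample)
for line in skipped:
    obj_id, label = line.split("\t")[:2]
    print(f"skipped: {obj_id} ({label})")
```

The build would continue with `kept`, and `skipped` would feed the report sent back to the contributing group.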
The "neo" pipeline is now failing with:
11:07:42 Caused by: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(PR:000028992 id( PR:000028992)synonym( SPBC19C2.09.1:pep RELATED)synonym( ofd1 BROAD)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)synonym( 000028992 RELATED)xref( PomBase:SPBC19C2.09.1:pep)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct)name( sre1 Spom)synonym( sre1 BROAD)xref( PomBase:SPBC6B1.08c.1:pep)synonym( uS12 RELATED)synonym( SPBC6B1.08c.1:pep RELATED)name( ofd1 Spom)is_a( CHEBI:33695)relationship( in_taxon NCBITaxon:4896)relationship( has_gene_template PomBase:SPBC19C2.09)relationship( has_gene_template PomBase:SPBC6B1.08c))
Running through the PomBase GPI with `zcat pombase.gpi.gz | grep -v ^! | cut -f 1,2 | sort | uniq -c | cut -f -1 | uniq -c | less`, I found the following anomalies:
1 1 PomBase:SPBC557.03c,PR:000045536
2 1 PR:000028992
2 1 PR:000044737
Looking at those individually, we have two sets of repeat labels:
PR:000044737 rpb1 rpb1 SO:0001217 NCBITaxon:4896 PomBase:SPBC28F2.12 PomBase:SPBC28F2.12.1:pep UniProtKB:P36594 go-annotation-summary=RNA polymerase II large subunit Rpb1
PR:000044737 ctt1 ctt1 cta1 SO:0001217 NCBITaxon:4896 PomBase:SPCC757.07c PomBase:SPCC757.07c.1:pep UniProtKB:P55306 go-annotation-summary=catalase
PR:000028992 sre1 sre1 SO:0001217 NCBITaxon:4896 PomBase:SPBC19C2.09 PomBase:SPBC19C2.09.1:pep UniProtKB:Q9UUD1 go-annotation-summary=DNA-binding transcription factor, sterol regulatory element binding protein Sre1
PR:000028992 ofd1 ofd1 uS12 SO:0001217 NCBITaxon:4896 PomBase:SPBC6B1.08c PomBase:SPBC6B1.08c.1:pep UniProtKB:Q11120 go-annotation-summary=hypoxic oxygen sensor, prolyl-3,4-dihydroxylase Ofd1
and something that looks like a dupe and a format error:
PomBase:SPBC557.03c,PR:000045536 fft3 fft3 snf2SR SO:0001217 NCBITaxon:4896 PomBase:SPAC25A8.01c PomBase:SPAC25A8.01c.1:pep UniProtKB:O42861 go-annotation-summary=SMARCAD1 family ATPase Fft3
PomBase:SPBC557.03c pim1 pim1 dcd1|ptr2 SO:0001217 NCBITaxon:4896 UniProtKB:P28745 go-annotation-summary=RCC1 family Ran GEF
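The format error above (two identifiers joined by a comma in the ID column) can be caught separately from the duplicate-label problem. A minimal Python sketch, assuming tab-separated lines with the identifier in column 1; the regular expression is a deliberately loose approximation of a single `prefix:local` identifier, not the official GPI grammar, and `find_malformed_ids` is a hypothetical name.

```python
import re

# Loose approximation of one "prefix:local" identifier: a prefix, a colon,
# then a local part with no whitespace or list separators (",", "|", ";").
SINGLE_ID_RE = re.compile(r"[A-Za-z_][\w.-]*:[^\s,|;]+")


def find_malformed_ids(gpi_lines):
    """Return (line_number, id) pairs whose first column is not a single ID."""
    problems = []
    for number, line in enumerate(gpi_lines, start=1):
        # Skip blank lines and header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        obj_id = line.rstrip("\n").split("\t")[0]
        if SINGLE_ID_RE.fullmatch(obj_id) is None:
            problems.append((number, obj_id))
    return problems


# Illustrative lines, truncated to the first few columns.
sample = [
    "PomBase:SPBC557.03c,PR:000045536\tfft3\tfft3\tsnf2SR",
    "PomBase:SPBC557.03c\tpim1\tpim1\tdcd1|ptr2",
    "PR:000044737\trpb1\trpb1",
]
print(find_malformed_ids(sample))
```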
> Looking at those individually, we have two sets of repeat labels:
Thanks for the details.
@ValWood @mah11 the duplicates are from these curation sessions:
PR:000044737 7c2513328f82d364 SPBC28F2.12
PR:000044737 7c2513328f82d364 SPCC757.07c
PR:000028992 4d5512a5dc8ea4bc SPBC19C2.09
PR:000028992 4d5512a5dc8ea4bc SPBC6B1.08c
> and something that looks like a dupe and a format error: PomBase:SPBC557.03c,PR:000045536
This problem comes from chromosome1.contig, in the annotation for SPAC25A8.01c.
I've added some extra checks to the PomBase nightly update so that this sort of problem will be reported to the curators if it happens again. (Val, Midori: the warnings will be in the ".chado_checks" file from tomorrow.)
And they were!
Looking better! Taking a look at today's attempt, most of the issues seem resolved, with one remaining:
zcat pombase.gpi.gz | grep -v ^! | cut -f 1,2 | sort | uniq -c | cut -f -1 | uniq -c | grep " 2"
2 1 PR:000044737
zcat pombase.gpi.gz | grep PR:000044737
PR:000044737 rpb1 rpb1 SO:0001217 NCBITaxon:4896 PomBase:SPBC28F2.12 PomBase:SPBC28F2.12.1:pep UniProtKB:P36594 go-annotation-summary=RNA polymerase II large subunit Rpb1
PR:000044737 ctt1 ctt1 cta1 SO:0001217 NCBITaxon:4896 PomBase:SPCC757.07c PomBase:SPCC757.07c.1:pep UniProtKB:P55306 go-annotation-summary=catalase
As PomBase is not currently actively using Noctua for curation, just to move the NEO pipeline along temporarily (to get some other identifiers available), we'll momentarily suspend the PomBase GPI, run our end, and then re-enable the PomBase GPI.
Ok these are in our logs
PRO IDs used for more than one gene - CHECK FAILURE: expected 0 but got 2
PR:000044737 7c2513328f82d364 SPBC28F2.12.1
PR:000044737 7c2513328f82d364 SPCC757.07c.1
I was supposed to fix them today. I will do them tomorrow morning...sorry about that...
Now fixed. I had made the most bonkers annotation ever! A copy edit error I hope....
Great--thank you! Retrying.
The duplicate is still in our GPI file for now because we haven't had a nightly update since Val made her fix. Tonight's update will finish at 3am or so UTC.
@ValWood @kimrutherford Great--thank you for your help. This now seems to be cleared.
Yesterday's neo build failed with an error that one PRO id in the PomBase gpi file maps to more than one primary name.
See the Slack thread here: https://geneontologyworkspace.slack.com/archives/C01Q3GL2Y7J/p1632252291079600
@ValWood @kimrutherford - these look to be PRO ids for histones, so perhaps there is a bona fide 1:many mapping here at least wrt shared sequence?
Since neo is used for Noctua annotation, we require unique 1:1 mappings between ids and primary names. Do you think it's possible to get unique PRO ids for each of the individual pombe histone genes?
@kltm @cmungall @balhoff