geneontology / neo

noctua entity ontology

PRO ids map to more than one gene in PomBase gpi file - causing a neo build error #74

Closed vanaukenk closed 3 years ago

vanaukenk commented 3 years ago

Yesterday's neo build failed with an error that one PRO id in the PomBase gpi file maps to more than one primary name.

See the Slack thread here: https://geneontologyworkspace.slack.com/archives/C01Q3GL2Y7J/p1632252291079600

@ValWood @kimrutherford - these look to be PRO ids for histones, so perhaps there is a bona fide 1:many mapping here at least wrt shared sequence?

Since neo is used for Noctua annotation, we require unique 1:1 mappings between ids and primary names. Do you think it's possible to get unique PRO ids for each of the individual pombe histone genes?

@kltm @cmungall @balhoff

ValWood commented 3 years ago

Which PRO ID is it? I can take a look. Some of the PRO IDs apply to any histone of a specific type (e.g. H1, because they are exact duplicates), and some are specific to the actual entity.

I thought I had only used the specific ones, but it is possible I used a generic one.

vanaukenk commented 3 years ago

PR:000027593 hht2 hht2 h3.2 SO:0001217 NCBITaxon:4896 PomBase:SPBC8D2.04 PomBase:SPBC8D2.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.2

PR:000027593 hht1 hht1 SO:0001217 NCBITaxon:4896 PomBase:SPAC1834.04 PomBase:SPAC1834.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.1

ValWood commented 3 years ago

Well, I can fix it to be more specific (I will try to get to it later today if the PRO terms already exist for the modified form). But the generic PRO IDs should really work for this purpose, so we should be able to use them in this way. It seems to be a GO issue rather than an annotation issue.

It's odd that this is the first time the issue has arisen. Other species have many more copies of histones and it wouldn't really be practical for them to request a PRO ID for each copy every time they have a modified form... do they never use the generic form?

vanaukenk commented 3 years ago

@ValWood I'm not sure how many groups are annotating to modified proteins as their primary annotation objects in Noctua, so that may be why it hasn't arisen yet in this context.

Do you currently use these ids in the Annotation Isoform column of a GAF or as the primary annotation object? Or something else, i.e. annotation extension object?

ValWood commented 3 years ago

The primary object is the gene. The isoform is in column 17.

I don't know where this ends up in the GPAD as I haven't been involved in that.

@mah11 or @kimrutherford can let you know if I overlooked anything.

In fact I may not be able to fix this; at least, I don't have enough information above. @vanaukenk can you supply the problematic annotation?

ValWood commented 3 years ago

I'm still not sure why it isn't a valid object in column 17, and that's the only place we would use these IDs.

ValWood commented 3 years ago

I don't know where these are coming from for GO. In annotation we always used the specific form https://www.pombase.org/gene/SPAC1834.04 https://www.pombase.org/gene/SPBC8D2.04 at least for hht annotation. It is possible this is in a has_input annotation on one of the binding partners...

kltm commented 3 years ago

The examples we ran into in this case seem to be:

PR:000027593 hht2 hht2 h3.2 SO:0001217 NCBITaxon:4896 PomBase:SPBC8D2.04 PomBase:SPBC8D2.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.2

PR:000027593 hht1 hht1 SO:0001217 NCBITaxon:4896 PomBase:SPAC1834.04 PomBase:SPAC1834.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.1

What's choking the system is the same identifier with two different labels.
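As a rough illustration of that failure mode, the Python sketch below groups GPI lines by the identifier in the first column and reports any identifier that carries more than one label in the second column. It assumes tab-separated lines with "!"-prefixed header lines, as in the lines quoted in this thread. The function name ids_with_multiple_labels is made up for the example and is not part of the neo build.

```python
from collections import defaultdict


def ids_with_multiple_labels(gpi_lines):
    """Return {identifier: sorted labels} for identifiers with more than one label.

    Assumption: column 1 is the identifier and column 2 is the label (symbol).
    """
    labels = defaultdict(set)
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # not enough columns to compare
        labels[cols[0]].add(cols[1])
    return {ident: sorted(names) for ident, names in labels.items() if len(names) > 1}


# First few columns of lines quoted in this thread (remaining columns omitted).
sample = [
    "! header line",
    "PR:000027593\thht2\thht2\th3.2\tSO:0001217\tNCBITaxon:4896",
    "PR:000027593\thht1\thht1\t\tSO:0001217\tNCBITaxon:4896",
    "PomBase:SPBC557.03c\tpim1\tpim1\tdcd1|ptr2\tSO:0001217\tNCBITaxon:4896",
]

print(ids_with_multiple_labels(sample))
# {'PR:000027593': ['hht1', 'hht2']}
```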

mah11 commented 3 years ago

The problem comes from these two annotations:

PomBase SPAC1834.04 hht1        GO:0005515  PMID:22727667   IPI PomBase:SPBC428.08c F   histone H3 h3.1     protein taxon:4896  20210913    PomBase     PR:000027593
PomBase SPAC1834.04 hht1        GO:0005515  PMID:22727667   IPI PomBase:SPAC664.01c F   histone H3 h3.1     protein taxon:4896  20210913    PomBase     PR:000027593

They should use PR:000027578 instead.

PR:000027593 is specific for hht2, and PR:000027578 for hht1.

mah11 commented 3 years ago

Actually, looking at the full set of annotations from that paper (in Canto), it seems more likely that we should switch the annotation to hht1. I've got it open now anyway, so I'll fix it.

mah11 commented 3 years ago

OK, corrected annotations will be in our next GAF update (Monday 2021-09-27, unless you need a special edition in a hurry).

ValWood commented 3 years ago

OK, I misinterpreted the problem. I didn't consider that I had used the wrong PRO ID. That's a useful sanity check!

Thanks @mah11 !

kltm commented 3 years ago

From the GO perspective, I'd leave the decision on "special edition" to @vanaukenk ; I'd note that this would hold up a NEO update until late next week given our current schedule, but we can always bump the Friday update pretty easily.

vanaukenk commented 3 years ago

@kltm MGI is waiting on four PRO ids. What are our options here? Is there a previous PomBase gpi file we can use?

kltm commented 3 years ago

@vanaukenk Hm... not on hand, I think. I suspect it would be easiest for somebody to hand-edit it and make it public, then update the metadata for the PomBase GPI to point to the location of that file. Or wait for Monday (or after) / have PomBase produce the special edition.

vanaukenk commented 3 years ago

@mah11 - how much trouble would it be to create a special edition file before Monday?

ValWood commented 3 years ago

@vanaukenk Couldn't you just delete the offending IDs from the file?

kimrutherford commented 3 years ago

The GPI file is updated every night (UK time) with the latest changes so I think it should be OK in 5 or 6 hours (3am or so UTC). I'll double check when our update is done.

vanaukenk commented 3 years ago

Thank you @kimrutherford

kimrutherford commented 3 years ago

The GPI file is updated every night (UK time) with the latest changes

It's updated now.

PR:000027593 hht1 hht1 SO:0001217 NCBITaxon:4896 PomBase:SPAC1834.04 PomBase:SPAC1834.04.1:pep UniProtKB:P09988 go-annotation-summary=histone H3 h3.1

This line has gone so I think that will solve the duplicate PRO ID problem.

ValWood commented 3 years ago

@vanaukenk could GO add a pre-submission check that a PRO ID is only assigned to a single protein entity? We could do this ourselves, but other resources might also make this error, and although it isn't high priority for us, it breaks GO.
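A check of this kind could be sketched as follows, working on GAF lines and assuming the standard 17-column GAF layout, with the gene in columns 1 and 2 and the gene product form ID in column 17. The helper gaf_row, the function forms_on_multiple_genes and the sample rows are constructed for illustration from identifiers mentioned in this thread; they are not real annotations and not an existing GO or PomBase check.

```python
from collections import defaultdict


def gaf_row(db, gene_id, symbol, form_id):
    """Build a 17-column GAF row with only columns 1, 2, 3 and 17 filled (illustration only)."""
    cols = [""] * 17
    cols[0], cols[1], cols[2], cols[16] = db, gene_id, symbol, form_id
    return "\t".join(cols)


def forms_on_multiple_genes(gaf_lines):
    """Return {column-17 ID: sorted genes} for IDs attached to more than one gene."""
    genes = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 17 or not cols[16].strip():
            continue  # no gene product form ID on this row
        genes[cols[16].strip()].add(f"{cols[0]}:{cols[1]}")
    return {form: sorted(g) for form, g in genes.items() if len(g) > 1}


# Hypothetical rows: the same PRO ID used on two different genes.
rows = [
    gaf_row("PomBase", "SPAC1834.04", "hht1", "PR:000027593"),
    gaf_row("PomBase", "SPBC8D2.04", "hht2", "PR:000027593"),
    gaf_row("PomBase", "SPBC557.03c", "pim1", ""),
]

print(forms_on_multiple_genes(rows))
# {'PR:000027593': ['PomBase:SPAC1834.04', 'PomBase:SPBC8D2.04']}
```

Run before submission, an empty result would mean every column-17 ID is attached to a single gene.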

vanaukenk commented 3 years ago

@kltm @balhoff Can we revise the neo build such that, in the future, if there are issues like this, gpi lines are skipped, but the neo build nevertheless continues? Groups would then be notified in a report what lines in their file failed to be incorporated into neo. I'm sure you've thought of this :-)
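To make the idea concrete, here is a minimal Python sketch of a skip-and-report step, assuming tab-separated GPI lines with the identifier in column 1 and the label in column 2. It is not how the neo build works today; the function name split_gpi and the report format are assumptions made for the example.

```python
from collections import defaultdict

REASON = "identifier has more than one label"


def split_gpi(gpi_lines):
    """Return (kept_lines, skipped), where skipped holds (line_number, identifier, reason)."""
    # First pass: collect the labels seen for each identifier.
    labels = defaultdict(set)
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            labels[cols[0]].add(cols[1])
    conflicted = {ident for ident, names in labels.items() if len(names) > 1}

    # Second pass: keep clean lines, set aside every line of a conflicted identifier.
    kept, skipped = [], []
    for number, line in enumerate(gpi_lines, start=1):
        ident = line.split("\t")[0]
        if not line.startswith("!") and ident in conflicted:
            skipped.append((number, ident, REASON))
        else:
            kept.append(line)
    return kept, skipped


# First columns of lines quoted in this thread (remaining columns omitted).
lines = [
    "! header line",
    "PR:000028992\tsre1\tsre1",
    "PR:000028992\tofd1\tofd1",
    "PomBase:SPBC557.03c\tpim1\tpim1",
]

kept, skipped = split_gpi(lines)
for number, ident, reason in skipped:
    print(f"line {number}: {ident} skipped ({reason})")
```

The kept lines would go on to the build, and the skipped list would feed the report sent back to the submitting group.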

kltm commented 3 years ago

@vanaukenk It could be done, preferably as an overall refactor of the NEO build. That's in the list of future projects to consider.

kltm commented 3 years ago

The "neo" pipeline is now failing with:

11:07:42  Caused by: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(PR:000028992 id( PR:000028992)synonym( SPBC19C2.09.1:pep RELATED)synonym( ofd1 BROAD)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)synonym( 000028992 RELATED)xref( PomBase:SPBC19C2.09.1:pep)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct)name( sre1 Spom)synonym( sre1 BROAD)xref( PomBase:SPBC6B1.08c.1:pep)synonym( uS12 RELATED)synonym( SPBC6B1.08c.1:pep RELATED)name( ofd1 Spom)is_a( CHEBI:33695)relationship( in_taxon NCBITaxon:4896)relationship( has_gene_template PomBase:SPBC19C2.09)relationship( has_gene_template PomBase:SPBC6B1.08c))

Running through the pombase gpi with zcat pombase.gpi.gz | grep -v ^! | cut -f 1,2 | sort | uniq -c | cut -f -1 | uniq -c | less, I found the following anomalies:

      1       1 PomBase:SPBC557.03c,PR:000045536
      2       1 PR:000028992
      2       1 PR:000044737

Looking at those individually, we have two sets of repeat labels:

PR:000044737    rpb1    rpb1        SO:0001217  NCBITaxon:4896  PomBase:SPBC28F2.12 PomBase:SPBC28F2.12.1:pep       UniProtKB:P36594    go-annotation-summary=RNA polymerase II large subunit Rpb1
PR:000044737    ctt1    ctt1    cta1    SO:0001217  NCBITaxon:4896  PomBase:SPCC757.07c PomBase:SPCC757.07c.1:pep       UniProtKB:P55306    go-annotation-summary=catalase
PR:000028992    sre1    sre1        SO:0001217  NCBITaxon:4896  PomBase:SPBC19C2.09 PomBase:SPBC19C2.09.1:pep       UniProtKB:Q9UUD1    go-annotation-summary=DNA-binding transcription factor, sterol regulatory element binding protein Sre1
PR:000028992    ofd1    ofd1    uS12    SO:0001217  NCBITaxon:4896  PomBase:SPBC6B1.08c PomBase:SPBC6B1.08c.1:pep       UniProtKB:Q11120    go-annotation-summary=hypoxic oxygen sensor, prolyl-3,4-dihydroxylase Ofd1

and something that looks like a dupe and a format error:

PomBase:SPBC557.03c,PR:000045536    fft3    fft3    snf2SR  SO:0001217  NCBITaxon:4896  PomBase:SPAC25A8.01c    PomBase:SPAC25A8.01c.1:pep      UniProtKB:O42861    go-annotation-summary=SMARCAD1 family ATPase Fft3
PomBase:SPBC557.03c pim1    pim1    dcd1|ptr2   SO:0001217  NCBITaxon:4896              UniProtKB:P28745    go-annotation-summary=RCC1 family Ran GEF

kimrutherford commented 3 years ago

Looking at those individually, we have two sets of repeat labels:

Thanks for the details.

@ValWood @mah11 the duplicates are from these curation sessions:

  PR:000044737  7c2513328f82d364        SPBC28F2.12
  PR:000044737  7c2513328f82d364        SPCC757.07c
  PR:000028992  4d5512a5dc8ea4bc        SPBC19C2.09
  PR:000028992  4d5512a5dc8ea4bc        SPBC6B1.08c

and something that looks like a dupe and a format error: PomBase:SPBC557.03c,PR:000045536

This problem is from chromosome1.contig in the annotation for SPAC25A8.01c

I've added some extra checks to the PomBase nightly update so that these sorts of problems will be reported to the curators if they happen again. (Val, Midori: the warnings will be in the ".chado_checks" file from tomorrow.)
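For the comma-joined identifier specifically, a format check might look like the Python sketch below. It assumes the first GPI column should hold exactly one prefix:local-id identifier with no comma, pipe or whitespace. The pattern and the function name malformed_ids are illustrative guesses, not the actual PomBase checks.

```python
import re

# Assumed shape of a single identifier: a prefix, a colon, then a local ID
# that contains no whitespace, comma or pipe.
SINGLE_ID = re.compile(r"[A-Za-z][\w.]*:[^\s,|]+")


def malformed_ids(gpi_lines):
    """Return (line_number, identifier) for lines whose first column is not a single ID."""
    bad = []
    for number, line in enumerate(gpi_lines, start=1):
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header lines
        ident = line.rstrip("\n").split("\t", 1)[0]
        if not SINGLE_ID.fullmatch(ident):
            bad.append((number, ident))
    return bad


# First columns of lines quoted in this thread (remaining columns omitted).
gpi = [
    "PomBase:SPBC557.03c,PR:000045536\tfft3\tfft3",
    "PomBase:SPBC557.03c\tpim1\tpim1",
    "PR:000044737\trpb1\trpb1",
]

print(malformed_ids(gpi))
# [(1, 'PomBase:SPBC557.03c,PR:000045536')]
```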

ValWood commented 3 years ago

And they were!

kltm commented 3 years ago

Looking better! Taking a look at today's attempt, most of the issues seem resolved, with one remaining:

zcat pombase.gpi.gz | grep -v ^! | cut -f 1,2 | sort | uniq -c | cut -f -1 | uniq -c | grep "  2"
      2       1 PR:000044737
zcat pombase.gpi.gz | grep PR:000044737
PR:000044737    rpb1    rpb1        SO:0001217  NCBITaxon:4896  PomBase:SPBC28F2.12 PomBase:SPBC28F2.12.1:pep       UniProtKB:P36594    go-annotation-summary=RNA polymerase II large subunit Rpb1
PR:000044737    ctt1    ctt1    cta1    SO:0001217  NCBITaxon:4896  PomBase:SPCC757.07c PomBase:SPCC757.07c.1:pep       UniProtKB:P55306    go-annotation-summary=catalase

kltm commented 3 years ago

As PomBase is not currently actively using Noctua for curation, just to move the NEO pipeline along temporarily (to get some other identifiers available), we'll momentarily suspend the PomBase GPI, run our end, and then re-enable the PomBase GPI.

ValWood commented 3 years ago

Ok these are in our logs

PRO IDs used for more than one gene - CHECK FAILURE: expected 0 but got 2

PR:000044737 7c2513328f82d364 SPBC28F2.12.1
PR:000044737 7c2513328f82d364 SPCC757.07c.1

I was supposed to fix them today. I will do them tomorrow morning...sorry about that...

ValWood commented 3 years ago

Now fixed. I had made the most bonkers annotation ever! A copy edit error I hope....

kltm commented 3 years ago

Great--thank you! Retrying.

kimrutherford commented 3 years ago

The duplicate is still in our GPI file for now because we haven't had a nightly update since Val made her fix. Tonight's update will finish at 3am or so UTC.

kltm commented 3 years ago

@ValWood @kimrutherford Great--thank you for your help. This now seems to be cleared.