Allow tomato annotations through general channels (SGN not canonical resource)

kltm commented 5 years ago

The outcome of this should be that tomato annotations do not get filtered out of goa_uniprot_all and are able to get picked up by AmiGO and downstreams (like PANTHER).

https://github.com/geneontology/pipeline/issues/92 https://github.com/geneontology/go-site/pull/1090

Tagging a mess o' people: @dustine32 @dougli1sqrd @pgaudet @cmungall

kltm commented 5 years ago

Okay, from a quick survey, it looks like there is not much in the way of overlap. In goa:

      1 RCA
      4 IGI
      8 TAS
     19 ISS
     35 IPI
     37 IEP
     52 IMP
     98 IDA
   4365 IBA
 103028 IEA

In sgn:

      2 IPI
      3 IGI
      6 IC
      7 TAS
     54 IMP
    134 IEP
    177 ISS
   1075 IDA
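For anyone wanting to reproduce a tally like the two above, here is a minimal Python sketch. It assumes standard GAF layout (evidence code in column 7, taxon in column 13) and assumes the survey was restricted to tomato by taxon; the thread does not say how the counts were actually produced, and the function names are purely illustrative.

```python
import gzip
from collections import Counter


def count_evidence_codes(lines, taxon=None):
    """Tally GAF column 7 (evidence code), skipping '!' header lines.

    If `taxon` is given (e.g. "taxon:4081"), only rows whose first
    taxon in column 13 matches are counted.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # not a complete GAF row
        if taxon is not None and cols[12].split("|")[0] != taxon:
            continue
        counts[cols[6]] += 1
    return counts


def count_evidence_codes_in_file(path, taxon=None):
    # Assumption: the file is gzip-compressed, like the *.gaf.gz products.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_evidence_codes(handle, taxon=taxon)


def format_counts(counts):
    # Ascending by count, similar in shape to `sort | uniq -c | sort -n`.
    ordered = sorted(counts.items(), key=lambda kv: kv[1])
    return "\n".join(f"{n:7d} {code}" for code, n in ordered)
```

Something like `print(format_counts(count_evidence_codes_in_file("sgn.gaf.gz")))` would then print one line per evidence code.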

kltm commented 5 years ago

Okay, to clarify, @thomaspd's issue is with the single namespace, rather than with conflicting annotations.

dustine32 commented 5 years ago

From talking with @thomaspd :

The namespace issue has to get cleared up first as SGN uses SGN IDs and UniProt uses UniProtKB.

Either:

- SGN retains authority source status with that taxa: property set in sgn.yaml. They bring in IEAs (performing UniProtKB-to-SGN mapping themselves) from goa_uniprot_all. IBAs would have to be mapped to SGN then too, which may be a nontrivial task since this isn't what's immediately available in our Panther long IDs (ex: SOLLC|EnsemblGenome=Solyc04g014800.2|UniProtKB=A0A140TAT3), though we have ways of getting around this.
- SGN relinquishes authority status by removing the taxa: property from sgn.yaml and they would convert their SGN IDs to UniProt in their GAF.
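To make the mapping problem concrete, the Panther long ID quoted above can be taken apart as sketched below. The format is inferred only from that single example (organism code, then pipe-separated `Db=Accession` pairs), and `parse_panther_long_id` is a hypothetical helper, not part of any PANTHER or GO tooling.

```python
def parse_panther_long_id(long_id):
    """Split 'ORG|Db=Acc|Db=Acc' into (organism, {db: accession}).

    Assumption: the first pipe-separated field is the organism code and
    every following field is a 'Db=Accession' pair, as in the example
    from the thread.
    """
    organism, *xrefs = long_id.split("|")
    accessions = {}
    for xref in xrefs:
        # Split on the first '=' only, in case an accession contains '='.
        db, _, acc = xref.partition("=")
        accessions[db] = acc
    return organism, accessions


organism, accessions = parse_panther_long_id(
    "SOLLC|EnsemblGenome=Solyc04g014800.2|UniProtKB=A0A140TAT3"
)
print(organism, accessions["UniProtKB"], accessions["EnsemblGenome"])
```

The long ID carries a UniProtKB accession and an EnsemblGenome identifier but no SGN-prefixed field, so getting IBAs into the SGN namespace would still need a separate lookup step.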

Retagging @thomaspd to make sure I explained this correctly.

kltm commented 5 years ago

Given a single namespace restriction, the choices are: 1) SGN converts (possible? how long would it take?) 2) remove SGN (easy, lose some annotations) 3) filter goa (original state, lose a lot of IEA and some others)

thomaspd commented 5 years ago

Yes, just to flesh out the options, in the order of my preference:

1. SGN stays the "authoritative source". Like all the other authoritative sources, they convert both IEAs and IBAs (currently both using UniProt namespace) to SGN IDs and put them in their submitted GAF.
2. SGN changes status to contributor, and they convert their identifiers to UniProt, so the pipeline can merge them into the total GAF file (along with IEAs from UniProt and IBAs from PAINT) for tomato.
3. SGN retains status as authoritative source, but would not do what the other authoritative sources are doing. They would keep SGN identifiers, and eventually GO Central would map the IEAs and IBAs to SGN identifiers and merge into the total GAF file. This workflow is on our roadmap, but not implemented, so in the short term this would mean we're at the original state.

Seth's option #2 above might not be a good one, as we don't want to drop experimental annotations.

dustine32 commented 5 years ago

@thomaspd @kltm Thinking a bit more about handling IBAs. If SGN were to take it upon themselves to convert everything including IEAs to SGN namespace (Paul's #1), wouldn't the IBAs be stripped from the SGN file in accordance with gorule-26? The onus would then be on PAINT to output SGN IDs in the paint_sgn.gaf? And if SGN namespace is the winner then, regardless of who performs the initial conversion to SGN, the PAINT/Panther update pipeline would have to map SGN IDs back into the Panther long ID in order to cycle any tomato back through PAINT/Panther.

So I will explain to the SGN folks that if they want to use SGN IDs in their GAF, that means they sign up to take responsibility for ALL tomato annotations and thus will be required to do all the wrangling of tomato annotations from other sources (goa_uniprot_all IEAs, PAINT IBAs) for depositing to GO Central. Otherwise, if they don't wanna have to handle this, they can still contribute but their GAF IDs need to be in UniProtKB namespace.

Sound good?
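As a rough illustration of the "contribute in UniProtKB namespace" option above, the sketch below rewrites the first two GAF columns (DB and DB Object ID) using a lookup table. The mapping and the identifiers in it are hypothetical placeholders; a real conversion would need an SGN-to-UniProtKB mapping from SGN or UniProt, and would also have to decide what to do with unmapped genes and with other columns that embed identifiers.

```python
def to_uniprot_namespace(line, sgn_to_uniprot):
    """Rewrite one GAF row from the SGN namespace to UniProtKB.

    Returns the row unchanged for headers and non-SGN rows, and None
    when no mapping exists, so the caller can report or drop it.
    `sgn_to_uniprot` is an assumed dict of SGN object ID -> accession.
    """
    if line.startswith("!"):
        return line
    cols = line.rstrip("\n").split("\t")
    if cols[0] != "SGN":
        return line
    accession = sgn_to_uniprot.get(cols[1])
    if accession is None:
        return None
    # Only columns 1 and 2 change; symbol, with/from, etc. are untouched.
    cols[0], cols[1] = "UniProtKB", accession
    return "\t".join(cols) + "\n"


def convert_gaf(lines, sgn_to_uniprot):
    """Return (converted_rows, unmapped_rows) for a whole GAF."""
    converted, unmapped = [], []
    for line in lines:
        result = to_uniprot_namespace(line, sgn_to_uniprot)
        (unmapped if result is None else converted).append(
            line if result is None else result
        )
    return converted, unmapped
```

Keeping the unmapped rows in a separate list makes it easy to see how many experimental annotations would be lost for lack of a mapping.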

kltm commented 5 years ago

I think assuming the single namespace is, to some extent, a slightly new thing. For example:

bbop@wok:/tmp/bib⟫ reset && for filename in ./*.gaf.gz; do echo "$filename" && zgrep -v --no-filename "^!" $filename | cut -f 1,13 | awk '{ print $2 " " $1}' | sort | uniq | cut -d " " -f 1 | uniq -c; done;

Gives the (trimmed) output of:

./cgd.gaf.gz
      2 taxon:237561
./dictybase.gaf.gz
      2 taxon:44689
./ecocyc.gaf.gz
      2 taxon:83333
./fb.gaf.gz
      2 taxon:7227
./goa_uniprot_all_noiea.gaf.gz
      2 taxon:11676
      2 taxon:31033
./mgi.gaf.gz
      2 taxon:10090
./pamgo_oomycetes.gaf.gz
      2 taxon:67593
      2 taxon:67593|taxon:3847
./pombase.gaf.gz
      2 taxon:284812
./rgd.gaf.gz
      2 taxon:10116
./tair.gaf.gz
      2 taxon:3702
./wb.gaf.gz
      2 taxon:6239
./zfin.gaf.gz
      2 taxon:7955

That is a count of namespaces per taxon. This often seems to be the resource namespace plus UniProtKB. I point this out as there seems to be no current technical restriction on this in other places.
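The same check as the shell pipeline above can be sketched in Python, which also makes it easy to see which namespaces are involved rather than only how many. This assumes standard GAF columns (DB in column 1, taxon in column 13) and, like the shell version, treats the whole taxon field (including dual-taxon values such as `taxon:67593|taxon:3847`) as the key. The function names are illustrative.

```python
from collections import defaultdict


def namespaces_per_taxon(lines):
    """Map each GAF taxon field (column 13) to the set of DB values (column 1)."""
    seen = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue  # skip GAF header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # not a complete GAF row
        seen[cols[12]].add(cols[0])
    return seen


def taxa_with_multiple_namespaces(lines):
    """Return {taxon: sorted namespaces} for taxa using more than one namespace."""
    return {
        taxon: sorted(dbs)
        for taxon, dbs in namespaces_per_taxon(lines).items()
        if len(dbs) > 1
    }
```

Reading a file with `gzip.open(path, "rt")` and passing the handle to `taxa_with_multiple_namespaces` would give the per-file report.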

thomaspd commented 5 years ago

Wow, wonders never cease! We should address this at the next GO meeting. I think it's really important for users that we have a single namespace per GAF.

pgaudet commented 5 years ago

@kltm What are the namespaces? (There are 2, but which ones?) Perhaps we can start to fix this before the GOC meeting?

Thanks, Pascale

kltm commented 5 years ago

@pgaudet As above, they are by and large the resource namespace and UniProtKB. The exceptions seem to be:

pamgo_oomycetes.gaf.gz
 taxon:67593 NCBI_GP
 taxon:67593 PAMGO_VMD
 taxon:67593|taxon:3847 NCBI_GP
 taxon:67593|taxon:3847 PAMGO_VMD
goa_uniprot_all_noiea.gaf.gz
 taxon:8355 RNAcentral
 taxon:8355 UniProtKB
 taxon:8090 RNAcentral
 taxon:8090 UniProtKB
 taxon:7788 ComplexPortal
 taxon:7788 UniProtKB
 taxon:31033 RNAcentral
 taxon:31033 UniProtKB
 taxon:11676 RNAcentral
 taxon:11676 UniProtKB

pgaudet commented 5 years ago

I think the namespace must take into account the type of object? It seems correct to me that we use ComplexPortal, UniProtKB and RNAcentral for the same taxon.

@alexsign @vanaukenk

cmungall commented 5 years ago

Getting back to tomato and looking at the counts @kltm produced above (https://github.com/geneontology/go-site/issues/1091#issuecomment-492442631) - I was surprised by the differences as I would have expected goa to hoover up the sgn annotations - yet we have 1k IDAs in SGN and ~100 in the uniprot file.

It looks like the majority of these SGN gene IDs may not be mapped to UniProt IDs? If so this is upstream of us, cc @alexsign is this the case?

For now I think the best thing to do is to include both SGN and UniProt (Seth's suggestion), even though there will be some redundancy, with the same thing appearing under different IDs. We still need to have a canonical set of IDs for tomato, though.

kltm commented 5 years ago

From the software discussion today, with input from @thomaspd and @cmungall , we'll be temporarily going with the permissive approach and allow tomato to have two possible namespaces in different files. Literally, remove the filter from sgn.yaml, allowing annotations from both SGN and GOA, as implemented in https://github.com/geneontology/go-site/pull/1090. This is the current state of the pipeline; no further action should be needed.
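To illustrate what "removing the filter" changes, here is a minimal sketch of a taxon-based filter over a general GAF such as goa_uniprot_all. It is not the pipeline's actual code: the `claimed_taxa` set stands in for whatever the taxa: properties in the resource metadata resolve to, and `taxon:4081` is used on the assumption that it is the NCBI taxon for tomato (Solanum lycopersicum).

```python
def filter_claimed_taxa(lines, claimed_taxa):
    """Yield GAF lines, dropping rows whose first taxon is claimed elsewhere.

    `claimed_taxa` is an assumed set of strings like "taxon:4081",
    representing taxa that an authoritative resource has claimed.
    """
    for line in lines:
        if line.startswith("!"):
            yield line  # always keep header lines
            continue
        cols = line.rstrip("\n").split("\t")
        primary_taxon = cols[12].split("|")[0] if len(cols) > 12 else ""
        if primary_taxon not in claimed_taxa:
            yield line


# With the claim in place, tomato rows are dropped from the general file:
#   kept = list(filter_claimed_taxa(goa_lines, {"taxon:4081"}))
# With the claim removed (the permissive approach), they pass through:
#   kept = list(filter_claimed_taxa(goa_lines, set()))
```

Under this reading, dropping tomato from the claimed set is all that is needed for the GOA tomato annotations to flow downstream, at the price of two namespaces for one taxon.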

We currently have no ticket for this roadmap issue--it is essentially a larger question of how we handle various inputs as we move forward with both accepting more upstreams and centralizing many use cases.

pgaudet commented 5 years ago

> We currently have no ticket for this roadmap issue--it is essentially a larger question of how we handle various inputs as we move forward with both accepting more upstreams and centralizing many use cases.

@kltm Can you open a ticket? I think you would formulate the issue better than I would. We don't want this to fall through the cracks.

Thanks, Pascale

kltm commented 5 years ago

@pgaudet I don't think there is anything more to do on this ticket as it stands. I would correct myself and say it's more of a project unto itself (as in exhaustive software list) and still TBD.

pgaudet commented 5 years ago

Right, I was suggesting to open a ticket to make sure it doesn't fall through the cracks - can you open a ticket that gives a quick summary of what that project would be?

pgaudet commented 5 years ago

@dustine32 Can you let us know where this now stands?

kltm commented 5 years ago

The current discussed outcome for this ticket is:

remove active metadata from SGN
- remove gaf entry
- remove canonical taxon entry for SGN

This will allow tomato entries to come in from GOA, at the cost of the experimental annotations from SGN. This is considered worth the cost to ensure that confusion from multiple namespaces does not occur.

geneontology / go-site

Allow tomato annotations through general channels (SGN not canonical resource) #1091