Open kltm opened 5 years ago
Okay, from a quick survey, it looks like there is not much in the way of overlap. In goa:
1 RCA
4 IGI
8 TAS
19 ISS
35 IPI
37 IEP
52 IMP
98 IDA
4365 IBA
103028 IEA
In sgn:
2 IPI
3 IGI
6 IC
7 TAS
54 IMP
134 IEP
177 ISS
1075 IDA
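For anyone wanting to reproduce tallies like these, here is a minimal sketch in Python. It assumes a standard tab-separated GAF with the evidence code in column 7 and the taxon in column 13. The function name `count_evidence_codes` and the use of `taxon:4081` for tomato are illustrative choices, not necessarily how the counts above were produced.

```python
from collections import Counter
from typing import Iterable


def count_evidence_codes(lines: Iterable[str], taxon: str) -> Counter:
    """Tally GAF evidence codes (column 7) for rows annotated to `taxon`.

    `taxon` is matched against each "|"-separated entry of column 13,
    e.g. "taxon:4081" (assumed here to be the tomato taxon ID).
    """
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("!"):  # GAF header / comment line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:  # malformed or truncated row; skip it
            continue
        if taxon in cols[12].split("|"):
            counts[cols[6]] += 1
    return counts
```

To run it over a compressed file, pass an open text handle, for example `gzip.open(path, "rt")`, as `lines`.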
Okay, to clarify, @thomaspd 's issue is the single namespace, rather than conflicting annotation.
From talking with @thomaspd :
The namespace issue has to get cleared up first as SGN uses SGN IDs and UniProt uses UniProtKB.
Either:
The taxa: property stays set in sgn.yaml. They bring in IEAs (performing UniProtKB-to-SGN mapping themselves) from goa_uniprot_all. IBAs would have to be mapped to SGN then too, which may be a nontrivial task since this isn't what's immediately available in our Panther long IDs (ex: SOLLC|EnsemblGenome=Solyc04g014800.2|UniProtKB=A0A140TAT3), though we have ways of getting around this.
Or:
The taxa: property is removed from sgn.yaml and they would convert their SGN IDs to UniProt in their GAF.
Retagging @thomaspd to make sure I explained this correctly.
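To make the mapping problem concrete, here is a small sketch of pulling the namespaces out of a Panther long ID like the one quoted above. It assumes the layout is an organism code followed by "|"-separated `Namespace=identifier` pairs, which is what the example shows. The helper name `parse_panther_long_id` is made up for illustration.

```python
def parse_panther_long_id(long_id: str) -> dict:
    """Split a Panther long ID into its organism code and namespace/ID pairs.

    Assumes the form ORGANISM|Namespace=identifier|Namespace=identifier,
    as in the example quoted in this thread.
    """
    organism, *pairs = long_id.split("|")
    parsed = {"organism": organism}
    for pair in pairs:
        # Split on the first "=" only, in case an identifier contains one.
        namespace, _, identifier = pair.partition("=")
        parsed[namespace] = identifier
    return parsed


example = parse_panther_long_id(
    "SOLLC|EnsemblGenome=Solyc04g014800.2|UniProtKB=A0A140TAT3"
)
# The example carries EnsemblGenome and UniProtKB keys but no SGN key,
# which is the gap described above.
```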
Given a single namespace restriction, the choices are: 1) SGN converts (possible? how long would it take?) 2) remove SGN (easy, lose some annotations) 3) filter goa (original state, lose a lot of IEA and some others)
Yes, just to flesh out the options, in the order of my preference:
1) SGN stays the "authoritative source". Like all the other authoritative sources, they convert both IEAs and IBAs (currently both using UniProt namespace) to SGN IDs and put them in their submitted GAF.
2) SGN changes status to contributor, and they convert their identifiers to UniProt, so the pipeline can merge them into the total GAF file (along with IEAs from UniProt and IBAs from PAINT) for tomato.
3) SGN retains status as authoritative source, but would not do what the other authoritative sources are doing. They would keep SGN identifiers, and eventually GO Central would map the IEAs and IBAs to SGN identifiers and merge into the total GAF file. This workflow is on our roadmap, but not implemented, so in the short term this would mean we're at the original state.
Seth's option #2 above might not be a good one, as we don't want to drop experimental annotations.
@thomaspd @kltm Thinking a bit more about handling IBAs. If SGN were to take it upon themselves to convert everything including IEAs to SGN namespace (Paul's #1), wouldn't the IBAs be stripped from the SGN file in accordance with gorule-26? The onus would then be on PAINT to output SGN IDs in the paint_sgn.gaf? And if SGN namespace is the winner then, regardless of who performs the initial conversion to SGN, the PAINT/Panther update pipeline would have to map SGN IDs back into the Panther long ID in order to cycle any tomato back through PAINT/Panther.
So I will explain to the SGN folks that if they want to use SGN IDs in their GAF, that means they sign up to take responsibility for ALL tomato annotations and thus will be required to do all the wrangling of tomato annotations from other sources (goa_uniprot_all IEAs, PAINT IBAs) for depositing to GO Central. Otherwise, if they don't wanna have to handle this, they can still contribute but their GAF IDs need to be in UniProtKB namespace.
Sound good?
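As a rough illustration of what "GAF IDs in UniProtKB namespace" would involve, here is a sketch that rewrites columns 1 and 2 of a GAF row using an ID mapping. Everything here is an assumption for illustration: the `convert_namespace` name, the `SGN` and `UniProtKB` prefixes as defaults, and the mapping dict itself, which would have to come from SGN or another upstream source. It is not an existing pipeline function.

```python
from typing import Mapping, Optional


def convert_namespace(
    line: str,
    mapping: Mapping[str, str],
    source_db: str = "SGN",         # assumed DB prefix in column 1
    target_db: str = "UniProtKB",
) -> Optional[str]:
    """Rewrite one GAF row from `source_db` IDs to `target_db` IDs.

    `mapping` is a hypothetical source-ID -> target-ID lookup.
    Returns the rewritten row, the row unchanged if it is a header or
    belongs to another namespace, or None if the ID has no mapping
    (so the caller can decide whether to drop or report it).
    """
    if line.startswith("!"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 2 or cols[0] != source_db:
        return line
    target_id = mapping.get(cols[1])
    if target_id is None:
        return None
    cols[0], cols[1] = target_db, target_id
    return "\t".join(cols) + "\n"
```

Returning `None` for unmapped IDs keeps the policy question (drop, keep in the old namespace, or flag) out of the conversion step, since that is the part still under discussion here.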
I think assuming the single namespace is, to some extent, a slightly new thing. For example:
bbop@wok:/tmp/bib⟫ reset && for filename in ./*.gaf.gz; do echo "$filename" && zgrep -v --no-filename "^!" $filename | cut -f 1,13 | awk '{ print $2 " " $1}' | sort | uniq | cut -d " " -f 1 | uniq -c; done;
Gives the (trimmed) output of:
./cgd.gaf.gz
2 taxon:237561
./dictybase.gaf.gz
2 taxon:44689
./ecocyc.gaf.gz
2 taxon:83333
./fb.gaf.gz
2 taxon:7227
./goa_uniprot_all_noiea.gaf.gz
2 taxon:11676
2 taxon:31033
./mgi.gaf.gz
2 taxon:10090
./pamgo_oomycetes.gaf.gz
2 taxon:67593
2 taxon:67593|taxon:3847
./pombase.gaf.gz
2 taxon:284812
./rgd.gaf.gz
2 taxon:10116
./tair.gaf.gz
2 taxon:3702
./wb.gaf.gz
2 taxon:6239
./zfin.gaf.gz
2 taxon:7955
That is a count of namespaces per taxon. This often seems to be the resource namespace plus UniProtKB. I point this out as there seems to be no current technical restriction on this in other places.
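The same check can be sketched in Python, which also makes it easy to list which namespaces are involved rather than only how many. This assumes, like the one-liner above, that column 1 is the namespace and column 13 is the taxon, with the whole taxon field (including any "|" pairs) used as the key. The function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def namespaces_per_taxon(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect the set of column-1 namespaces seen for each column-13 taxon."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("!"):  # skip GAF header / comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:  # malformed or truncated row; skip it
            continue
        seen[cols[12]].add(cols[0])
    return dict(seen)


def mixed_namespace_taxa(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return only the taxa that appear under more than one namespace."""
    return {
        taxon: sorted(namespaces)
        for taxon, namespaces in namespaces_per_taxon(lines).items()
        if len(namespaces) > 1
    }
```

Feeding it `gzip.open(filename, "rt")` for each `*.gaf.gz` file would give the per-file breakdown.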
Wow, wonders never cease! We should address this at the next GO meeting. I think it's really important for users that we have a single namespace per GAF.
@kltm What are the namespaces? (There are 2, but which ones?) Perhaps we can start to fix this before the GOC meeting?
Thanks, Pascale
@pgaudet As above, they are by and large the resource namespace and UniProtKB. The exceptions seem to be:
pamgo_oomycetes.gaf.gz
taxon:67593 NCBI_GP
taxon:67593 PAMGO_VMD
taxon:67593|taxon:3847 NCBI_GP
taxon:67593|taxon:3847 PAMGO_VMD
goa_uniprot_all_noiea.gaf.gz
taxon:8355 RNAcentral
taxon:8355 UniProtKB
taxon:8090 RNAcentral
taxon:8090 UniProtKB
taxon:7788 ComplexPortal
taxon:7788 UniProtKB
taxon:31033 RNAcentral
taxon:31033 UniProtKB
taxon:11676 RNAcentral
taxon:11676 UniProtKB
I think the namespace must take into account the type of object? It seems correct to me that we use ComplexPortal, UniProtKB and RNAcentral for the same taxon.
@alexsign @vanaukenk
Getting back to tomato and looking at the counts @kltm produced above (https://github.com/geneontology/go-site/issues/1091#issuecomment-492442631) - I was surprised by the differences as I would have expected goa to hoover up the sgn annotations - yet we have 1k IDAs in SGN and ~100 in the uniprot file.
It looks like the majority of these SGN gene IDs may not be mapped to UniProt IDs? If so this is upstream of us, cc @alexsign is this the case?
For now I think the best thing to do is to include both SGN and UniProt (Seth's suggestion), even though there will be some redundancy with the same thing appearing under different IDs, but we need to have a canonical set of IDs for tomato.
From the software discussion today, with input from @thomaspd and @cmungall, we'll be temporarily going with the permissive approach and allow tomato to have two possible namespaces in different files. Literally, remove the filter from sgn.yaml, allowing annotations from both SGN and GOA, as implemented in https://github.com/geneontology/go-site/pull/1090. This is the current state of the pipeline; no further action should be needed.
We currently have no ticket for this roadmap issue--it is essentially a larger question of how we handle various inputs as we move forward with both accepting more upstreams and centralizing many use cases.
@kltm Can you open a ticket? I think you would formulate the issue better than I would. We don't want this to fall through the cracks.
Thanks, Pascale
@pgaudet I don't think there is anything more to do on this ticket as it stands. I would correct myself and say it's more of a project unto itself (as in exhaustive software list) and still TBD.
Right, I was suggesting to open a ticket to make sure it doesn't fall through the cracks - can you open a ticket that gives a quick summary of what that project would be?
@dustine32 Can you let us know where this now stands?
The current discussed outcome for this ticket is:
This will allow tomato entries to come in from GOA, at the cost of experimental annotations from SGN. This is considered worth the cost to ensure that confusion from multiple namespaces does not occur.
The outcome of this should be that tomato annotations do not get filtered out of goa_uniprot_all and are able to get picked up by AmiGO and downstreams (like PANTHER).
https://github.com/geneontology/pipeline/issues/92 https://github.com/geneontology/go-site/pull/1090
Tagging a mess o' people: @dustine32 @dougli1sqrd @pgaudet @cmungall