geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Periodically or dynamically create or edit the taxon slim used by GO to include all annotated and annotatable species #1955

Closed kltm closed 1 year ago

kltm commented 1 year ago

Currently, some species that are used, mostly in Noctua, are not represented in the taxon slim, causing issues like missing labels, etc. See: https://github.com/geneontology/noctua-landing-page/issues/87

We would like to come up with either an SOP for periodic updates or a dynamic solution.

pgaudet commented 1 year ago

Can we add this to a project ?

kltm commented 1 year ago

@pgaudet No problem, but I'm not sure where it would go. I suspect it's its own mini-project when all is said and done.

cmungall commented 1 year ago

Here are the instructions for adding to taxslim: https://github.com/obophenotype/ncbitaxon/blob/master/subsets/README.md

pgaudet commented 1 year ago

@balhoff The species should go in this file: https://github.com/obophenotype/ncbitaxon/blob/master/subsets/taxon-subset-ids.txt

kltm commented 1 year ago

In geneontology/neo, the make neo.obo target uses http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed.gpi.gz. Counting the unique taxon IDs in the output and in the input:

...@.../neo$ grep -oh "NCBITaxon:[0-9]*" neo.obo | sort | uniq | wc -l
14182
@.../neo$ grep -oh "taxon:[0-9]*" mirror/uniprot_reviewed.gpi.tmp | sort | uniq | wc -l
14216

@pgaudet there are currently ~6k entries in https://raw.githubusercontent.com/obophenotype/ncbitaxon/master/subsets/taxon-subset-ids.txt. Combining and de-duping these, we get 16283 entries. Does that sound right?
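
The counting and merging steps above can be sketched in Python. This is only an illustration, not the script used to build the combined list: the function names `extract_taxon_ids` and `merge_ids` are hypothetical, and it assumes that GPI `taxon:` IDs map directly onto `NCBITaxon:` IDs and that the subset file holds one `NCBITaxon:<digits>` ID per line.

```python
import re


def extract_taxon_ids(text, prefix):
    """Return the set of numeric taxon IDs that follow `prefix:` in text.

    Matching is case-sensitive, so prefix "taxon" does not match
    "NCBITaxon:...", mirroring the two separate grep calls above.
    """
    pattern = re.compile(re.escape(prefix) + r":([0-9]+)")
    return {int(m.group(1)) for m in pattern.finditer(text)}


def merge_ids(*id_sets):
    """Union the ID sets and return sorted, de-duplicated NCBITaxon CURIEs."""
    merged = set().union(*id_sets)
    return [f"NCBITaxon:{n}" for n in sorted(merged)]


if __name__ == "__main__":
    # Toy strings standing in for neo.obo / the GPI file and the subset file.
    from_gpi = extract_taxon_ids("taxon:9606\ntaxon:107268\ntaxon:9606\n", "taxon")
    from_subset = extract_taxon_ids("NCBITaxon:9606\nNCBITaxon:10090\n", "NCBITaxon")
    for curie in merge_ids(from_gpi, from_subset):
        print(curie)
```

Working with sets of integers rather than raw strings makes the de-duplication independent of the prefix each source uses, and sorting numerically keeps later diffs of the subset file small.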

kltm commented 1 year ago

If this looks right, I can do a PR. (I had to fork for the demonstration, as I didn't have permissions.) https://github.com/geneontology/ncbitaxon/blob/issue-go-site-1955-annotatable-species/subsets/taxon-subset-ids.txt @pgaudet @cmungall

pgaudet commented 1 year ago

Looks OK. For example, taxon 107268 has some reviewed entries: ~https://www.uniprot.org/uniprotkb/A0A8K1C0V2/entry~ (this one was NOT a reviewed entry)

https://www.uniprot.org/uniprotkb/Q9T3Q2/entry

Could you merge this?

Thanks, Pascale

balhoff commented 1 year ago

@kltm check the second to last line: https://github.com/geneontology/ncbitaxon/blob/946f1758908edd4d11a0b77030fbcd3264643f05/subsets/taxon-subset-ids.txt#L16282

pgaudet commented 1 year ago

Oops! Didn't read this far!

kltm commented 1 year ago

@balhoff Whoops, that's me--I probably introduced that when cat-ing the files together.
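
A quick format check before opening a PR can catch this kind of concatenation artifact (for instance, two IDs fused onto one line when a file lacks a trailing newline). A minimal sketch, assuming every line of the subset file should be exactly one `NCBITaxon:<digits>` ID; that format is an assumption, and `find_bad_lines` is a hypothetical helper, not part of the ncbitaxon repo:

```python
import re

# Assumed line format: a single NCBITaxon CURIE and nothing else.
ID_LINE = re.compile(r"NCBITaxon:[0-9]+")


def find_bad_lines(lines):
    """Return (line_number, text) pairs for lines that are not a single ID."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not ID_LINE.fullmatch(text):
            bad.append((number, text))
    return bad


if __name__ == "__main__":
    # Toy input; the second line mimics two IDs joined by a missing newline.
    sample = ["NCBITaxon:9606\n", "NCBITaxon:10090NCBITaxon:7227\n"]
    for number, text in find_bad_lines(sample):
        print(f"line {number}: unexpected content {text!r}")
```

Reporting 1-based line numbers makes the output line up with the `#L…` anchors GitHub uses when linking into a file.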

kltm commented 1 year ago

I don't have permission to go beyond https://github.com/obophenotype/ncbitaxon/pull/74. Tagging @cmungall @pgaudet @balhoff

balhoff commented 1 year ago

Can you fix the conflict? Then I think I can merge.

kltm commented 1 year ago

@balhoff I do not have the power to fix the conflict in that repo once the PR is created. It seems to just be additions, so I'm not sure why it's choking...

balhoff commented 1 year ago

Okay, I fixed and merged it.

kltm commented 1 year ago

@balhoff Cheers!

@pgaudet To close out this issue, we need to be doing this periodically or dynamically. While it's tempting to spend the energy adding something automated to the NEO pipeline to do this, I get the feeling that once a year might be fine? If we can work out what frequency and how to remind ourselves, would that allow this to be closed?

balhoff commented 1 year ago

I agree, I think keeping it up to date as needed, and checking periodically would be fine.

pgaudet commented 1 year ago

I would also do this 'on request'

Will this be added to Noctua in the next Noctua update? Or can this be done separately?

kltm commented 1 year ago

@pgaudet (Just writing down our earlier conversation.) The answer is "both". We do this as part of the Noctua update outages every two weeks, refreshing minerva and solr with NEO. We can also refresh just solr very quickly at any point, but that means that things appear as just IDs in Noctua everywhere that's not an autocomplete.

pgaudet commented 1 year ago

@balhoff

Why isn't the species showing for Q9T3Q2 in http://noctua.geneontology.org/workbench/noctua-visual-pathway-editor/?model_id=gomodel%3A636d9ce800000575 ?


Thanks, Pascale

balhoff commented 1 year ago

Based on discussion at the GOC meeting, we likely need to update the NEO build code that puts species name abbreviations into gene names.

kltm commented 1 year ago

Continuing here: https://github.com/geneontology/neo/issues/116