geneontology / neo

noctua entity ontology

Include viruses and bacteria in NEO #77

Closed pgaudet closed 2 years ago

pgaudet commented 2 years ago

The file is here:

http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_virus_bacteria.gpi

@kltm please let me know if you need more information.

Thanks, Pascale

kltm commented 2 years ago

Noting that this is a ~350k-line uncompressed GPI 1.2 file. Including it would be roughly a 25% increase.

kltm commented 2 years ago

I believe this could be hacked in like https://github.com/geneontology/neo/blob/10210c1e07218f74fa02b86237e672001bffc7de/Makefile#L41-L45 . It's probably best to time the addition after the next update cycle.

kltm commented 2 years ago

@pgaudet Would it be possible to get this as a compressed file from upstream like the others, for consistency and size?

pgaudet commented 2 years ago

@alexsign Can you please provide this data as a compressed file like the other GPIs?

Thanks, Pascale

alexsign commented 2 years ago

@pgaudet The file is gzipped now and will be compressed in future releases.

kltm commented 2 years ago

@alexsign Great, thank you.

kltm commented 2 years ago

@cmungall Part of the Makefile is running gpi2obo.pl, which would like arguments for species name and ontology id. If these are not provided, they essentially default to "generic". What would be good values in this case?

cmungall commented 2 years ago

Suffix with the taxon ID for now. Obviously this is not super-friendly, but we should progress incrementally. It's better to have some disambiguator than autocomplete flooded by 1000 rplNs.

When we rewrite my hacky old scripts from Perl to Python we will fix the whole naming strategy.
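
A minimal sketch of the taxon-ID suffixing idea above, written in Python rather than as the actual gpi2obo.pl change. The function name `entity_label`, the `species_code` parameter, and the `NCBITaxon:` label form are assumptions made for illustration only; the real script may format labels differently.

```python
from typing import Optional


def entity_label(symbol: str, taxon: str, species_code: Optional[str] = None) -> str:
    """Build a display label for an entity.

    If a species code is known it is used as the suffix; otherwise the
    taxon ID from the GPI Taxon column (e.g. "taxon:83333") is used so that
    identical symbols from different organisms stay distinguishable.
    """
    # Hypothetical label form: "<symbol> <suffix>".
    suffix = species_code if species_code else taxon.replace("taxon:", "NCBITaxon:")
    return f"{symbol} {suffix}"


# The same symbol in two different taxa yields two distinct labels.
labels = [entity_label("rplN", t) for t in ("taxon:83333", "taxon:224308")]
```

The point of the sketch is only that a non-empty, per-organism suffix keeps otherwise identical symbols apart in autocomplete.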

kltm commented 2 years ago

@cmungall Clarifying work: I'll extend gpi2obo.pl so that when a flag is on (for this case) the usual default value for species name is replaced with the taxon id.

What about for ontology id then?

cmungall commented 2 years ago

@alexsign: thanks for doing this, awesome!

Can you populate the properties field? I assume all should have db_subset=Swiss-Prot

@kltm: Should we not document this here: https://github.com/geneontology/go-site/blob/master/metadata/datasets/goa.yaml

together with inclusion/exclusion criteria (I assume this is only SP)

cmungall commented 2 years ago

"What about for ontology id then?"

uniprot_reviewed_virus_bacteria.{obo,owl}

cmungall commented 2 years ago

Just want to record the implications here:

This is fine, no discussion necessary, just recording this here in case there is any confusion later

kltm commented 2 years ago

@cmungall Yes, I thought about that, but:

I'm happy to go a more "normal" path as well, but would need to move a little slower.

cmungall commented 2 years ago

"we can't really fill in species and taxon, which seemed odd to me (although still technically passing schema validation)"

Well, we could list all 6k taxa in the yaml, but I agree this is suboptimal.

"I'm a little nervous about adding something oddly named and not normally handled into a main GO pipeline metadata file, as I'm not completely understanding all of the exceptions and handling rules around goa (and there are a lot). It's only for NEO at this point, so bolting it in like we did for sars-cov2 seemed expedient."

Totally fair, let's just proceed for now.

kltm commented 2 years ago

@cmungall Locally tested PR that may be able to close this issue here https://github.com/geneontology/neo/pull/79 . If taken, this would go live ~next Friday, unless people want this sooner. Tagging @pgaudet @vanaukenk

alexsign commented 2 years ago

@kltm @cmungall I added db_subset=Swiss-Prot in the code. The updated file should be available in a week's time with the new GOA release data. Please let me know if you need it sooner.

kltm commented 2 years ago

Currently running full post-merge test.

vanaukenk commented 2 years ago

@kltm - do we need to do any testing on the Noctua autocompletes?

kltm commented 2 years ago

From a discussion w/ @cmungall yesterday, I wanted to try and get a file product that could be eyeballed. A major concern was that this could flood out other things (a 25% increase in size with ~350k entities). While I'm testing the product production now, we could defer rolling this out until there is somebody available to take a look at it live.
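
One way to get something that can be eyeballed for the flooding concern is to count how many distinct taxa share each symbol in the GPI. This is only a sketch under the same GPI 1.2 column assumptions (symbol in column 3, taxon in column 7); `shared_symbols` and the sample rows are hypothetical and not part of the NEO pipeline.

```python
from collections import defaultdict


def shared_symbols(lines):
    """Return (symbol, number_of_taxa) pairs, most widely shared first."""
    taxa_by_symbol = defaultdict(set)
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed row; ignore it in this sketch
        taxa_by_symbol[cols[2]].add(cols[6])
    # Sort by descending taxon count, then alphabetically for stable output.
    return sorted(
        ((sym, len(taxa)) for sym, taxa in taxa_by_symbol.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


def _row(obj_id, symbol, taxon):
    """Build a made-up GPI 1.2 row for the example."""
    return "\t".join(
        ["UniProtKB", obj_id, symbol, "", "", "protein", taxon, "", "", ""]
    ) + "\n"


SAMPLE = [
    "!gpi-version: 1.2\n",
    _row("X00001", "rplN", "taxon:1"),
    _row("X00002", "rplN", "taxon:2"),
    _row("X00003", "rplN", "taxon:3"),
    _row("X00004", "rpsL", "taxon:1"),
]

counts = shared_symbols(SAMPLE)
```

Symbols near the top of the list are the ones most likely to crowd autocomplete results without a disambiguating suffix.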

vanaukenk commented 2 years ago

It'd be good to have the Swiss-Prot curators test for ids they'd expect to curate, and I'm happy to do other id testing just in case.

pgaudet commented 2 years ago

Where can this be tested? Is this on Noctua or on some test server?

kltm commented 2 years ago

@pgaudet This would be tested by running the pipeline, looking at the results, then applying them to the autocomplete server (reverting if we don't like it). That said, this is currently blocked by #80 .

pgaudet commented 2 years ago

Just confirmed with @pmasson55 that the Swiss-Prot reviewed is OK (for the record also, in response to https://github.com/geneontology/neo/issues/77#issuecomment-1024769754)

kltm commented 2 years ago

Created working branch https://github.com/geneontology/neo/tree/issue-80-new-virus-bacteria

kltm commented 2 years ago

Talking to @vanaukenk , we'll be temporarily switching back to ecocyc to get a NEO release out before continuing work.

pgaudet commented 2 years ago

This is now a dupe of #82