Closed by pgaudet 2 years ago
Noting that this is a ~350k-line uncompressed GPI 1.2 file. Including it is roughly a 25% increase.
I believe this could be hacked in like https://github.com/geneontology/neo/blob/10210c1e07218f74fa02b86237e672001bffc7de/Makefile#L41-L45 Probably best to time the addition after the next update cycle.
@pgaudet Would it be possible to get this as a compressed file from upstream like the others, for consistency and size?
@alexsign Can you please provide this data as a compressed file like the other GPIs?
Thanks, Pascale
@pgaudet The file is gzipped now and will be compressed in future releases.
@alexsign Great, thank you.
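For anyone wanting to sanity-check the size figures against the gzipped file, a minimal sketch follows. It assumes GPI header and comment lines start with `!`; the function names are hypothetical and this is not part of the NEO Makefile.

```python
import gzip


def count_gpi_entities(stream):
    """Count data lines in a text stream of GPI, skipping blanks and '!' headers."""
    return sum(1 for line in stream if line.strip() and not line.startswith("!"))


def count_gzipped_gpi(path):
    """Open a gzipped GPI file in text mode and count its entity lines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_gpi_entities(handle)
```

Calling `count_gzipped_gpi` on a local copy of the downloaded file would give the number of entities to compare against the ~350k estimate.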
@cmungall Part of the Makefile is running gpi2obo.pl, which would like arguments for species name and ontology id. If these are not provided, they essentially default to "generic". What would be good values in this case?
Suffix with the taxon ID for now. Obviously this is not super-friendly but we should progress incrementally. It's better to have some disambiguator than autocomplete flooded by 1000 rplNs.
When we rewrite my hacky old scripts from Perl to Python we will fix the whole naming strategy.
@cmungall Clarifying work: I'll extend gpi2obo.pl so that when a flag is on (for this case) the usual default value for species name is replaced with the taxon id.
What about for ontology id then?
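To make the proposal concrete, here is a rough Python sketch of the suffixing idea. It is not the actual gpi2obo.pl logic: the function name, the `use_taxon_id` flag, and the "symbol plus suffix" label format are all hypothetical, and it assumes the GPI 1.2 column layout with the symbol in column 3 and the taxon (e.g. `taxon:562`) in column 7.

```python
def suffixed_label(gpi_line, use_taxon_id=True, default_species="generic"):
    """Build a display label from one GPI 1.2 line (hypothetical label format).

    With use_taxon_id on, the species suffix is the numeric taxon id taken
    from column 7; otherwise the usual default species name is used.
    """
    cols = gpi_line.rstrip("\n").split("\t")
    symbol = cols[2]
    taxon = cols[6] if len(cols) > 6 else ""
    if use_taxon_id and taxon:
        # "taxon:562" -> "562"
        suffix = taxon.split(":", 1)[-1]
    else:
        suffix = default_species
    return f"{symbol} {suffix}"
```

With the flag on, two rplN entries from different taxa would get distinct labels instead of both reading as "rplN generic".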
@alexsign: thanks for doing this, awesome!
Can you populate the properties field? I assume all should have db_subset=Swiss-Prot
@kltm: Should we not document this here: https://github.com/geneontology/go-site/blob/master/metadata/datasets/goa.yaml together with inclusion/exclusion criteria (I assume this is only SP)?
What about for ontology id then?
uniprot_reviewed_virus_bacteria.{obo,owl}
Just want to record the implications here:
This is fine, no discussion necessary, just recording this here in case there is any confusion later
@cmungall Yes, I thought about that, but:
I'm happy to go a more "normal" path as well, but would need to move a little slower.
we can't really fill in species and taxon, which seemed odd to me (although still technically passing schema validation)
well we could list all 6k taxa in the yaml, but I agree this is suboptimal
I'm a little nervous about adding something oddly named and not normally handled into a main GO pipeline metadata file, as I don't completely understand all of the exceptions and handling rules around goa (and there are a lot). It's only for NEO at this point, so bolting it in like we did for sars-cov2 seemed expedient.
totally fair, let's just proceed for now
@cmungall Locally tested PR that may be able to close this issue here https://github.com/geneontology/neo/pull/79 . If taken, this would go live ~next Friday, unless people want this sooner. Tagging @pgaudet @vanaukenk
@kltm @cmungall I added db_subset=Swiss-Prot in the code. The updated file should be available in a week's time with the new GOA release data. Please let me know if you need it sooner.
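Once the updated file is out, a check along these lines could confirm the property is populated. This is only a sketch: it assumes the properties field is column 10 of GPI 1.2 and holds pipe-separated `key=value` pairs, and the function names are hypothetical.

```python
def parse_properties(field):
    """Parse a pipe-separated key=value properties field into a dict of lists."""
    props = {}
    for item in field.split("|"):
        if "=" in item:
            key, value = item.split("=", 1)
            props.setdefault(key, []).append(value)
    return props


def is_swiss_prot(gpi_line):
    """Return True if the line's properties include db_subset=Swiss-Prot."""
    cols = gpi_line.rstrip("\n").split("\t")
    if len(cols) < 10:
        # No properties column present on this line.
        return False
    return "Swiss-Prot" in parse_properties(cols[9]).get("db_subset", [])
```

Running `is_swiss_prot` over every data line and counting the failures would show whether any entries are missing the property.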
Currently running full post-merge test.
@kltm - do we need to do any testing on the Noctua autocompletes?
From a discussion w/ @cmungall yesterday, I wanted to try and get a file product that could be eyeballed. A major concern was that this could flood out other things (a 25% increase in size with ~350k entities). While I'm testing the product production now, we could defer rolling this out until there is somebody available to take a look at it live.
It'd be good to have the Swiss-Prot curators test for ids they'd expect to curate, and I'm happy to do other id testing just in case.
Where can this be tested? Is this on Noctua or on some test server?
@pgaudet This would be tested by running the pipeline, looking at the results, then applying them to the autocomplete server (reverting if we don't like it). That said, this is currently blocked by #80 .
Just confirmed with @pmasson55 that the Swiss-Prot reviewed is OK (for the record also, in response to https://github.com/geneontology/neo/issues/77#issuecomment-1024769754)
Created working branch https://github.com/geneontology/neo/tree/issue-80-new-virus-bacteria
Talking to @vanaukenk , we'll be temporarily switching back to ecocyc to get a NEO release out before continuing work.
This is now a dupe of #82
The file is here:
http://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed_virus_bacteria.gpi
@kltm please let me know if you need more information.
Thanks, Pascale