geneontology / gaferencer

Perform annotation deepening and satisfiability checking for GO annotations in GAFs
MIT License

Pre-generate a taxon-term-tuple file indicating which combos are valid #1

Closed cmungall closed 5 years ago

cmungall commented 6 years ago

[This ticket may be moved to a different repo. A lot of the functionality is the same as gaferencer, so putting it here for now.]

Given a list of taxa (species and intermediate nodes), pre-generate a table indicating which term-taxon combos are valid. Use go-plus and either ncbitaxon.owl or some subset as additional inputs.

This could be done right now by making a fake GAF with every term x taxon combo, though that seems a bit hacky?

We should either adapt gaferencer or just lift the code parts we need into a new tool.

An additional wrinkle here is that PAINT uses its own symbols for some intermediate nodes. We can address that in a future pass.

Picture from USC: image

cmungall commented 6 years ago

This is the format that PAINT currently consumes:

http://data.pantherdb.org/TaxonConstraintsLookup.txt

The tool need not produce something identical to start with; @dustine32 can help with post-transformation

balhoff commented 5 years ago

@dustine32 where can I get the list of taxon IDs for PAINT?

dustine32 commented 5 years ago

@balhoff Dang! How did I not see this until just now?

I got a list for you here: paint_taxons.txt

It's just the list of NCBITaxon:###'s so let me know if you need the taxon labels too. Thanks!

balhoff commented 5 years ago

Thanks that's perfect.

balhoff commented 5 years ago

@dustine32 @cmungall does this format work? It's tsv.

GOterm  NCBITaxon:100226    NCBITaxon:10090 NCBITaxon:10116 NCBITaxon:10228 NCBITaxon:1111708   NCBITaxon:1117  NCBITaxon:1118  NCBITaxon:117571    NCBITaxon:1206794   NCBITaxon:1224  NCBITaxon:122586    NCBITaxon:1236  NCBITaxon:1239  NCBITaxon:1300  NCBITaxon:136   NCBITaxon:13616 NCBITaxon:1385  NCBITaxon:1386  NCBITaxon:1437010   NCBITaxon:1437201   NCBITaxon:147538    NCBITaxon:147550    NCBITaxon:1485  NCBITaxon:15368 NCBITaxon:164328    NCBITaxon:169963    NCBITaxon:171101    NCBITaxon:1763  NCBITaxon:178306    NCBITaxon:183924    NCBITaxon:184922    NCBITaxon:185431    NCBITaxon:186801    NCBITaxon:186826    NCBITaxon:188787    NCBITaxon:188937    NCBITaxon:189518    NCBITaxon:190304    NCBITaxon:190485    NCBITaxon:2 NCBITaxon:2037  NCBITaxon:207598    NCBITaxon:208964    NCBITaxon:211586    NCBITaxon:214684    NCBITaxon:2157  NCBITaxon:2236  NCBITaxon:224308    NCBITaxon:224324    NCBITaxon:224756    NCBITaxon:224911    NCBITaxon:2259  NCBITaxon:226186    NCBITaxon:2266  NCBITaxon:226900    NCBITaxon:227321    NCBITaxon:227377    NCBITaxon:237561    NCBITaxon:237631    NCBITaxon:243090    NCBITaxon:243230    NCBITaxon:243231    NCBITaxon:243232    NCBITaxon:243273    NCBITaxon:243274    NCBITaxon:243277    NCBITaxon:251221    NCBITaxon:272561    NCBITaxon:273057    NCBITaxon:2759  NCBITaxon:28211 NCBITaxon:28221 NCBITaxon:2836  NCBITaxon:28377 NCBITaxon:284591    NCBITaxon:284811    NCBITaxon:284812    NCBITaxon:28890 NCBITaxon:289376    NCBITaxon:29760 NCBITaxon:3055  NCBITaxon:314145    NCBITaxon:314146    NCBITaxon:3193  NCBITaxon:321614    NCBITaxon:3218  NCBITaxon:32443 NCBITaxon:324602    NCBITaxon:32523 NCBITaxon:32524 NCBITaxon:32561 NCBITaxon:33083 NCBITaxon:330879    NCBITaxon:33090 NCBITaxon:33154 NCBITaxon:33208 NCBITaxon:33213 NCBITaxon:33317 NCBITaxon:33392 NCBITaxon:33511 NCBITaxon:33554 NCBITaxon:33630 NCBITaxon:33634 NCBITaxon:3398  NCBITaxon:35128 NCBITaxon:356   NCBITaxon:36329 NCBITaxon:367110    NCBITaxon:3694  NCBITaxon:3702  
NCBITaxon:374847    NCBITaxon:3847  NCBITaxon:39107 NCBITaxon:39947 NCBITaxon:40674 NCBITaxon:4081  NCBITaxon:41665 NCBITaxon:418459    NCBITaxon:422676    NCBITaxon:436308    NCBITaxon:441771    NCBITaxon:44689 NCBITaxon:4479  NCBITaxon:451864    NCBITaxon:451871    NCBITaxon:45351 NCBITaxon:4577  NCBITaxon:4734  NCBITaxon:4751  NCBITaxon:4783  NCBITaxon:4890  NCBITaxon:4892  NCBITaxon:4893  NCBITaxon:5052  NCBITaxon:50557 NCBITaxon:515635    NCBITaxon:5204  NCBITaxon:54126 NCBITaxon:543   NCBITaxon:554915    NCBITaxon:559292    NCBITaxon:5654  NCBITaxon:5664  NCBITaxon:5722  NCBITaxon:5759  NCBITaxon:5782  NCBITaxon:5786  NCBITaxon:5794  NCBITaxon:5888  NCBITaxon:6020  NCBITaxon:6072  NCBITaxon:6237  NCBITaxon:6238  NCBITaxon:6239  NCBITaxon:632   NCBITaxon:64091 NCBITaxon:6412  NCBITaxon:665079    NCBITaxon:6656  NCBITaxon:6669  NCBITaxon:684364    NCBITaxon:69014 NCBITaxon:6945  NCBITaxon:7070  NCBITaxon:71275 NCBITaxon:71421 NCBITaxon:7147  NCBITaxon:715340    NCBITaxon:715989    NCBITaxon:7165  NCBITaxon:7227  NCBITaxon:7668  NCBITaxon:7711  NCBITaxon:7718  NCBITaxon:7719  NCBITaxon:7739  NCBITaxon:7918  NCBITaxon:7955  NCBITaxon:8090  NCBITaxon:81824 NCBITaxon:83332 NCBITaxon:83333 NCBITaxon:8364  NCBITaxon:85003 NCBITaxon:85007 NCBITaxon:85962 NCBITaxon:9031  NCBITaxon:91061 NCBITaxon:91835 NCBITaxon:9258  NCBITaxon:93061 NCBITaxon:9347  NCBITaxon:9526  NCBITaxon:9544  NCBITaxon:9595  NCBITaxon:9598  NCBITaxon:9606  NCBITaxon:9615  NCBITaxon:9685  NCBITaxon:976   NCBITaxon:9787  NCBITaxon:9796  NCBITaxon:9823  NCBITaxon:9913  NCBITaxon:99287
GO:0000001  0   1   1   1   0   0   0   1   1   0   0   0   0   0   0   1   0   0   1   1   1   1   0   1   1   0   0   0   0   0   1   1   0   0   0   0   0   0   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0   0   1   0   1   1   0   0   0   0   0   0   0   0   0   0   1   0   0   1   1   1   1   1   0   0   1   1   1   1   1   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   1   1   1   1   0   1   1   1   1   1   1   1   1   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   1   1   1   1   1   0   1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   1   1   0   0   1   0   1   1   0   1   1   1   1   1   1   1   1   0   1   1   1   1   0




1 == 'satisfiable'
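
A minimal sketch of how a consumer might read this matrix, assuming the layout shown above (a header row of taxon IDs, then one row per GO term with 0/1 flags). The function name `load_taxon_term_table` is hypothetical and not part of gaferencer; the two sample values mirror the first two columns of the GO:0000001 row above.

```python
import csv
import io


def load_taxon_term_table(handle):
    """Read the term-by-taxon TSV into {go_term: {taxon_id: bool}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    taxa = header[1:]  # the first column holds the GO term ID
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        term, flags = row[0], row[1:]
        # "1" means the term-taxon combination is satisfiable
        table[term] = {taxon: flag == "1" for taxon, flag in zip(taxa, flags)}
    return table


# Two-column excerpt in the same layout as the table above.
sample = (
    "GOterm\tNCBITaxon:100226\tNCBITaxon:10090\n"
    "GO:0000001\t0\t1\n"
)
table = load_taxon_term_table(io.StringIO(sample))
print(table["GO:0000001"]["NCBITaxon:10090"])   # True
print(table["GO:0000001"]["NCBITaxon:100226"])  # False
```

A nested dict keeps lookups simple; a JSON export would follow directly from the same structure.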

dustine32 commented 5 years ago

I think that format looks great! Tagging @mugitty to see if this also works for the PAINT tool.

balhoff commented 5 years ago

Cool—I could do JSON instead if you want.

mugitty commented 5 years ago

The tab format works for me. I need to convert the taxon ids to the species we have in the trees.
What about internal nodes with Ancestral species that do not have taxon ids?

dustine32 commented 5 years ago

@mugitty Oh right, I forgot about the ancestral species w/o taxon IDs. We should look up a few examples to see if we could actually map these to taxon IDs. There's this handy site I found while looking up NCBITaxon:10090 for mouse that lets you trace up to ancestral terms like Eukaryota (NCBITaxon:2759).

I have the old taxon constraint lookup file that I can try mapping all species to taxon IDs.

cmungall commented 5 years ago

Looks like bioportal is out of date

For browsing, you can also just use the PURL http://purl.obolibrary.org/obo/NCBITaxon_10090

or OLS

https://www.ebi.ac.uk/ols/ontologies/ncbitaxon/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCBITaxon_10090

Let's see how many of the intermediates can be mapped to NCBITaxon - possibly all. If not we have to make axioms to merge into the ontology

mugitty commented 5 years ago

We have the taxon ids for some of the organisms. For example, in the current version of PANTHER, these are the organisms and corresponding taxon ids:
"organism","taxon_id"
"Alveolata-Stramenopiles",""
"Archaea-Eukaryota",""
"Artiodactyla",""
"Aconoidasida","422676"
"Actinobacteridae","85003"
"Actinomycetales","2037"
"Alphaproteobacteria","28211"
"Alveolata","33630"
"Amniota","32524"
"Amoebozoa","554915"
"Anolis carolinensis","28377"
"Anopheles gambiae","7165"
"Apicomplexa","5794"
"Aquifex aeolicus","224324"
"Arabidopsis thaliana","3702"
"Archaea","2157"
"Arthropoda","6656"
"Ascomycota","4890"
"Ashbya gossypii","284811"
"Aspergillus","5052"
"Bacillales","1385"
"Bacillariophyta","2836"
"Bacilli","91061"
"Bacillus","1386"
"Bacillus_cereus_group",""
"bacteroidetes-chlorobi",""
"BEP_clade",""
"Craniata-Cephalochordata",""
"Delta-epsilon_subdivisions",""
"Excavates",""
"Fornicata-Parabasalids",""
"Ciona","7718"
"Hexapoda-Crustacea",""
"Homo-Pan",""
"LUCA",""
"Metazoa-Choanoflagellida",""
"Osteichthyes",""
"Pezizomycotina-Saccharomycotina",""
"Rhabditida-Chromadorea",""
"Saccharomycetaceae-Candida",""
"Sordariomycetes-Leotiomycetes",""
"Theria",""
"Unikonts",""
"Bacillus cereus","226900"
"Bacillus subtilis","224308"
"Bacteroides thetaiotaomicron","226186"
"Bacteroidetes","976"
"Basidiomycota","5204"
"Batrachochytrium dendrobatidis","684364"
"Bilateria","33213"
"Boreoeutheria","1437010"
"Bos taurus","9913"
"Brachypodium distachyon","15368"
"Bradyrhizobium diazoefficiens","224911"
"Branchiostoma floridae","7739"
"Caenorhabditis","6237"
"Caenorhabditis briggsae","6238"
"Caenorhabditis elegans","6239"
"Candida albicans","237561"
"Canis lupus familiaris","9615"
"Carnivora","33554"
"Catarrhini","9526"
"Chlamydia trachomatis","272561"
"Chlamydomonas reinhardtii","3055"
"Chloroflexus aurantiacus","324602"
"Chordata","7711"
"Chroococcales","1118"
"Ciona intestinalis","7719"
"Clostridia","186801"
"Clostridium","1485"
"Clostridium botulinum","441771"
"commelinids","4734"
"Stramenopiles","33634"
"Cyanobacteria","1117"
"Corynebacterineae","85007"
"Coxiella burnetii","227377"
"Cryptococcus neoformans","214684"
"Danio rerio","7955"
"Daphnia pulex","6669"
"Deinococci","188787"
"Deinococcus radiodurans","243230"
"Deltaproteobacteria","28221"
"Deuterostomia","33511"
"Dictyoglomus turgidum","515635"
"Dictyosteliida","33083"
"Dictyostelium","5782"
"Dictyostelium discoideum","44689"
"Dictyostelium purpureum","5786"
"Dikarya","451864"
"Diptera","7147"
"Drosophila melanogaster","7227"
"Ecdysozoa","1206794"
"Embryophyta","3193"
"Emericella nidulans","227321"
"Endopterygota","33392"
"Entamoeba histolytica","5759"
"Enterobacteriaceae","543"
"Equus caballus","9796"
"Escherichia coli","83333"
"Euarchontoglires","314146"
"Eubacteria","2"
"Eukaryota","2759"
"Eumetazoa","6072"
"Eurotiomycetidae","451871"
"Euryarchaeota","28890"
"Euteleostomi","117571"
"Eutheria","9347"
"fabids","91835"
"Felis catus","9685"
"Firmicutes","1239"
"Fungi","4751"
"Fusobacterium nucleatum","190304"
"Gallus gallus","9031"
"Gammaproteobacteria","1236"
"Geobacter sulfurreducens","243231"
"Giardia intestinalis","184922"
"Gloeobacter violaceus","251221"
"Glycine max","3847"
"Gorilla gorilla gorilla","9595"
"Haemophilus influenzae","71421"
"Halobacteriaceae","2236"
"Halobacterium salinarum","64091"
"Helicobacter pylori","85962"
"helobdella robusta","6412"
"Homo sapiens","9606"
"Homininae","207598"
"Insecta","50557"
"Ixodes scapularis","6945"
"Korarchaeum cryptofilum","374847"
"Lactobacillales","186826"
"Laurasiatheria","314145"
"Leishmania major","5664"
"lepisosteus oculatus","7918"
"Leptospira interrogans","189518"
"Listeria monocytogenes","169963"
"Macaca mulatta","9544"
"Magnoliophyta","3398"
"Mammalia","40674"
"Metazoa","33208"
"Methanocaldococcus jannaschii","243232"
"Methanomicrobia","224756"
"Methanosarcina acetivorans","188937"
"Monodelphis domestica","13616"
"Monosiga brevicollis","81824"
"Murinae","39107"
"Mus musculus","10090"
"Mycobacterium","1763"
"Mycobacterium tuberculosis","83332"
"mycoplasma genitalium","243273"
"Neisseria meningitidis serogroup b","122586"
"Nematostella vectensis","45351"
"Neopterygii","41665"
"Neosartorya fumigata","330879"
"Neurospora crassa","367110"
"Nitrosopumilus maritimus","436308"
"Oligohymenophorea","6020"
"Opisthokonts","33154"
"Ornithorhynchus anatinus","9258"
"Oryza sativa","39947"
"Pan troglodytes","9598"
"Paramecium tetraurelia","5888"
"Pentapetalae","1437201"
"Perissodactyla","9787"
"Pezizomycotina","147538"
"Phaeosphaeria nodorum","321614"
"Physcomitrella patens","3218"
"Phytophthora","4783"
"Phytophthora ramorum","164328"
"Plasmodium falciparum","36329"
"Pleosporineae","715340"
"Poaceae","4479"
"Populus trichocarpa","3694"
"Pristionchus pacificus","54126"
"Proteobacteria","1224"
"Protostomia","33317"
"Pseudomonas aeruginosa","208964"
"Puccinia graminis","418459"
"Pyrobaculum aerophilum","178306"
"Rattus norvegicus","10116"
"Rhizobiales","356"
"Rhodopirellula baltica","243090"
"rosids","71275"
"Saccharomyces cerevisiae","559292"
"Saccharomycetaceae","4893"
"Saccharomycetales","4892"
"Salmonella typhimurium","99287"
"Sauria","32561"
"Schizosaccharomyces pombe","284812"
"Sclerotinia sclerotiorum","665079"
"Shewanella oneidensis","211586"
"Solanum lycopersicum","4081"
"Sordariomyceta","715989"
"Sordariomycetes","147550"
"Spirochaetales","136"
"Staphylococcus aureus","93061"
"Streptococcaceae","1300"
"Streptococcus pneumoniae","171101"
"Streptomyces coelicolor","100226"
"Strongylocentrotus purpuratus","7668"
"Sulfolobus solfataricus","273057"
"Sus scrofa","9823"
"Synechocystis","1111708"
"Teleostei","32443"
"Tetrapoda","32523"
"Thalassiosira pseudonana","35128"
"Thermococcaceae","2259"
"Thermococcus kodakaraensis","69014"
"Thermodesulfovibrio yellowstonii","289376"
"Thermoproteales","2266"
"Thermoprotei","183924"
"Thermotoga maritima","243274"
"Oryzias latipes","8090"
"Tribolium castaneum","7070"
"Trichomonas vaginalis","5722"
"Trichoplax adhaerens","10228"
"Trypanosoma brucei","185431"
"Trypanosomatidae","5654"
"Ustilago maydis","237631"
"Vibrio cholerae","243277"
"Viridiplantae","33090"
"Vitis vinifera","29760"
"Xanthomonas campestris","190485"
"Xenopus tropicalis","8364"
"Yarrowia lipolytica","284591"
"Yersinia pestis","632"
"Zea mays","4577"

We need to handle the blank ones. @dustine32, I retrieved this information using the following query: select organism, taxon_id from organism where classification_version_sid = 24;
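
To see which organisms still need handling, the query result can be split into mapped and blank entries. This is only a sketch: it assumes the result is saved as a CSV with one record per line and the `organism`/`taxon_id` header shown in this comment, and the function name `split_mapped_and_blank` is made up for illustration.

```python
import csv
import io


def split_mapped_and_blank(handle):
    """Return ({organism: 'NCBITaxon:<id>'}, [organisms with no taxon id])."""
    mapped, blank = {}, []
    for row in csv.DictReader(handle):
        taxon_id = row["taxon_id"].strip()
        if taxon_id:
            mapped[row["organism"]] = "NCBITaxon:" + taxon_id
        else:
            blank.append(row["organism"])
    return mapped, blank


# Three rows taken from the listing in this comment.
sample = (
    '"organism","taxon_id"\n'
    '"Homo-Pan",""\n'
    '"Mus musculus","10090"\n'
    '"Theria",""\n'
)
mapped, blank = split_mapped_and_blank(io.StringIO(sample))
print(mapped)  # {'Mus musculus': 'NCBITaxon:10090'}
print(blank)   # ['Homo-Pan', 'Theria']
```

The `blank` list is the set of PANTHER-specific nodes that would need either an NCBITaxon mapping or new axioms.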

cmungall commented 5 years ago

Can we keep the csv in github so we can make PRs against it?

I have 1 to fill in I am sure of, and 2 less so

Theria is NCBITaxon:32525

Artiodactyla is Cetartiodactyla minus whales. Do you even have whale protein in panther? Can you map up?

                    is_a NCBITaxon:91561 ! Cetartiodactyla [SYNONYM: "even-toed ungulates" (related)] [SYNONYM: "whales, hippos, ruminants, pigs, camels etc." (exact)]
                     is_a NCBITaxon:35497 ! Suina ***  [SYNONYM: "Artiodactyla" (related)] [SYNONYM: "Suiformes" (related)]
                     is_a NCBITaxon:948947 ! unclassified Cetartiodactyla [SYNONYM: "unclassified Artiodactyla" (related)]
                     is_a NCBITaxon:9721 ! Cetacea [SYNONYM: "whale" (exact)] [SYNONYM: "whales" (exact)] [SYNONYM: "whales & dolphins" (related)] [SYNONYM: "whales, dolphins, and porpoises" (exact)]
                     is_a NCBITaxon:9831 ! Hippopotamidae ***  [SYNONYM: "Artiodactyla" (related)] [SYNONYM: "Suiformes" (related)]
                     is_a NCBITaxon:9834 ! Tylopoda ***  [SYNONYM: "Artiodactyla" (related)]
                     is_a NCBITaxon:9845 ! Ruminantia ***  [SYNONYM: "Artiodactyla" (related)]

Osteichthyes is a name given way up at Actinopterygii but also for Dipnoi. @balhoff any ideas?

is_a NCBITaxon:117571 ! Euteleostomi [SYNONYM: "bony vertebrates" (exact)]
is_a NCBITaxon:7898 ! Actinopterygii [SYNONYM: "Actinopterygi" (related)] [SYNONYM: "bony fishes" (related)] [SYNONYM: "fish" (exact)] [SYNONYM: "fishes" (exact)] [SYNONYM: "Osteichthyes" (related)] [SYNONYM: "ray-finned fishes" (exact)]
is_a NCBITaxon:8287 ! Sarcopterygii
is_a NCBITaxon:118072 ! Coelacanthimorpha [SYNONYM: "Actinistia" (related)] [SYNONYM: "Choanichthyes" (related)] [SYNONYM: "Crossopterygii" (related)] [SYNONYM: "fish" (exact)] [SYNONYM: "fishes" (exact)] [SYNONYM: "lobe-finned fishes" (exact)]
is_a NCBITaxon:7894 ! Coelacanthiformes [SYNONYM: "coelacanths" (related)] [SYNONYM: "lobe-finned fishes" (exact)] [SYNONYM: "Osteichthyes" (related)]
is_a NCBITaxon:1338369 ! Dipnotetrapodomorpha
is_a NCBITaxon:7878 ! Dipnoi *** [SYNONYM: "Choanichthyes" (related)] [SYNONYM: "Dipneusti" (related)] [SYNONYM: "dipnoans" (exact)] [SYNONYM: "Dipnomorpha" (related)] [SYNONYM: "fish" (exact)] [SYNONYM: "fishes" (exact)] [SYNONYM: "lungfishes" (exact)] [SYNONYM: "lungfishes" (related)] [SYNONYM: "Osteichthyes" (related)]

"Archaea-Eukaryota" is a pretty interesting one from an ontophylogenetic perspective!

balhoff commented 5 years ago

For 'Osteichthyes' I would use Euteleostomi:

balhoff commented 5 years ago

I was talking with @dougli1sqrd and @kltm about this. They were interested in combining this list with any other taxon IDs found in GAFs and running one step where a single master table is produced. Is there any reason not to have a few additional taxa in there? Or does this list already include all possible GAF taxa anyway?

cmungall commented 5 years ago

there are many gaf taxa

if we are talking about the non goa_uniprot_all subset, then I would expect this to be covered in the panther/qfo set.

dougli1sqrd commented 5 years ago

I want to thumbs-up @balhoff's TSV proposal above. If we can have gaferencer take in a given set of taxa and an ontology, gaferencer could create the table. Then we can use that table easily enough later, and gaferencer doesn't even need to know about the GAFs, and we can let downstream deal with that.

cmungall commented 5 years ago

Recall that there are two separate pieces of functionality here.

  1. Make the taxon-term table (no GAF required)
  2. Pre-computing inferences on gene/products in a GAF/GPAD (GAF/GPAD required)

Just checking we're on the same page

dougli1sqrd commented 5 years ago

Maybe we should discuss in person tomorrow, but it makes a little more sense (maybe?) if ontobio has just the Term X Taxon table, and then as ontobio validates each line, it'll input Term X Taxon and decide if the taxon is valid. Otherwise, we'll have to move the pipeline around somewhat, as gaferencer currently requires the gafs to be downloaded. But ontobio validation does the downloading of source gafs. So we'd have to break out the downloading of source gafs elsewhere, let gaferencer run over all the gafs, then let ontobio run over all the gafs again. That seems less ideal. But also maybe I'm missing some context?

I mean, I know we also want gaferencer to perform deepening and such, so maybe that's the only way to do it? But if we could do the deepening later, after ontobio has already parsed, maybe that's better? Unsure.
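
The per-line check described above could look roughly like the sketch below. It is not ontobio's API: `check_gaf_line` and the table shape (`{go_term: {taxon_id: bool}}`) are assumptions for illustration. It relies only on the GAF column layout, where column 5 is the GO ID and column 13 is the taxon (`taxon:<id>`, optionally followed by `|taxon:<id>` for a second organism).

```python
def check_gaf_line(line, table):
    """Classify one GAF line as 'ok', 'violation', or 'unknown'.

    `table` is assumed to be {go_term: {taxon_id: bool}}, where True
    means the term-taxon combination is satisfiable.
    """
    cols = line.rstrip("\n").split("\t")
    term = cols[4]  # GAF column 5: GO ID
    # GAF column 13 may hold two taxa separated by "|"; only the first
    # (the taxon of the annotated gene product) is checked here.
    first_taxon = cols[12].split("|")[0]
    key = first_taxon.replace("taxon:", "NCBITaxon:", 1)
    allowed = table.get(term, {}).get(key)
    if allowed is None:
        return "unknown"  # term or taxon is missing from the table
    return "ok" if allowed else "violation"


# Made-up table and annotation line, for illustration only.
table = {"GO:0000001": {"NCBITaxon:10090": True, "NCBITaxon:100226": False}}
fields = ["MGI", "MGI:0000000", "Abc1", "", "GO:0000001", "PMID:0000000",
          "IDA", "", "P", "", "", "protein", "taxon:10090", "20190101",
          "MGI", "", ""]
print(check_gaf_line("\t".join(fields), table))  # ok
```

Returning "unknown" separately from "violation" lets a validator decide whether a taxon missing from the table should be reported or passed through.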

dustine32 commented 5 years ago

Hey @balhoff ! Is a current example of the gaferencer-generated taxon-term table available for download somewhere? @mugitty and I are working on using it within the PAINT tool and/or update pipeline. Thanks!

kltm commented 5 years ago

@dustine32 In an ideal world, you might want to pick this up from either the monthly versioned release or the snapshot pipeline.

dustine32 commented 5 years ago

@kltm Right, I think using the monthly release file would be simpler for us to include in our monthly updates.

I'm guessing this hasn't been generated yet for either release or snapshot? Would it likely be under the ontology/extensions folder?

kltm commented 5 years ago

As a non-ontology product, it would likely be under products/. This has not yet been produced.

dustine32 commented 5 years ago

Cool, thanks for the update!

balhoff commented 5 years ago

@dustine32 would you call this done? Or are there any gaferencer changes you need in order to make use of the file?

dustine32 commented 5 years ago

@balhoff Yes! I can close this. I ran gaferencer taxa to generate the taxon term table and had to do a little filling in to accommodate our "made up" ancestor species that don't have taxon IDs. This modified file has been tested and would only be used by the PAINT tool and possibly the PAINT IBA generator so I think your wonderful work is complete. Thanks!

BTW I plan to have this "filling in" code checked-in under fullgo_paint_update since it will run with every monthly PAINT update after downloading the GO pipeline-generated taxon term table (once that's a part of that pipeline).
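
For readers wondering what such "filling in" might involve, here is one possible approach, sketched under assumptions: each PAINT-only ancestor label is mapped by hand to an NCBITaxon ID already in the table, and the label inherits that taxon's flag. This is not the fullgo_paint_update code; `fill_in_ancestors` and the table shape are hypothetical, and the single mapping shown follows the Osteichthyes-to-Euteleostomi suggestion earlier in this thread.

```python
def fill_in_ancestors(table, label_to_taxon):
    """Copy each mapped taxon's flag to the PAINT-only label.

    `table` is assumed to be {go_term: {taxon_id: bool}}; labels whose
    mapped taxon is absent from a row are left out of that row.
    """
    filled = {}
    for term, flags in table.items():
        row = dict(flags)  # do not mutate the input table
        for label, taxon in label_to_taxon.items():
            if taxon in flags:
                row[label] = flags[taxon]
        filled[term] = row
    return filled


# Illustrative values only.
table = {"GO:0000001": {"NCBITaxon:117571": True, "NCBITaxon:100226": False}}
label_to_taxon = {"Osteichthyes": "NCBITaxon:117571"}
filled = fill_in_ancestors(table, label_to_taxon)
print(filled["GO:0000001"]["Osteichthyes"])  # True
```

Nodes such as "Archaea-Eukaryota" that have no obvious NCBITaxon counterpart would need a different rule, which this sketch does not attempt.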