KoslickiLab / KEGG_data_extraction

Missing KOs in scraped data #2

Open dkoslicki opened 2 years ago

dkoslicki commented 2 years ago

It appears there are KOs that exist in KEGG that aren't being scraped. For example: K17348. This shows up on their website here, as well as in the hierarchy:

/data/shared_data/KEGG_KO_hierarchy/results$ grep K17348 kegg_ko_edge_df.txt
04147 Exosome   K17348

But even though the website shows associated protein sequences, these are not in the data dump:

/data/shared_data/KEGG_data/genes_dump$ grep -m 1 K17348 kegg_genes.faa

returns nothing.
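The same check can be sketched in Python for many KOs at once. This is a minimal sketch that assumes KO IDs (a `K` followed by five digits) appear as separate tokens on the FASTA header lines of kegg_genes.faa; the header layout is not shown in this thread, so the function name and the example headers below are hypothetical.

```python
import re
from typing import Iterable, Set

# Assumption: a KO ID is "K" plus five digits, delimited by non-word characters.
KO_PATTERN = re.compile(r"\bK\d{5}\b")


def kos_in_fasta_headers(lines: Iterable[str]) -> Set[str]:
    """Collect every KO-like ID that appears on a FASTA header line."""
    found: Set[str] = set()
    for line in lines:
        if line.startswith(">"):  # only header lines carry annotations
            found.update(KO_PATTERN.findall(line))
    return found


# Made-up headers, for illustration only.
example = [">gene1|K00001|hypothetical description", "MKVLAA", ">gene2 K00002"]
present = kos_in_fasta_headers(example)
print("K17348" in present)
```

With the real file, passing the open file handle (`open("kegg_genes.faa")`) as `lines` streams it one line at a time, which matters for a large dump.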

chunyuma commented 2 years ago

@dkoslicki, sorry for the confusion. K17348 is missing from the data dump because none of the organisms associated with this KO belong to 'Archaea', 'Bacteria', or 'Fungi'. When I downloaded the gene data, I only considered these three categories. If you look at the gene section on the website again, you will find that all of its genes come from animals: vertebrates, ascidians, arthropods, mollusks, or flatworms (please also see the list below).

array([['hsa', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ptr', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pps', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ggo', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pon', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['nle', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mcc', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mcf', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['csab', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['caty', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['panu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['rro', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['rbb', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['tfn', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pteh', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['cjc', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['sbq', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['csyr', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mmur', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['oga', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mmu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mcal', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mpah', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['rno', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mcoc', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mun', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['cge', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pleu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ngi', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['hgl', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['cpoc', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ccan', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['dord', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['dsp', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ocu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['opi', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['tup', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['cfa', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['vvp', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['vlg', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['aml', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['umr', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['uah', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['uar', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['oro', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['elk', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mpuf', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['eju', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['zca', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mlx', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['fca', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pyu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pbg', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ptg', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ppad', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['aju', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['hhv', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['bta', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['bom', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['biu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['bbub', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['chx', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['oas', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['oda', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ccad', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ssc', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['cfr', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['cbai', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['cdk', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['bacu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['lve', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['oor', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['dle', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pcad', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['psiu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ecb', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['epz', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['eai', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['myb', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['myd', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mmyo', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mlf', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mna', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pkl', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['hai', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['dro', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['shon', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ajm', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pdic', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['phas', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mmf', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['rfq', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pale', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pgig', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pvp', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['ray', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mjv', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['tod', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['sara', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['lav', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['tmu', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['dnm', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['mdo', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['shr', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['pcw', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['oaa', 'Eukaryotes;Animals;Vertebrates;Mammals'],
       ['gga', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['pcoc', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['mgp', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['cjo', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['nmel', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['apla', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['acyg', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['aful', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['tgu', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['lsr', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['scan', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['pmoa', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['otc', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['pruf', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['gfr', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['fab', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['phi', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['pmaj', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['ccae', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['ccw', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['etl', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['zab', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['fpg', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['fch', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['clv', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['egz', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['nni', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['acun', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['tala', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['padl', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['achc', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['aam', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['arow', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['npd', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['dne', 'Eukaryotes;Animals;Vertebrates;Birds'],
       ['asn', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['amj', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['cpoo', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['ggn', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['pss', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['cmy', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['cpic', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['tst', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['cabi', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['mrv', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['acs', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['pvt', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['sund', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['pbi', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['pmur', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['tsr', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['pgut', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['vko', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['pmua', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['zvi', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['gja', 'Eukaryotes;Animals;Vertebrates;Reptiles'],
       ['xla', 'Eukaryotes;Animals;Vertebrates;Amphibians'],
       ['xtr', 'Eukaryotes;Animals;Vertebrates;Amphibians'],
       ['npr', 'Eukaryotes;Animals;Vertebrates;Amphibians'],
       ['rtem', 'Eukaryotes;Animals;Vertebrates;Amphibians'],
       ['bbuf', 'Eukaryotes;Animals;Vertebrates;Amphibians'],
       ['dre', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['srx', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['sanh', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['sgh', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ccar', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['caua', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ipu', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['phyp', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['smeo', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['amex', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['eee', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['tru', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['lco', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ncc', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['cgob', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ely', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['plep', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['sluc', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ecra', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['pflv', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['gat', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ppug', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['msam', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['cud', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['mze', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['onl', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['oau', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ola', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['oml', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['xma', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['xco', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['xhe', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['pret', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['pfor', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['plai', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['pmei', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['gaf', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['cvg', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ctul', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['nfu', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['kmr', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['alim', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['aoce', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['csem', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['pov', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ssen', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['hhip', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['lcf', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['sdu', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['slal', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['xgl', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['hcq', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['bpec', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['malb', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['sasa', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['otw', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['omy', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['ogo', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['one', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['salp', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['snh', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['els', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['sfm', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['pki', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['aang', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['loc', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['pspa', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['arut', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['lcm', 'Eukaryotes;Animals;Vertebrates;Fishes'],
       ['cmk', 'Eukaryotes;Animals;Vertebrates;Cartilaginous fishes'],
       ['rtp', 'Eukaryotes;Animals;Vertebrates;Cartilaginous fishes'],
       ['sclv', 'Eukaryotes;Animals;Ascidians'],
       ['ccin', 'Eukaryotes;Animals;Arthropods;Insects'],
       ['otu', 'Eukaryotes;Animals;Arthropods;Insects'],
       ['dsv', 'Eukaryotes;Animals;Arthropods;Chelicerates'],
       ['rsan', 'Eukaryotes;Animals;Arthropods;Chelicerates'],
       ['rmp', 'Eukaryotes;Animals;Arthropods;Chelicerates'],
       ['tut', 'Eukaryotes;Animals;Arthropods;Chelicerates'],
       ['pcan', 'Eukaryotes;Animals;Mollusks'],
       ['bgt', 'Eukaryotes;Animals;Mollusks'],
       ['hrf', 'Eukaryotes;Animals;Mollusks'],
       ['crg', 'Eukaryotes;Animals;Mollusks'],
       ['egl', 'Eukaryotes;Animals;Flatworms']], dtype=object)
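The category check described above could look roughly like the following. It is a sketch, not the code used for the download: it assumes each organism comes as a (code, lineage) pair with a ';'-separated lineage string as in the list above, and that 'Archaea', 'Bacteria', and 'Fungi' appear as one full level of that string. The function names and the fungal example row are hypothetical.

```python
TARGET_GROUPS = ("Archaea", "Bacteria", "Fungi")


def is_target_organism(lineage, targets=TARGET_GROUPS):
    """True if any level of a ';'-separated lineage equals one of the targets."""
    levels = [level.strip() for level in lineage.split(";")]
    return any(level in targets for level in levels)


def ko_has_target_organism(organisms, targets=TARGET_GROUPS):
    """True if at least one (code, lineage) pair belongs to a target group."""
    return any(is_target_organism(lineage, targets) for _code, lineage in organisms)


# Two rows copied from the list above for K17348.
k17348_rows = [
    ("hsa", "Eukaryotes;Animals;Vertebrates;Mammals"),
    ("egl", "Eukaryotes;Animals;Flatworms"),
]
print(ko_has_target_organism(k17348_rows))

# Assumed lineage format for a fungal organism, for illustration only.
fungal_rows = [("sce", "Eukaryotes;Fungi;Ascomycetes;Saccharomycetes")]
print(ko_has_target_organism(fungal_rows))
```

Matching whole lineage levels rather than substrings avoids accidental hits on names that merely contain one of the target words.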

dkoslicki commented 2 years ago

Ah, that makes sense. So @raquellewei can remove from the hierarchy those KOs that don't show up in kegg_genes.faa, correct?

chunyuma commented 2 years ago

Correct. I can give @raquellewei a list of KOs in which each KO is associated with at least one organism belonging to 'Archaea', 'Bacteria', or 'Fungi'. Alternatively, she can directly remove the KOs that don't show up in kegg_genes.faa.

@raquellewei, please let me know which way you prefer.

dkoslicki commented 2 years ago

Go ahead and remove the KOs that don't show up in kegg_genes.faa (i.e., don't include Eukaryotes and the like in the edge list).
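A possible shape for that filtering step, as a sketch only: it assumes each line of kegg_ko_edge_df.txt is whitespace- or tab-separated with the child node in the last field (as in the grep output earlier in this issue), and that `kept_kos` is the set of KO IDs found in kegg_genes.faa, however that set is obtained. The function name and example rows other than the K17348 one are hypothetical.

```python
import re

# Assumption: a KO ID is exactly "K" followed by five digits.
KO_ID = re.compile(r"K\d{5}")


def filter_edges(edge_lines, kept_kos):
    """Drop edges whose last field is a KO ID that is not in kept_kos."""
    kept = []
    for line in edge_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        child = fields[-1]
        if KO_ID.fullmatch(child) and child not in kept_kos:
            continue  # KO has no sequences in the dump, so drop the edge
        kept.append(line)
    return kept


edges = [
    "04147 Exosome\tK17348",        # from this issue
    "04147 Exosome\tK00001",        # made-up row
    "09180 Example\t04147 Exosome", # made-up non-KO edge
]
filtered = filter_edges(edges, kept_kos={"K00001"})
print(len(filtered))
```

Edges whose child is not a KO ID are left untouched, so a parent node that loses all of its KOs stays in the output; pruning such empty branches would be a separate pass.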