Open hypnopump opened 1 year ago
Update: hacked together a script to cross-check data from the Rhea downloadable files, which covers around 90% of the chemical compounds. The ones not covered tend to be fragments of peptides, monomers (of polymers), etc. See: https://gist.github.com/hypnopump/bef3a2e34e810a529f159de015074926
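For anyone following along, the kind of coverage check described above could look roughly like this. It is only a sketch, not the gist's code: the two-column `ChEBI ID <TAB> name` layout, the function names and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def load_id_map(tsv_text):
    """Build a name -> ChEBI ID map from TSV text.

    Assumed (hypothetical) layout: column 0 is the ChEBI ID, column 1 the name.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    id_map = {}
    for row in reader:
        if len(row) >= 2:  # skip blank or malformed lines
            id_map[row[1]] = row[0]
    return id_map


def coverage(compound_names, id_map):
    """Return (fraction of names with an ID, list of names without one)."""
    missing = [name for name in compound_names if name not in id_map]
    total = len(compound_names)
    fraction = (total - len(missing)) / total if total else 0.0
    return fraction, missing


# Illustrative rows only, not taken from the real download files.
sample_tsv = "CHEBI:15377\tH2O\nCHEBI:15378\tH(+)\n"
names = ["H2O", "H(+)", "some peptide fragment"]
fraction, missing = coverage(names, load_id_map(sample_tsv))
```

The `missing` list is what would then need a second source (for example the web pages) to resolve.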
Update: hacked together a scraper with urllib and bs4 which seems to cover the remaining ones. At this point I think it might be worth doing everything with the scraper for consistency; I don't want to have 3 different branches in the parsing workflow depending on whether the IDs are in the downloadable master files or not. Example: (1->3)-alpha-D-glucosyl-[(1->6)-alpha-D-glucosyl](n) from https://www.rhea-db.org/rhea/57036 does not have a matching ChEBI ID in the RDF file here, but it has a ChEBI ID on the webpage.
Will update the gist with a better version relying on web parsing soon.
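A rough sketch of the web-parsing fallback, in case it helps the discussion. This is not the gist's scraper: it swaps bs4 for a plain regex so it stays stdlib-only, and it assumes the IDs show up in the served HTML literally as `CHEBI:<digits>`, which I have not verified. The function names are made up; the URL pattern is the one from the example above.

```python
import re
import urllib.request

# Assumption: ChEBI IDs appear in the page source as "CHEBI:" followed by digits.
CHEBI_RE = re.compile(r"CHEBI:\d+")


def extract_chebi_ids(html):
    """Return the unique ChEBI IDs found in an HTML string, in order of first appearance."""
    return list(dict.fromkeys(CHEBI_RE.findall(html)))


def fetch_reaction_page(rhea_id, timeout=30):
    """Download a Rhea reaction page, e.g. https://www.rhea-db.org/rhea/57036."""
    url = f"https://www.rhea-db.org/rhea/{rhea_id}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Offline usage on a hand-written snippet (no network needed):
snippet = "<a>CHEBI:15377</a> <span>CHEBI:15378</span> <a>CHEBI:15377</a>"
ids = extract_chebi_ids(snippet)
```

Note that a page-wide regex only collects IDs; it does not say which compound each ID belongs to. Pairing names with IDs would need structure-aware parsing (bs4 or similar) against the actual page markup.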
Managed to create the script and parse the whole DB: parsed_rhea.json.zip
It would be nice to add UniProt as a source, especially the following items:
For enzymes, it would be very useful to also have the catalytic activity expressed formally (as SMIRKS), to link the biomolecular knowledge description to a formal language. A source for this could be the Rhea DB (already referenced by UniProt for reactions), whose data is available for download and easily parseable.
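To make the "formal language" idea concrete, here is a minimal sketch of handling a reaction written as reaction SMILES (`reactants>agents>products`), the simpler notation that SMIRKS builds on. It assumes the reactions are already available as such strings; `parse_reaction_smiles` and `Reaction` are hypothetical names, and a real SMIRKS transform would additionally need atom mapping and a cheminformatics toolkit.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of dot-separated components."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got: {rxn!r}")

    def components(side: str) -> List[str]:
        # '.' separates components; an empty side yields an empty list.
        return [c for c in side.split(".") if c]

    return Reaction(*(components(p) for p in parts))


# Esterification of ethanol with acetic acid, written by hand as an example.
rxn = parse_reaction_smiles("CCO.CC(=O)O>>CC(=O)OCC.O")
```

One caveat: splitting on `.` gives SMILES components, which is not always the same as molecules (salts and other multi-part species are written with dots too).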
I'm drafting a proof of concept that will be posted here (or as a PR if people find it useful) soon.