OpenBioML / chemnlp

ChemNLP project
MIT License

Adding Uniprot, X-linking to reaction DBs for enzymes #191

Open hypnopump opened 1 year ago

hypnopump commented 1 year ago

It would be nice to add UniProt as a source, especially the following items:

For enzymes, it would be very useful to also have the catalytic activity expressed formally (as SMIRKS), to link the biomolecular knowledge description to a formal language. A source for this could be the Rhea DB (already referenced by UniProt for reactions), whose data is available for download and easily parseable.
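To make the "formal language" idea concrete, here is a minimal sketch of turning reaction participants into a reaction SMILES string. Note that this is plain reaction SMILES (`reactants>>products`), not a full SMIRKS transform, which would additionally need atom mapping. The function name `reaction_smiles` is hypothetical, and the example reaction is a generic ester hydrolysis chosen for illustration, not an entry taken from Rhea or UniProt.

```python
def reaction_smiles(reactants, products):
    """Join participant SMILES into a 'reactants>>products' reaction SMILES string."""
    if not reactants or not products:
        raise ValueError("need at least one reactant and one product")
    # '.' separates molecules on each side, '>>' separates the two sides
    # (the agent slot between the two '>' characters is left empty).
    return ".".join(reactants) + ">>" + ".".join(products)


# Illustrative only: methyl acetate + water -> acetic acid + methanol
rxn = reaction_smiles(["CC(=O)OC", "O"], ["CC(=O)O", "CO"])
print(rxn)
```

A string like this could sit next to the textual catalytic activity description for each enzyme, assuming SMILES are available for every participant.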

I'm drafting a proof of concept that will be posted here (or as a PR if people find it useful) soon.

hypnopump commented 1 year ago

Update: hacked together a script to cross-check data from the Rhea downloadable files, which covers around 90% of the chemical compounds. The ones not covered tend to be fragments of peptides or monomers (of polymers), etc. See: https://gist.github.com/hypnopump/bef3a2e34e810a529f159de015074926
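For readers who want to reproduce this kind of coverage number, a rough sketch of the cross-check could look like the following. This is not the logic from the gist: the function `coverage`, the in-memory lookup table, and the `POLYMER:0000` identifier are all assumptions made for illustration, and real input would come from the Rhea download files.

```python
def coverage(participant_ids, id_to_smiles):
    """Return (fraction of distinct IDs with a structure, sorted list of missing IDs)."""
    unique = set(participant_ids)
    missing = sorted(cid for cid in unique if cid not in id_to_smiles)
    covered = len(unique) - len(missing)
    fraction = covered / len(unique) if unique else 0.0
    return fraction, missing


# Toy lookup table (assumption): compound ID -> SMILES
id_to_smiles = {"CHEBI:15377": "O", "CHEBI:15378": "[H+]"}

# Toy participant list (assumption); the last ID stands in for an unmatched polymer entry
participants = ["CHEBI:15377", "CHEBI:15378", "CHEBI:15377", "POLYMER:0000"]

fraction, missing = coverage(participants, id_to_smiles)
print(f"covered: {fraction:.0%}, missing: {missing}")
```

Listing the missing IDs alongside the fraction makes it easy to see which kinds of compounds fall through.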

hypnopump commented 1 year ago

Update:

Hacked together a scraper with urllib and bs4, which seems to cover the remaining points. At this point I think it might be worth doing everything with the scraper for consistency...

I don't want to have 3 different branches in the parsing workflow depending on whether or not the IDs are present in the downloadable master files... Example: (1->3)-alpha-D-glucosyl-[(1->6)-alpha-D-glucosyl](n) from https://www.rhea-db.org/rhea/57036 does not have a matching ChEBI ID in the RDF file here, but it does have a ChEBI ID on the webpage.

Will update the gist with a better version relying on web parsing soon
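A minimal sketch of the web-parsing idea, for anyone following along. It is not the gist's code: it swaps bs4 for a plain regular expression from the standard library, and it assumes that ChEBI IDs appear in the page HTML as literal `CHEBI:<digits>` strings, which has not been verified against the live site. The helper names `extract_chebi_ids` and `fetch_chebi_ids` are hypothetical; only the URL pattern comes from the example above.

```python
import re
from urllib.request import urlopen

# Assumption: IDs show up in the page source as e.g. "CHEBI:15377"
CHEBI_PATTERN = re.compile(r"CHEBI:\d+")


def extract_chebi_ids(html):
    """Return the distinct ChEBI IDs found in an HTML string, in first-seen order."""
    return list(dict.fromkeys(CHEBI_PATTERN.findall(html)))


def fetch_chebi_ids(rhea_id, timeout=30):
    """Download one Rhea reaction page and extract the ChEBI IDs it mentions."""
    url = f"https://www.rhea-db.org/rhea/{rhea_id}"
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_chebi_ids(html)


# Offline check on a hand-written snippet (not real page content)
sample = '<a href="#">CHEBI:15377</a> <span>CHEBI:15378</span> <a>CHEBI:15377</a>'
ids = extract_chebi_ids(sample)
print(ids)
```

Calling `fetch_chebi_ids(57036)` would hit the page from the example above. Keeping the extraction step separate from the download means it can be tested without network access.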

hypnopump commented 1 year ago

Managed to create the script and parse the whole DB: parsed_rhea.json.zip

chemnlp_rhea_x_uniprot_parsing_share_ipynb.txt