Create more molecular captioning examples

kjappelbaum commented 1 year ago

One dataset I'd also like to add is a translation/description dataset based on Enamine (i.e., a really large version of the molecular captioning task). We could get some basic descriptors of the molecules relatively easily using RDKit. Learning to predict these from the SMILES will require the model to learn at least some basic concepts of how the periodic table and SMILES work. Since Enamine is so huge, we should start writing the pipeline for this soon.
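As a rough illustration of the idea, here is a minimal sketch of how a few RDKit descriptors could be turned into a caption for one SMILES string. The function names, the choice of descriptors, and the caption wording are assumptions made for this example and are not taken from the chemnlp codebase.

```python
def compute_descriptors(smiles: str) -> dict:
    """Compute a handful of basic descriptors for one molecule with RDKit."""
    # Imported lazily so that render_caption below can be used without RDKit.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None when the SMILES cannot be parsed.
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        "molecular weight": round(Descriptors.MolWt(mol), 2),
        "logP": round(Descriptors.MolLogP(mol), 2),
        "hydrogen bond donor count": Descriptors.NumHDonors(mol),
        "hydrogen bond acceptor count": Descriptors.NumHAcceptors(mol),
        "rotatable bond count": Descriptors.NumRotatableBonds(mol),
        "topological polar surface area": round(Descriptors.TPSA(mol), 2),
    }


def render_caption(smiles: str, descriptors: dict) -> str:
    """Render a plain-text caption from a SMILES string and a descriptor dict.

    The sentence template is a hypothetical choice for this sketch.
    """
    listed = ", ".join(f"{name} = {value}" for name, value in descriptors.items())
    return f"The molecule with the SMILES {smiles} has the following properties: {listed}."


if __name__ == "__main__":
    # Ethanol as a small example; requires RDKit to be installed.
    print(render_caption("CCO", compute_descriptors("CCO")))
```

At Enamine scale, the descriptor step would presumably need to be batched and parallelized, and the template varied so the model does not only see one phrasing.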

https://discord.com/channels/850068776544108564/1080848065914753185/1087730198419607592

agitter commented 1 year ago

If the goal is to go big, you could also consider sourcing the molecules from ZINC-22 instead of only Enamine. It has ~37 billion make-on-demand molecules from Enamine, WuXi, and Mcule.

agitter commented 1 year ago

This could be even easier to access through VirtualFlow 2.0: https://www.biorxiv.org/content/10.1101/2023.04.25.537981v1

They have already enumerated Enamine REAL Space and made the library available in PDB, PDBQT, MOL2, SDF, SMILES, SELFIES, and Parquet formats. They also calculated 18 molecular properties:

molecular weight, logP, hydrogen bond donor count, hydrogen bond acceptor count, rotatable bond count, topological polar surface area (TPSA), logS, aromatic ring count, molecular refractivity (MR), formal charge, positive charge count, negative charge count, fsp3, chiral center count, halogen atom count, sulfur atom count, and stereoisomer count

Their enumeration process expands the original 31.5B compounds to 68.7B.
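If the properties are already precomputed, the captioning pipeline could skip descriptor calculation and only template text from table rows. The sketch below assumes the data has been exported to CSV and uses hypothetical column names (`smiles`, `mw`, `logp`, `hbd`); the actual VirtualFlow column names and file layout are not given in this thread and would need to be checked.

```python
import csv
import io
import json

# Hypothetical mapping from column names to readable property names.
COLUMN_LABELS = {
    "mw": "molecular weight",
    "logp": "logP",
    "hbd": "hydrogen bond donor count",
}


def iter_examples(handle, smiles_column="smiles"):
    """Yield one prompt/completion pair per (molecule, property) from a CSV handle."""
    for row in csv.DictReader(handle):
        smiles = row[smiles_column]
        for column, label in COLUMN_LABELS.items():
            value = row.get(column)
            if not value:
                # Skip properties that are missing or empty for this molecule.
                continue
            yield {
                "prompt": f"What is the {label} of the molecule with SMILES {smiles}?",
                "completion": value,
            }


def write_jsonl(examples, out):
    """Write examples to a text handle as JSON Lines, one example per line."""
    for example in examples:
        out.write(json.dumps(example) + "\n")


if __name__ == "__main__":
    sample = "smiles,mw,logp,hbd\nCCO,46.07,-0.00,1\n"
    buffer = io.StringIO()
    write_jsonl(iter_examples(io.StringIO(sample)), buffer)
    print(buffer.getvalue(), end="")
```

Because both functions stream row by row, nothing has to fit in memory at once, which matters for a library with tens of billions of entries.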

OpenBioML / chemnlp

Create more molecular captioning examples #147