ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0
225 stars 147 forks

🐈 Task: REINVENT models reformat output #1246

Open miquelduranfrigola opened 2 months ago

miquelduranfrigola commented 2 months ago

Summary

Some (or all) of the REINVENT models in the Ersilia Model Hub return an unconventional JSON output, mainly because there is an outcome header in the service.py file. We need to return the output in tabular format and fill any missing values with None.
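
As a rough illustration of the tabular target, here is a minimal sketch that pads a ragged set of results with None. It assumes each input molecule yields a variable-length list of generated SMILES; the function name `to_table` and the `smiles_00`-style column names are hypothetical and are not taken from the actual service.py.

```python
from typing import List, Optional, Sequence, Tuple


def to_table(
    results: Sequence[Sequence[str]],
    n_columns: Optional[int] = None,
) -> Tuple[List[str], List[List[Optional[str]]]]:
    """Turn ragged per-input results into fixed-width rows padded with None.

    `results` holds one sequence of generated SMILES per input molecule
    (an assumed shape, not the real REINVENT output).
    """
    # Use the requested width, or fall back to the longest result
    if n_columns is not None:
        width = n_columns
    else:
        width = max((len(r) for r in results), default=0)

    # Hypothetical column names: smiles_00, smiles_01, ...
    header = [f"smiles_{i:02d}" for i in range(width)]

    rows: List[List[Optional[str]]] = []
    for result in results:
        kept: List[Optional[str]] = list(result[:width])  # truncate if too long
        rows.append(kept + [None] * (width - len(kept)))  # pad if too short
    return header, rows


header, rows = to_table([["CCO", "CCN"], ["c1ccccc1"], []])
```

Fixing `n_columns` up front would keep the header identical across runs, at the cost of truncating any input that produces more molecules than expected.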

Also importantly, some of the returned SMILES carry atom labels for some reason. We want to remove this labelling and, ideally, standardise the SMILES and return a unique set (perhaps ordered by Tanimoto similarity).
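
The deduplication and ordering step could look roughly like the sketch below. It makes several assumptions: the SMILES have already been standardised, so equal strings mean equal molecules; similarity is measured against the input molecule (the issue does not say what the reference should be); and fingerprints are represented as plain sets of on-bits. The `toy_fingerprint` helper is a hypothetical stand-in based on character bigrams, used only so the example runs without dependencies. In practice a chemical fingerprint, for example one computed with RDKit, would be passed in instead.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional


def tanimoto(a: FrozenSet, b: FrozenSet) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for fingerprints given as sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def toy_fingerprint(smiles: str) -> FrozenSet[str]:
    """Placeholder fingerprint (character bigrams); NOT chemically meaningful."""
    return frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))


def unique_ordered_by_similarity(
    query: str,
    candidates: Iterable[Optional[str]],
    fingerprint: Callable[[str], FrozenSet] = toy_fingerprint,
) -> List[str]:
    """Drop None and duplicates, then sort by similarity to `query`, highest first."""
    # dict.fromkeys keeps the first occurrence of each SMILES, in order
    unique = list(dict.fromkeys(s for s in candidates if s is not None))
    query_fp = fingerprint(query)
    # sorted() is stable, so ties keep their original relative order
    return sorted(
        unique,
        key=lambda s: tanimoto(query_fp, fingerprint(s)),
        reverse=True,
    )


ordered = unique_ordered_by_similarity(
    "CCO", ["CCN", "CCO", None, "CCN", "c1ccccc1"]
)
```

Passing the fingerprint function as a parameter keeps the ordering logic independent of whichever fingerprint is eventually chosen.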

In summary, we need to work a little bit more on these models to have a more standard output.

Objective(s)

A more standard output (tabular format) for the REINVENT models.

Documentation

Here is how we can remove atom labels and standardise using RDKit and the standardiser library:

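from rdkit import Chem
from standardiser import standardise


def remove_atom_map_labels(smiles):
    # Parse the SMILES; RDKit returns None if the string cannot be parsed
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # An atom map number of 0 means the atom carries no label
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)
    return Chem.MolToSmiles(mol)


def standardise_mol(mol):
    # Return the standardised molecule, or None if standardisation fails
    try:
        return standardise.run(mol)
    except Exception:
        return None

GemmaTuron commented 2 months ago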

Hi @miquelduranfrigola

Did you work on this for the workshop in Ghana? If not, should we?

miquelduranfrigola commented 2 months ago

I worked on this partially and got it working for the workshop. I did not close the issue because we need to test it with every REINVENT model to be 100% sure. What priority should we give it?

GemmaTuron commented 2 months ago

I would do it in the next Chem Sampler sprint. I am tagging it accordingly.