ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

🦠 Model Request: Predict bioactivity against Main Protease of SARS-CoV-2 #1180

Open HarmonySosa opened 1 week ago

HarmonySosa commented 1 week ago

Model Name

Predict bioactivity against Main Protease of SARS-CoV-2

Model Description

MProPred predicts the efficacy of compounds against the main protease of SARS-CoV-2, which is a promising drug target since it processes polyproteins of SARS-CoV-2. This model uses PaDEL-Descriptor to calculate molecular descriptors of compounds. It is based on a dataset of 758 compounds that have inhibition efficacy against the Main Protease, as published in peer-reviewed journals between January, 2020 and August, 2021. Input compounds are compared to compounds in the dataset to measure molecular similarity with MACCS.

Slug

mpro-covid19

Tag

COVID19

Publication

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10289339/

Source Code

https://github.com/Nadimfrds/Mpropred

License

MIT

GemmaTuron commented 1 week ago

Hi @HarmonySosa

Good start, a couple of comments before we approve the request. I think the tags will not work as they are not from the approved list (in GitBook); they are Python-based, so the strings need to match exactly. Slugs also have a word limit; something like mpro-covid19 would be better.

Can you modify those fields before we approve the model?

GemmaTuron commented 4 days ago

/approve

github-actions[bot] commented 4 days ago

New Model Repository Created! 🎉

@HarmonySosa ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos3nn9

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

GemmaTuron commented 4 days ago

Hi @HarmonySosa !

This model uses PaDEL-Descriptor to calculate MACCS fingerprints. In our experience, the PaDEL package is not very well integrated with Python and can cause problems. Can you check whether the MACCS fingerprints we obtain with RDKit (MACCS keys) are the same as the ones we obtain with the MPro predictor?

It should be something like this (double-check the exact function names against the RDKit docs):

from rdkit import Chem
from rdkit.Chem import MACCSkeys
maccskeys = [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smi)) for smi in smiles_list]

This will allow us to modify the calculate descriptors function:

# Molecular descriptor calculator option
def desc_calc():
    # Performs the descriptor calculation
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv"
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove('molecule.smi')

HarmonySosa commented 2 days ago

Hi @GemmaTuron!

Here are the desc_calc and build_model functions using PADEL:

def desc_calc():
    # Performs the descriptor calculation
    bashCommand = "java -Xms2G -Xmx2G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/MACCSFingerprinter.xml -dir ./ -file descriptors_output.csv"
    process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
    output, error = process.communicate()
    os.remove('molecule.smi')

def build_model(input_data):
    # Reads in saved regression model
    load_model = pickle.load(open('Mpro_model.pkl', 'rb')) #  06.25.24 broken up to handle dtype error
    # Apply model to make predictions
    prediction = load_model.predict(input_data)
    st.header('**Prediction results**')
    prediction_output = pd.Series(prediction, name='pIC50')
    molecule_name = pd.Series(load_data[1], name='molecule_name')
    df = pd.concat([molecule_name, prediction_output], axis=1)
    st.write(df)
    st.markdown(filedownload(df), unsafe_allow_html=True)
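A side note on the `desc_calc` above: because only `stdout` is piped, `error` is always `None`, so a failing Java call goes unnoticed. A minimal sketch of a stricter variant follows. The helper name `build_padel_command` and its parameters are hypothetical (not part of the Mpropred code); the flags are copied unchanged from the `bashCommand` string.

```python
import os
import subprocess


def build_padel_command(padel_dir="./PaDEL-Descriptor",
                        output_file="descriptors_output.csv",
                        heap="2G"):
    """Assemble the PaDEL call as an argument list (hypothetical helper).

    The flags are the same ones used in the original bashCommand string.
    """
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", f"{padel_dir}/MACCSFingerprinter.xml",
        "-dir", "./", "-file", output_file,
    ]


def desc_calc(input_file="molecule.smi"):
    # check=True raises CalledProcessError if Java/PaDEL exits with an error;
    # capture_output=True keeps both stdout and stderr for debugging.
    result = subprocess.run(build_padel_command(), check=True,
                            capture_output=True, text=True)
    os.remove(input_file)
    return result.stdout
```

Passing a list avoids the string split, which would break if any path ever contained a space.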

These are the results I get when I run the model with PADEL: PADEL_Results

This is how I have been trying to use RDKit, but I get different results:

from rdkit import Chem
from rdkit.Chem import MACCSkeys
import pandas as pd

# Convert RDKit bit vector to a list of ints

def bitvector_to_list(bitvector):
    return [int(bit) for bit in bitvector]

def calculate_maccs_keys(smiles_list):
    maccs_keys = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            maccs_key = MACCSkeys.GenMACCSKeys(mol)
            maccs_keys.append(bitvector_to_list(maccs_key))
        else:
            maccs_keys.append([0]*167)  # MACCS keys are 167 bits long
    return maccs_keys

def desc_calc(smiles_list, output_file='descriptors_output.csv'):
    # Calculate MACCS fingerprints using RDKit
    maccs_keys = calculate_maccs_keys(smiles_list)

    # Create a DataFrame and name columns appropriately, save to CSV
    df = pd.DataFrame(maccs_keys, columns=[f'MACCSFP{i}' for i in range(167)])
    df.to_csv(output_file, index=False)
    return df

# Model building section
def build_model(input_data):
    # Reads in saved regression model
    load_model = pickle.load(open('Mpro_model.pkl', 'rb')) #  06.25.24 dtype error
    # Apply model to make predictions
    prediction = load_model.predict(input_data)
    st.header('**Prediction results**')
    prediction_output = pd.Series(prediction, name='pIC50')
    molecule_name = pd.Series(load_data[1], name='molecule_name')
    df = pd.concat([molecule_name, prediction_output], axis=1)
    st.write(df)
    st.markdown(filedownload(df), unsafe_allow_html=True)
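Both `build_model` versions depend on Streamlit and on the global `load_data`, which makes them awkward to run outside the web app. A minimal sketch of a standalone prediction step is below. The names `load_model` and `predict_pic50` are hypothetical, and the only assumption about the pickled object is that it exposes a `predict` method, as it does in the code above.

```python
import pickle


def load_model(path="Mpro_model.pkl"):
    # The context manager closes the file handle even if unpickling fails
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_pic50(model, descriptors, molecule_names):
    """Return (molecule_name, predicted pIC50) pairs.

    `descriptors` is whatever the model's predict() accepts, one row per
    molecule; `molecule_names` must have the same length and order.
    """
    predictions = list(model.predict(descriptors))
    if len(predictions) != len(molecule_names):
        raise ValueError("number of predictions and molecule names differ")
    return [(name, float(p)) for name, p in zip(molecule_names, predictions)]
```

Keeping the names as an explicit argument makes it easier to run the same function on the PaDEL and RDKit descriptor tables and compare the outputs row by row.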

These are the results I get when I use RDKit: RDKit_Results

GemmaTuron commented 1 day ago

Hi @HarmonySosa, I am trying to reproduce the results, but with just the molecule_name I cannot get the SMILES. Where did you get the molecules from?

GemmaTuron commented 1 day ago

Hmm, in any case it does seem the RDKit implementation and the PaDEL descriptors differ slightly, which is quite surprising as the MACCS keys are just substructure searches in a way. It must be due to the preprocessing that PaDEL does versus the preprocessing that RDKit does. We can go ahead and use PaDEL in the model; I guess just keeping the folder there should make it work.
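To pin down where the two fingerprints disagree, one thing worth ruling out first is a simple column offset: RDKit's `GenMACCSKeys` returns 167 bits with bit 0 unused, so if the PaDEL output has one fewer fingerprint column (worth checking in the header of descriptors_output.csv), the RDKit vector needs its first bit dropped before comparing. A minimal sketch, with the hypothetical helper `differing_maccs_positions` operating on plain lists of 0/1 values for a single molecule:

```python
def differing_maccs_positions(padel_bits, rdkit_bits):
    """Return 1-based positions (in the PaDEL row) where the bits disagree."""
    padel_bits = [int(b) for b in padel_bits]
    rdkit_bits = [int(b) for b in rdkit_bits]
    # RDKit's bit 0 is a placeholder; drop it when PaDEL has one column fewer
    if len(rdkit_bits) == len(padel_bits) + 1:
        rdkit_bits = rdkit_bits[1:]
    if len(rdkit_bits) != len(padel_bits):
        raise ValueError("fingerprint lengths cannot be aligned")
    return [
        position
        for position, (p, r) in enumerate(zip(padel_bits, rdkit_bits), start=1)
        if p != r
    ]
```

If only a handful of positions differ after alignment, looking up those keys should show whether the difference comes from salt removal, nitro standardization, or the substructure definitions themselves.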