ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0
202 stars 128 forks source link

🦠 Model Request: Scaffold-Morphing #940

Closed Inyrkz closed 5 months ago

Inyrkz commented 8 months ago

Model Name

Scaffold Morphing

Model Description

The context discusses a novel notation system called Sequential Attachment-based Fragment Embedding (SAFE) that improves upon traditional molecular string representations like SMILES. SAFE reframes SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. This streamlines complex molecular design tasks by facilitating autoregressive generation under various constraints. The effectiveness of SAFE is demonstrated by training a GPT2-like model on a dataset of 1.1 billion SAFE representations that exhibited versatile and robust optimization performance for molecular design.

In scaffold morphing, we wish to replace a scaffold by another one in a molecule. The process requires as input that the user provides either the side chains or the input molecules and the core

Slug

safe-scaffold-morphing

Tag

Compound Generation

Publication

https://arxiv.org/pdf/2310.10773.pdf

Source Code

https://github.com/datamol-io/safe/tree/main

License

CC BY 4.0

Inyrkz commented 8 months ago

The model works locally.

This is the code sample

import safe

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"

# SMILES -> SAFE -> SMILES translation
try:
    ibuprofen_sf = safe.encode(ibuprofen)  # c12ccc3cc1.C3(C)C(=O)O.CC(C)C2
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True)  # CC(C)Cc1ccc(C(C)C(=O)O)cc1
except safe.EncoderError:
    pass
except safe.DecoderError:
    pass

ibuprofen_tokens = list(safe.split(ibuprofen_sf))
print(ibuprofen_tokens)

This is the output

['c', '1', '2', 'c', 'c', 'c', '3', 'c', 'c', '1', '.', 'C', '3', '(', 'C', ')', 'C', '(', '=', 'O', ')', 'O', '.', 'C', 'C', '(', 'C', ')', 'C', '2']
GemmaTuron commented 8 months ago

Hi @Inyrkz !

Good, you can load and run the basic SAFE commands. Datamol is usually very well organised, so hopefully downstream tasks will also be working nicely. This model has many many functionalities, as you can see in the graph they show in their repo. I need you to:

Inyrkz commented 8 months ago

Thank you for sharing the documentation. I'll check out other functions. It's an interesting model

miquelduranfrigola commented 8 months ago

Hi both,

100% agree with @GemmaTuron - let's check their notebooks and try to reproduce one of them. In my opinion, the most interesting is "scaffold morphing" as shown in the tutorial. Let's start by that. If that works, then we can actually create multiple models in the hub, for example: safe-scaffold-morphing, safe-scaffold-decoration, etc. But let's go step by step and start by one case.

To me, the most crucial part now is to reproduce the tutorial to show that the model is not too heavy and take note of performance, as Gemma suggests.

Inyrkz commented 8 months ago

Hi @miquelduranfrigola, @GemmaTuron,

I'll start with the "scaffold morphing." And also take note of the performance of the model.

Inyrkz commented 8 months ago

I've been stuck here (model downloading). My internet connection is poor. I'll try again later tonight

# Load pre-trained model
designer = sf.SAFEDesign.load_default(verbose=True)
designer.model
pytorch_model.bin:   0%|          | 0.00/349M [00:00<?, ?B/s]Error while downloading from https://cdn-lfs-us-1.huggingface.co/repos/87/f1/87f1a96893e9890db129b18dcc1e87833c610379b09e91b9d21277d8242d6205/d243f31837ec16c8d6653b8b18aa6225469fa6cbae6057ccaf93b46af46ca8a8?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1705232429&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNTIzMjQyOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzg3L2YxLzg3ZjFhOTY4OTNlOTg5MGRiMTI5YjE4ZGNjMWU4NzgzM2M2MTAzNzliMDllOTFiOWQyMTI3N2Q4MjQyZDYyMDUvZDI0M2YzMTgzN2VjMTZjOGQ2NjUzYjhiMThhYTYyMjU0NjlmYTZjYmFlNjA1N2NjYWY5M2I0NmFmNDZjYThhOD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=P7Zc5wO4gMBRzwflroJDW%7EKGK-qBU1%7E7PUl5U65t14T9fuvT5c-VSXFTYiEaah1r6hlVdn9YPcCzxa8dhiQTWdcpgtQpVjSwcmp-VLLapsFx2HJQt-QEDDyAamQgZ0d2ezbwnpErg2ObpoVCz0ta4M%7ErBr2SuWdx3IAac-exi4EZ2G40HZDSuis7c7s8Id46LdEg3Qsdb1RbsVOoie752ZoT2mHnamaN9SFG5j%7EQe2OG0OEbS4To%7Eb-cXovbtRJ-7ScBY1keLzlqbGhywArqDryKQRUd2ldZlbzmPGvJXjcEMpNP6R2E9gTLiBSZi-YLGh1Lxxf0taqU0g2YD9ocLA__&Key-Pair-Id=KCD77M1F0VK2B: HTTPSConnectionPool(host='cdn-lfs-us-1.huggingface.co', port=443): Read timed out.
Trying to resume download...
Error while downloading from https://cdn-lfs-us-1.huggingface.co/repos/87/f1/87f1a96893e9890db129b18dcc1e87833c610379b09e91b9d21277d8242d6205/d243f31837ec16c8d6653b8b18aa6225469fa6cbae6057ccaf93b46af46ca8a8?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1705232429&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNTIzMjQyOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzg3L2YxLzg3ZjFhOTY4OTNlOTg5MGRiMTI5YjE4ZGNjMWU4NzgzM2M2MTAzNzliMDllOTFiOWQyMTI3N2Q4MjQyZDYyMDUvZDI0M2YzMTgzN2VjMTZjOGQ2NjUzYjhiMThhYTYyMjU0NjlmYTZjYmFlNjA1N2NjYWY5M2I0NmFmNDZjYThhOD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=P7Zc5wO4gMBRzwflroJDW%7EKGK-qBU1%7E7PUl5U65t14T9fuvT5c-VSXFTYiEaah1r6hlVdn9YPcCzxa8dhiQTWdcpgtQpVjSwcmp-VLLapsFx2HJQt-QEDDyAamQgZ0d2ezbwnpErg2ObpoVCz0ta4M%7ErBr2SuWdx3IAac-exi4EZ2G40HZDSuis7c7s8Id46LdEg3Qsdb1RbsVOoie752ZoT2mHnamaN9SFG5j%7EQe2OG0OEbS4To%7Eb-cXovbtRJ-7ScBY1keLzlqbGhywArqDryKQRUd2ldZlbzmPGvJXjcEMpNP6R2E9gTLiBSZi-YLGh1Lxxf0taqU0g2YD9ocLA__&Key-Pair-Id=KCD77M1F0VK2B: HTTPSConnectionPool(host='cdn-lfs-us-1.huggingface.co', port=443): Read timed out.
Trying to resume download...

Error while downloading from https://cdn-lfs-us-1.huggingface.co/repos/87/f1/87f1a96893e9890db129b18dcc1e87833c610379b09e91b9d21277d8242d6205/d243f31837ec16c8d6653b8b18aa6225469fa6cbae6057ccaf93b46af46ca8a8?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1705232429&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNTIzMjQyOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzg3L2YxLzg3ZjFhOTY4OTNlOTg5MGRiMTI5YjE4ZGNjMWU4NzgzM2M2MTAzNzliMDllOTFiOWQyMTI3N2Q4MjQyZDYyMDUvZDI0M2YzMTgzN2VjMTZjOGQ2NjUzYjhiMThhYTYyMjU0NjlmYTZjYmFlNjA1N2NjYWY5M2I0NmFmNDZjYThhOD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=P7Zc5wO4gMBRzwflroJDW%7EKGK-qBU1%7E7PUl5U65t14T9fuvT5c-VSXFTYiEaah1r6hlVdn9YPcCzxa8dhiQTWdcpgtQpVjSwcmp-VLLapsFx2HJQt-QEDDyAamQgZ0d2ezbwnpErg2ObpoVCz0ta4M%7ErBr2SuWdx3IAac-exi4EZ2G40HZDSuis7c7s8Id46LdEg3Qsdb1RbsVOoie752ZoT2mHnamaN9SFG5j%7EQe2OG0OEbS4To%7Eb-cXovbtRJ-7ScBY1keLzlqbGhywArqDryKQRUd2ldZlbzmPGvJXjcEMpNP6R2E9gTLiBSZi-YLGh1Lxxf0taqU0g2YD9ocLA__&Key-Pair-Id=KCD77M1F0VK2B: HTTPSConnectionPool(host='cdn-lfs-us-1.huggingface.co', port=443): Read timed out.
Trying to resume download...

While waiting, I'll study the documentation.

GemmaTuron commented 8 months ago

Hi Ini, sorry to hear that. Not great because it means the model will be pretty large in the Hub as well. Let me try it in my end see how long it takes

Inyrkz commented 8 months ago

I've been able to download the model. My internet is much better now.

pytorch_model.bin: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 349M/349M [02:51<00:00, 2.03MB/s] 
generation_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 132/132 [00:00<00:00, 3.84kB/s]
tokenizer.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 46.3k/46.3k [00:00<00:00, 159kB/s]
SAFEDoubleHeadsModel(
  (transformer): GPT2Model(
    (wte): Embedding(1880, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
...
Inyrkz commented 8 months ago

This table shows how long it takes to execute the original code.

Tasks Execution Time (seconds)
Load model (first time) 134 (depends on internet speed)
Load model 55
De novo generation 193.1
Scaffold generation 156.1
Scaffold morphing 54.9
Super structure generation 32.9
Motif extension 25.7
Linker generation 79.8
GemmaTuron commented 8 months ago

Ok, so the model is fast but it just heavy so donwloading time will be the issue. That is helpful Ini! Let's focus on the scaffold morphing. As we discussed in the meeting it would be good to play around with it using structures that we know best. @miquelduranfrigola I suggested running a first test with the OSM compounds. The OSM project has been growing in the last years, but basically, there is a chemical series of interest (series 4) with a known core (triazolopyrazine). Could we try to use the scaffold morphing with that core? (in the docs they provide examples with side chains, and mention it can be done with cores instead but no examples) - could you try to look into the code to see how we could do this? And then compare the scaffold morphing to the decorator. As always, we can start working with Notebooks and once we have the results and decide what to exactly incorporate in the hub, we can work on the final model repo structure. I'll approve for the moment so you have a repo to put the code in

GemmaTuron commented 8 months ago

/approve

github-actions[bot] commented 8 months ago

New Model Repository Created! πŸŽ‰

@Inyrkz ersilia model respository has been successfully created and is available at:

πŸ”— ersilia-os/eos8bhe

Next Steps ⭐

Now that your new model respository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

Additional Resources πŸ“š

If you have any questions, please feel free to open an issue and get support from the community!

Inyrkz commented 8 months ago

Yes, the model is heavy to download.

I'll start working with the OSM compounds, and see how the scaffold morphing performs.

Inyrkz commented 8 months ago

Ok, so the model is fast but it just heavy so donwloading time will be the issue. That is helpful Ini! Let's focus on the scaffold morphing. As we discussed in the meeting it would be good to play around with it using structures that we know best. @miquelduranfrigola I suggested running a first test with the OSM compounds. The OSM project has been growing in the last years, but basically, there is a chemical series of interest (series 4) with a known core (triazolopyrazine). Could we try to use the scaffold morphing with that core? (in the docs they provide examples with side chains, and mention it can be done with cores instead but no examples) - could you try to look into the code to see how we could do this? And then compare the scaffold morphing to the decorator. As always, we can start working with Notebooks and once we have the results and decide what to exactly incorporate in the hub, we can work on the final model repo structure. I'll approve for the moment so you have a repo to put the code in

I'm still going through the documentation. I found this.

compute_side_chains(mol, core, label_by_index=False) [ΒΆ](https://safe-docs.datamol.io/stable/api/safe.html#safe.utils.compute_side_chains)
Compute the side chain of a molecule given a core

Finding the side chains

The algorithm to find the side chains from core assumes that the core we get as input has attachment points. Those attachment points are never considered as part of the query, rather they are used to define the attachment points on the side chains. Removing the attachment points from the core is exactly the same as keeping them.

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
core0 = "CC1(C)CN2C(CC2=O)S1"
core1 = "CC1(C)SC2C(-*)C(=O)N2C1-*"
core2 = "CC1N2C(SC1(C)C)C(N)C2=O"
side_chain = compute_side_chain(core=core0, mol=mol)
dm.to_image([side_chain, core0, mol])
Therefore on the above, core0 and core1 are equivalent for the molecule mol, but core2 is not.

Do you think we are to convert the core structure to a side chain?

Inyrkz commented 8 months ago

Also, from this text in the documentation

Scaffold Morphing
In scaffold morphing, we wish to replace a scaffold by another one in a molecule. The process requires as input that the user provides either the side chains or the input molecules and the core

This means we need two inputs, the side chains (or input molecules) and the core. In the example they gave, they didn't input any core. I don't know if they are using any default core. I'm trying to find that part of the documentation (for the pretrained model)

miquelduranfrigola commented 8 months ago

OK this is extremely promising. Thanks @Inyrkz.

I will inspect the model too and get back to you with suggestions. For now, let's focus on placing the code accordingly in the model template structure and, meanwhile, we can come up ideas.

On a related note, as previously discussed, it is likely that we will use this model in different modalities (scaffold morphing, linker, etc.). Therefore, I would change the slug to something like: safe-scaffold-morphing. Probably, in the near future, we'll do safe-linker, safe-de-novo etc. I hope this makes sense.

Inyrkz commented 8 months ago

Ok, so the model is fast but it just heavy so donwloading time will be the issue. That is helpful Ini! Let's focus on the scaffold morphing. As we discussed in the meeting it would be good to play around with it using structures that we know best. @miquelduranfrigola I suggested running a first test with the OSM compounds. The OSM project has been growing in the last years, but basically, there is a chemical series of interest (series 4) with a known core (triazolopyrazine). Could we try to use the scaffold morphing with that core? (in the docs they provide examples with side chains, and mention it can be done with cores instead but no examples) - could you try to look into the code to see how we could do this? And then compare the scaffold morphing to the decorator. As always, we can start working with Notebooks and once we have the results and decide what to exactly incorporate in the hub, we can work on the final model repo structure. I'll approve for the moment so you have a repo to put the code in

I'm still going through the documentation. I found this.

compute_side_chains(mol, core, label_by_index=False) [ΒΆ](https://safe-docs.datamol.io/stable/api/safe.html#safe.utils.compute_side_chains)
Compute the side chain of a molecule given a core

Finding the side chains

The algorithm to find the side chains from core assumes that the core we get as input has attachment points. Those attachment points are never considered as part of the query, rather they are used to define the attachment points on the side chains. Removing the attachment points from the core is exactly the same as keeping them.

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
core0 = "CC1(C)CN2C(CC2=O)S1"
core1 = "CC1(C)SC2C(-*)C(=O)N2C1-*"
core2 = "CC1N2C(SC1(C)C)C(N)C2=O"
side_chain = compute_side_chain(core=core0, mol=mol)
dm.to_image([side_chain, core0, mol])
Therefore on the above, core0 and core1 are equivalent for the molecule mol, but core2 is not.

Do you think we are to convert the core structure to a side chain?

@GemmaTuron @miquelduranfrigola Both scaffold morphing and scaffold decoration require using side chains. To compute the side chains using the compute_side_chains(mol, core, label_by_index=False) function, we need both the SMILES (mol) and the core (core0) structure (core0). I don't know how we are going to extract the core structure from the SAFE after converting the SMILES to SAFE.

My code can't run without an error without resolving this first. I'll push what I've written so far.

GemmaTuron commented 8 months ago

Thanks @Inyrkz for the update. Good work, we will use tomorrow's meeting to discuss how to proceed with this. Push the code so I can have a look and prepare! Thanks

Inyrkz commented 8 months ago

This is the link to the notebook

miquelduranfrigola commented 8 months ago

On a quick look, I think the "core" can be obtained with MurckoScaffolds. But please double-check

Inyrkz commented 8 months ago

I found a documentation here. I'll try this.

Inyrkz commented 8 months ago

This is the function I wrote.

def extract_core_structure(molecule):
    # Convert SMILES to RDKit molecule
    mol = Chem.MolFromSmiles(molecule)

    # Identify the core structure using the Murcko Scaffold
    core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)

    # Convert the core structure back to SMILES
    core_smiles = Chem.MolToSmiles(core)

    return core_smiles

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
core_structure = extract_core_structure(mol)
print("Core Structure:", core_structure)

This is the output: Core Structure: O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

Do you think this core structure is correct? I'll try to visualize it in marvinjs.

GemmaTuron commented 8 months ago

Thanks Ini! think this is still too large, maybe we can try recursively applying the MurckoScaffold to this structure once more see if it trims it down? Otherwise exploring the MolFragments (I've left some code on the other model request issue) can be useful. Finally, we can add the binding sites with RDKIT as well (check the other issue where I left an example)

Inyrkz commented 8 months ago

Alright πŸ‘πŸΌ

I'll check it out and try it.

On Fri, Jan 19, 2024, 10:09 AM gemmaturon @.***> wrote:

Thanks Ini! think this is still too large, maybe we can try recursively applying the MurckoScaffold to this structure once more see if it trims it down? Otherwise exploring the MolFragments (I've left some code on the other model request issue) can be useful. Finally, we can add the binding sites with RDKIT as well (check the other issue where I left an example)

β€” Reply to this email directly, view it on GitHub https://github.com/ersilia-os/ersilia/issues/940#issuecomment-1900025263, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANGRFV5AP5DFR5VJJ7GS7DLYPIZ47AVCNFSM6AAAAABBTZBMRGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBQGAZDKMRWGM . You are receiving this because you were mentioned.Message ID: @.***>

miquelduranfrigola commented 8 months ago

Thanks both. Good stuff. Unfortunately, I don't think recursively using Murcko will further prune the molecule (I agree it is too big). The MolFragments sounds like a reasonable second step. So: Murcko+MolFragments sounds good. One question that will arise is: of all fragments, which one is the core? This question is very difficult to answer, so for now, let's just take the largest fragment generated (for example, the one with the highest numbre of atoms). We may have to explore this manually, but as a starting point I think it makes sense. Opinions?

Inyrkz commented 8 months ago

I tried this code

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(mol)
core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)
fragments = Chem.GetMolFrags(core, asMols=True)
for i in fragments:
    core_smiles = Chem.MolToSmiles(i)
    print(core_smiles)

and I got this output O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

I was expecting more than one value from the GetMolFrags() function.

Inyrkz commented 8 months ago

I tried this code to get the molecule with the highest number of atoms.

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(mol)
core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)
fragments = Chem.GetMolFrags(core, asMols=True)

# Select the largest fragment based on the number of atoms
largest_fragment = max(fragments, key=lambda x: x.GetNumAtoms())

# Convert the largest fragment back to SMILES
core_smiles = Chem.MolToSmiles(largest_fragment)
print(core_smiles)

I got this output O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

I think it's because it's only one fragment that the GetMolFrags() function gives. I'm going to try this on another sample molecule.

Inyrkz commented 8 months ago

Using this molecule mol = "N#Cc1ccc(-c2nnc3cncc(OCCc4ccc(F)c(F)c4)n23)cc1" I get this as the output for the two code cells: c1ccc(CCOc2cncc3nnc(-c4ccccc4)n23)cc1

I'm guessing this is still too large.

Inyrkz commented 8 months ago

This molecule molecule = "FC1=CC=C(CCOC2=CN=CC3=NN=C(N23)C2=CC=C(C(OC3CC3)=C2)C2=CC(OC3CC3)=C(Cl)C=C2)C=C1F" gives this as output c1ccc(CCOc2cncc3nnc(-c4ccc(-c5cccc(OC6CC6)c5)c(OC5CC5)c4)n23)cc1

miquelduranfrigola commented 8 months ago

Thanks @Inyrkz - awesome stuff. This is really going in the right direction. Can you please take a look at this page? https://portal.valencelabs.com/datamol/post/generate-scaffolds-iBUTqU8Im9N2zCM

GemmaTuron commented 8 months ago

Also, @Inyrkz regarding the output of MolFrags:

I think we might not be getting more MolFrags because we are applying it to a MurckScaffold. what if we do the other way around, MolFrags and then MurckoScaffold?

Inyrkz commented 8 months ago

Thanks @Inyrkz - awesome stuff. This is really going in the right direction. Can you please take a look at this page? https://portal.valencelabs.com/datamol/post/generate-scaffolds-iBUTqU8Im9N2zCM

@miquelduranfrigola this looks interesting. I'll go through it.

Inyrkz commented 8 months ago

Also, @Inyrkz regarding the output of MolFrags:

I think we might not be getting more MolFrags because we are applying it to a MurckScaffold. what if we do the other way around, MolFrags and then MurckoScaffold?

@GemmaTuron okay, I'll try this to see the difference.

Inyrkz commented 8 months ago

@GemmaTuron

Using this code

molecule = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(molecule)
fragments = Chem.GetMolFrags(mol, asMols=True)

for i in fragments:
    print("Fragment:",  Chem.MolToSmiles(i))
    core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(i)
    print("MurckoScaffold:", Chem.MolToSmiles(core))

This is the result I get

Fragment: Cc1onc(-c2ccccc2Cl)c1C(=O)NC1C(=O)N2C1SC(C)(C)C2C(=O)O
MurckoScaffold: O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

It's still only one fragment generated from the input SMILES. But this time, the fragment and core are different.

Using this molecule molecule = "FC1=CC=C(CCOC2=CN=CC3=NN=C(N23)C2=CC=C(C(OC3CC3)=C2)C2=CC(OC3CC3)=C(Cl)C=C2)C=C1F" we get

Fragment: Fc1ccc(CCOc2cncc3nnc(-c4ccc(-c5ccc(Cl)c(OC6CC6)c5)c(OC5CC5)c4)n23)cc1F
MurckoScaffold: c1ccc(CCOc2cncc3nnc(-c4ccc(-c5cccc(OC6CC6)c5)c(OC5CC5)c4)n23)cc1

It's only the Fluorine atom that is removed on both ends of the fragment.

Inyrkz commented 8 months ago

@miquelduranfrigola

It's a great article. Thanks, it was helpful. The diagram in the article shows how the scaffolds are broken down from level 4 to level 1 to get the main core structure.

They showed two different ways of extracting the scaffold. We can use the RDKit or DataMol package.

Inyrkz commented 8 months ago

I used the code from the article on the previously mentioned molecule. It still gives the same result.

I also tried to apply the code recursively. Theoretically, this should work like they showed in the diagram, but it doesn't. It still gives the same scaffold as the result. It doesn't break it down any further. The same thing applies to other molecules.

The notebook is here.

GemmaTuron commented 8 months ago

I see, it is a challenging issue! I also have another suggestion, to ask the user to pass a subset of molecules from the same family, and then use the FindMCS function from rdkit to find the core (maybe you can try to get a few molecules from the OSM dataset and see what the MCS is for those @Inyrkz ) What do you think @miquelduranfrigola ?

Inyrkz commented 8 months ago

So instead of getting just one molecule and finding the core structure of the molecule, the user will pass a subset of related molecules (because they share a common core structure). Then we use the FindMCS() function to extract the common substructure that represents the core of the chemical series.

Inyrkz commented 8 months ago

This is the notebook for it.

miquelduranfrigola commented 8 months ago

OK, thanks @Inyrkz and @GemmaTuron, this is all very interesting. I would strongly try to use only one molecule for now. In this particular case, I'd rather recommend the user to just pass the core. At the end of the day, if they want to morph a particular core, then why shouldn't they pass the core directly? I think this is reasonable.

So, in practical terms, I would do the following:

So, in case the user already inputs a core or scaffold, it will be OK and, in case they input a molecule with side chains, we'll do our "best effort" to break it up. In my opinion, this is more than reasonable.

Would you agree?

Inyrkz commented 8 months ago

Hi @miquelduranfrigola,

That will be fine if we want the users to input just the core. We'll convert the core they input to a side chain and then convert it to a SAFE before passing it to the model (or convert it to a SAFE first before converting it to a side chain).

If the user inputs a molecule, the DataMol code can only get up to level 4 (which is one large scaffold).

Should I continue with using the scaffold from level 4?

GemmaTuron commented 8 months ago

I would not go with the startegy of large scaffolds as the molecules obtained will not be that significant. We can also try what happens if we pass the core as a SMILES string (without the attachment points * ) and we simply try to create them with RDKIT, so the user only needs to input the core smiles -- would that work?

miquelduranfrigola commented 8 months ago

OK, I think it's all good ideas. Should we talk about it on Thursday?

Inyrkz commented 8 months ago

@GemmaTuron, that could work for scaffold decoration. We just focus on converting the core molecule to a side chain, and pass that to the model.

I don't think it will work for scaffold morphing. The users will have to input both the molecule and the core.

GemmaTuron commented 8 months ago

you are right @Inyrkz .. Get as far as you can with the different options and on Thursday we will discuss them throughly to adopt the best one - the notebooks you are preparing are very useful, we'll discuss them all. If you get to a point where you cannot proceed further, start looking into the Virtual Libraries (Mollib) as this will be our next model to work on

miquelduranfrigola commented 8 months ago

Thanks @GemmaTuron - completely agree with exploring Virtual Libraries (Mollib) meanwhile

GemmaTuron commented 8 months ago

Links: https://pubs.acs.org/doi/10.1021/acs.jcim.0c00296 https://rdkit.org/docs/source/rdkit.Chem.Scaffolds.rdScaffoldNetwork.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080744/ https://www.zbh.uni-hamburg.de/en/forschung/amd/datasets/brics.html

Inyrkz commented 8 months ago

The rdScaffoldNetwork: The Scaffold Network Implementation in RDKit works. It breaks the scaffolds into different levels. The number of scaffolds depends on the molecule size.

This is the notebook implementation. In some molecules, the attachment points are added.

We need to figure out which scaffold will be our core structure, especially for the last code cell in the notebook.

GemmaTuron commented 8 months ago

Thanks Ini, that is very interesting. To start with, I would discard the fragments that do not show attachment points (*) so that at least the core we select already has them. We could then filter out by MW, can you add the MW to the structures you are getting so we can establish a cut-off (probably a high and low, we don't want things that are too small or too large)... and then the question remains of whether we want only 1 core per molecule or we would pass up to three cores per molecule, for example.

Inyrkz commented 8 months ago

@GemmaTuron, Okay. I tried using the FindMCS function to find the maximum common substructure from the generated list, but it didn't work.

I'll try this.