Rose-STL-Lab / LIMO

generative model for drug discovery

Substructure Optimisation: Mols don't maintain substructure #10

Closed MarieOestreich closed 11 months ago

MarieOestreich commented 1 year ago

Hi! I tried to run the optimisation procedure with the code you provided in this issue, using the first molecule and its substructure as you provided here. However, none of the resulting molecules seem to have maintained the substructure. I am attaching the exact code I am running below. Am I overlooking something?

Thanks in advance!

# Third-party imports for this snippet; vae, dm, model, smiles_to_z and
# one_hot_to_smiles are set up as in the LIMO code from the linked issue.
import torch
import selfies as sf
from rdkit import Chem
from tqdm import tqdm

def create_mask(smile, substructure):
  orig_z = smiles_to_z([smile], vae)
  orig_x = torch.exp(vae.decode(orig_z))
  substruct = Chem.MolFromSmiles(substructure)
  selfies = list(sf.split_selfies(sf.encoder(smile)))
  mask = torch.zeros_like(orig_x)
  for i in range(len(selfies)):
    for j in range(len(dm.dataset.idx_to_symbol)):
      # Replace the symbol at position i with symbol j; if the substructure is
      # no longer present, mask this logit so changing it is penalised.
      changed = selfies.copy()
      changed[i] = dm.dataset.idx_to_symbol[j]
      m = Chem.MolFromSmiles(sf.decoder(''.join(changed)))
      if not m.HasSubstructMatch(substruct):
        mask[0][i * len(dm.dataset.idx_to_symbol) + j] = 1
  return mask, orig_z, orig_x

mask, orig_z, orig_x = create_mask(smile='CCCC1=NN(C)C(NC(CN2C(NC3(C2=O)CCC(CC3)C)=O)=O)=C1', substructure='O=CNC1=CC=NN1C')

z = orig_z.clone().detach().requires_grad_(True)
optimizer = torch.optim.Adam([z], lr=0.1)
smiles = []
logps = []
for epoch in tqdm(range(50000)): # 50000
    optimizer.zero_grad()
    x = torch.exp(vae.decode(z))
    # Property objective plus a heavy penalty for moving masked logits (symbol changes that would break the substructure) away from the original.
    loss = model(x) + 1000 * torch.sum(((x - orig_x.clone().detach()) * mask) ** 2)
    loss.backward()
    optimizer.step()
    if epoch % 1000 == 0:
        # x, logp = get_logp(z)
        # logps.append(logp.item())
        smiles.append(one_hot_to_smiles(x))

substruct = Chem.MolFromSmiles('O=CNC1=CC=NN1C')
for s in smiles:
    m = Chem.MolFromSmiles(s)
    print(m.HasSubstructMatch(substruct))
PeterEckmann1 commented 1 year ago

Hi,

I just tried the code, and it isn't working for me either. The problem seems to be that the molecule can't be represented by the latent space of the VAE. I believe the version of vae.pt I provided was different from the version used in the paper, so it makes sense that the latent spaces are slightly different (although this effect should be negligible for the molecular generation parts).
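
For reference, here is a minimal round-trip check (just a sketch, assuming the smiles_to_z, vae, and one_hot_to_smiles helpers from your snippet are in scope) to see whether a molecule is faithfully reconstructed by the uploaded vae.pt:

from rdkit import Chem
import torch

def is_representable(smile):
    # Encode the molecule, decode it back, and compare canonical SMILES.
    z = smiles_to_z([smile], vae)
    recon = one_hot_to_smiles(torch.exp(vae.decode(z)))
    return Chem.CanonSmiles(recon) == Chem.CanonSmiles(smile)

print(is_representable('CCCC1=NN(C)C(NC(CN2C(NC3(C2=O)CCC(CC3)C)=O)=O)=C1'))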

I'm not sure whether your goal is to reproduce the results for exactly these molecules; if it is, I would suggest retraining the VAE (perhaps multiple times, if needed) until the molecule is representable in the latent space. I can write a script for this if you'd like. Otherwise, you could try other molecules and see which ones work with the vae.pt that I uploaded.
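
As a rough way to screen candidates, something like this (reusing the hypothetical is_representable helper above) would show which molecules round-trip cleanly through the uploaded vae.pt:

candidates = [
    'CCCC1=NN(C)C(NC(CN2C(NC3(C2=O)CCC(CC3)C)=O)=O)=C1',
    # add other SMILES you want to test here
]
for smi in candidates:
    print(smi, is_representable(smi))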