Dierme / latent-gan

An implementation of the Latent GAN as described in the publication [cite]

Encoding new unseen molecules #11

Open manavsingh415 opened 3 years ago

manavsingh415 commented 3 years ago

Hi. When trying to create 512-dimensional vector representations of some new molecules (that the encoder may not have seen during training), I get the following error:

Traceback (most recent call last):
  File "encode.py", line 56, in <module>
    encode(**args)
  File "encode.py", line 35, in encode
    latent = model.transform(model.vectorize(mols_in))
  File "/content/latent-gan/ddc_pub/ddc_v3.py", line 1042, in vectorize
    return self.smilesvec1.transform(mols_test)
  File "/content/latent-gan/molvecgen/vectorizers.py", line 145, in transform
    one_hot[i,j+offset,charidx] = 1
IndexError: index -201 is out of bounds for axis 1 with size 138

I am using the pretrained ChEMBL encoder. Any ideas on how to resolve this? Thanks

muammar commented 2 years ago

Did you find a solution to this?

muammar commented 2 years ago

Because the README explicitly mentions that the token length limit is 128, I decided to use SmilesVectorizer from molvecgen and removed all SMILES whose token vector is longer than that limit.

Suppose your data frame is called data in the example below.

from rdkit import Chem
from tqdm import tqdm
from molvecgen.vectorizers import SmilesVectorizer

TOKEN_LENGTH_LIMIT = 128

# Indices of molecules whose token vector exceeds the limit.
remove = []

for index, row in tqdm(data.iterrows(), total=len(data)):
    mol = Chem.MolFromSmiles(row.SMILES)
    # Fit a vectorizer on this single molecule to get its token length.
    sm_en = SmilesVectorizer(canonical=True, augment=False)
    sm_en.fit([mol], extra_chars=["\\"])

    if sm_en.maxlength > TOKEN_LENGTH_LIMIT:
        remove.append(index)

print(
    f"There are {len(remove)} SMILES with a token length larger than {TOKEN_LENGTH_LIMIT}"
)

# Drop the offending molecules and write out the filtered set.
data.drop(remove, inplace=True)
data.to_csv("preprocessed.csv", index=False, header=False)

And now it worked.


Alternatively, if too many molecules are discarded because their token length is larger than 128, you can retrain the autoencoder.
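To decide between the two options, a minimal sketch like the one below (assuming the same data frame called data and the same per-molecule SmilesVectorizer fit as above) can first count how many molecules exceed the limit:

from rdkit import Chem
from tqdm import tqdm
from molvecgen.vectorizers import SmilesVectorizer

TOKEN_LENGTH_LIMIT = 128
too_long = 0

# Count how many molecules exceed the token length limit before dropping anything.
for index, row in tqdm(data.iterrows(), total=len(data)):
    mol = Chem.MolFromSmiles(row.SMILES)
    if mol is None:
        continue  # skip SMILES that RDKit cannot parse
    sm_en = SmilesVectorizer(canonical=True, augment=False)
    sm_en.fit([mol], extra_chars=["\\"])
    if sm_en.maxlength > TOKEN_LENGTH_LIMIT:
        too_long += 1

print(f"{too_long} of {len(data)} molecules ({too_long / len(data):.1%}) exceed the limit")

If that fraction is small, filtering as above is probably fine; otherwise retraining is the safer option.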

Good luck.