Open Afraz496 opened 1 week ago
The current approach uses the encoder from smiles2vec:
import numpy as np


def vectorize_smiles(smiles):
    """
    Vectorize a list of SMILES strings into one-hot encoded representations.

    This function converts a list of SMILES strings into a three-dimensional
    one-hot encoded array, suitable for input into neural network models. It
    automatically determines the maximum SMILES length in the dataset and adds
    two additional positions for start ('!') and end ('E') characters.

    Parameters
    ----------
    smiles : list of str
        List of SMILES strings to be vectorized.

    Returns
    -------
    tuple of np.ndarray
        A tuple containing two numpy arrays:
        - The first array is the one-hot encoded input sequences, excluding the end character.
        - The second array is the one-hot encoded output sequences, excluding the start character.

    Examples
    --------
    >>> smiles = ["CCO", "NCC", "CCCCCCCCCCC"]
    >>> X, Y = vectorize_smiles(smiles)
    >>> X.shape
    (3, 12, 27)
    >>> Y.shape
    (3, 12, 27)

    Note: for this pipeline Y is a return value we do not utilise.
    """
    # create_char_to_int is not shown in this issue; it is called on the same
    # list that is being encoded.
    char_to_int = create_char_to_int(smiles)

    # Determine the maximum SMILES length
    max_smiles_length = max(len(smile) for smile in smiles)
    embed_length = max_smiles_length + 2  # Add 2 for start ('!') and end ('E') characters
    charset_size = len(char_to_int)

    def vectorize(smiles):
        one_hot = np.zeros((len(smiles), embed_length, charset_size), dtype=np.int8)
        for i, smile in enumerate(smiles):
            # encode the start character
            one_hot[i, 0, char_to_int["!"]] = 1
            # encode the rest of the characters
            for j, c in enumerate(smile):
                if c in char_to_int:
                    one_hot[i, j + 1, char_to_int[c]] = 1
                else:
                    one_hot[i, j + 1, char_to_int['UNK']] = 1
            # encode end character
            one_hot[i, len(smile) + 1, char_to_int["E"]] = 1
        # return two, one for input and the other for output
        # (each slice drops one position, so both have embed_length - 1 steps)
        return one_hot[:, 0:-1, :], one_hot[:, 1:, :]

    return vectorize(smiles)
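The biggest caveat with this approach is that it acts like a modifier rather than a fitted encoder: the character map and the maximum length are derived from whatever list is passed in, so it is currently also 'transforming' the test set on its own terms instead of reusing what was learned from the training set.
We need to find a way to turn this into an encoder that is fitted once and can then be applied separately to the test set.
Investigate other libraries if need be.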
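One possible direction, as a rough sketch rather than a settled design: split the function into a `fit` step that learns the vocabulary and padded length from the training set, and a `transform` step that reuses them on any other split. The class name `SmilesOneHotEncoder`, its attributes, and the way the vocabulary is built are all hypothetical (this does not use `create_char_to_int` or anything from smiles2vec); the `fit`/`transform` naming just borrows the scikit-learn convention. Truncating test strings that are longer than anything seen during fitting is also an assumption and may not be the behaviour we want.

```python
import numpy as np


class SmilesOneHotEncoder:
    """Hypothetical fit/transform wrapper; not part of smiles2vec."""

    START, END, UNK = "!", "E", "UNK"

    def fit(self, smiles):
        # Learn the vocabulary and padded length from the training set only.
        chars = sorted({c for s in smiles for c in s})
        tokens = [self.START, self.END, self.UNK]
        tokens += [c for c in chars if c not in (self.START, self.END)]
        self.char_to_int_ = {t: i for i, t in enumerate(tokens)}
        self.embed_length_ = max(len(s) for s in smiles) + 2
        return self

    def transform(self, smiles):
        # Reuse the fitted vocabulary and length; nothing is re-derived here.
        shape = (len(smiles), self.embed_length_, len(self.char_to_int_))
        one_hot = np.zeros(shape, dtype=np.int8)
        unk = self.char_to_int_[self.UNK]
        for i, s in enumerate(smiles):
            # Assumption: strings longer than any seen in fit() are truncated.
            s = s[: self.embed_length_ - 2]
            one_hot[i, 0, self.char_to_int_[self.START]] = 1
            for j, c in enumerate(s):
                # Characters unseen during fit() fall back to the UNK slot.
                one_hot[i, j + 1, self.char_to_int_.get(c, unk)] = 1
            one_hot[i, len(s) + 1, self.char_to_int_[self.END]] = 1
        # Same input/output split as vectorize_smiles.
        return one_hot[:, :-1, :], one_hot[:, 1:, :]


train = ["CCO", "NCC", "CCCCCCCCCCC"]
test = ["CCN", "CCCl"]  # 'l' never appears in train

encoder = SmilesOneHotEncoder().fit(train)
X_train, _ = encoder.transform(train)
X_test, _ = encoder.transform(test)
```

With this split, `X_train` and `X_test` share the same second and third dimensions because both come from the fitted encoder, and the unseen `l` in the test set lands in the `UNK` slot instead of changing the vocabulary.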