BCCDC-DSI / RADD

Consult at the BCCDC for Mass Spectrometry
MIT License

Use `smiles2vec` to encode the SMILES data for the ML Pipeline #24

Open Afraz496 opened 1 week ago

Afraz496 commented 1 week ago

Investigate some other libraries if need be

Afraz496 commented 3 days ago

The current approach uses the encoder in `smiles2vec`:

import numpy as np


def vectorize_smiles(smiles):
    """
    Vectorize a list of SMILES strings into one-hot encoded representations.

    This function converts a list of SMILES strings into a three-dimensional
    one-hot encoded array, suitable for input into neural network models. It
    automatically determines the maximum SMILES length in the dataset and adds
    two additional positions for start ('!') and end ('E') characters.

    Parameters
    ----------
    smiles : list of str
        List of SMILES strings to be vectorized.

    Returns
    -------
    tuple of np.ndarray
        A tuple containing two numpy arrays:
        - The first array is the one-hot encoded input sequences, excluding the end character.
        - The second array is the one-hot encoded output sequences, excluding the start character.

    Examples
    --------
    >>> smiles = ["CCO", "NCC", "CCCCCCCCCCC"]
    >>> X, Y = vectorize_smiles(smiles)
    >>> X.shape
    (3, 12, 27)
    >>> Y.shape
    (3, 12, 27)

    Note: for this pipeline Y is a return value we do not utilise.
    """
    char_to_int = create_char_to_int(smiles)

    # Determine the maximum SMILES length
    max_smiles_length = max(len(smile) for smile in smiles)
    embed_length = max_smiles_length + 2  # Add 2 for start ('!') and end ('E') characters
    charset_size = len(char_to_int)

    def vectorize(smiles):
        one_hot = np.zeros((len(smiles), embed_length, charset_size), dtype=np.int8)
        for i, smile in enumerate(smiles):
            # encode the start character
            one_hot[i, 0, char_to_int["!"]] = 1
            # encode the rest of the characters
            for j, c in enumerate(smile):
                if c in char_to_int:
                    one_hot[i, j + 1, char_to_int[c]] = 1
                else:
                    one_hot[i, j + 1, char_to_int['UNK']] = 1
            # encode end character
            one_hot[i, len(smile) + 1, char_to_int["E"]] = 1
        # return two, one for input and the other for output
        return one_hot[:, 0:-1, :], one_hot[:, 1:, :]

    return vectorize(smiles)

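The biggest caveat with this approach is that it fits and transforms in one step: the character map and the maximum length are recomputed from whatever list is passed in, so it is currently also re-fitting on the test set instead of only 'transforming' it.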
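A quick way to see the problem is to compute only the shape of `X` that the function above would return for two different inputs. This is a sketch, not the real code: `encoded_shape` is a hypothetical helper, and the charset construction is an assumption about what `create_char_to_int` does (unique characters in the input plus `!`, `E` and `UNK`).

```python
def encoded_shape(smiles):
    # Assumption: the charset is the unique characters of the input
    # plus the start, end and unknown tokens.
    charset = {c for s in smiles for c in s} | {"!", "E", "UNK"}
    # Same length rule as vectorize_smiles: longest string + 2.
    embed_length = max(len(s) for s in smiles) + 2
    # X is one_hot[:, 0:-1, :], so it has embed_length - 1 positions.
    return (len(smiles), embed_length - 1, len(charset))


train = ["CCO", "NCC", "CCCCCCCCCCC"]
test = ["CCS"]

print(encoded_shape(train))  # (3, 12, 6)
print(encoded_shape(test))   # (1, 4, 5)
```

Under that assumption, both the sequence length and the charset size differ between the two calls, so a model trained on the first encoding cannot consume the second.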

We need to find a way to turn this into an encoder that is fit once on the training set and then applied separately to the test set.
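One possible direction is a small fit/transform class in the style of scikit-learn encoders. The sketch below is only an illustration: `SmilesOneHotEncoder` is a hypothetical name and not part of `smiles2vec`, the vocabulary is built here rather than with `create_char_to_int`, and truncating test strings longer than anything seen in training is an assumption that may not suit the pipeline. It returns only `X`, since `Y` is not used.

```python
import numpy as np


class SmilesOneHotEncoder:
    """Hypothetical fit/transform wrapper (not part of smiles2vec)."""

    START, END, UNK = "!", "E", "UNK"

    def fit(self, smiles):
        # Learn the vocabulary and padded length from the training set only.
        chars = sorted({c for s in smiles for c in s} - {self.START, self.END})
        tokens = [self.START, self.END, self.UNK] + chars
        self.char_to_int_ = {t: i for i, t in enumerate(tokens)}
        self.embed_length_ = max(len(s) for s in smiles) + 2
        return self

    def transform(self, smiles):
        # Reuse the fitted vocabulary and length; nothing is recomputed here.
        c2i = self.char_to_int_
        one_hot = np.zeros(
            (len(smiles), self.embed_length_, len(c2i)), dtype=np.int8
        )
        for i, s in enumerate(smiles):
            # Assumption: strings longer than any seen in fit are truncated.
            s = s[: self.embed_length_ - 2]
            one_hot[i, 0, c2i[self.START]] = 1
            for j, c in enumerate(s):
                # Characters not seen during fit map to UNK.
                one_hot[i, j + 1, c2i.get(c, c2i[self.UNK])] = 1
            one_hot[i, len(s) + 1, c2i[self.END]] = 1
        # Same slice as the X returned by vectorize_smiles.
        return one_hot[:, :-1, :]


train = ["CCO", "NCC", "CCCCCCCCCCC"]
test = ["CCS", "C" * 20]

encoder = SmilesOneHotEncoder().fit(train)
X_train = encoder.transform(train)
X_test = encoder.transform(test)
print(X_train.shape, X_test.shape)  # (3, 12, 6) (2, 12, 6)
```

With this split the test set is encoded with the training vocabulary and length, so unseen characters such as `S` fall into the `UNK` column instead of changing the shape.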