explicit_hydrogens in AdjGraphTransformer has no effect

flogrammer commented 11 months ago

Is there an existing issue for this?

[x] I have searched the existing issues and found nothing

Bug description

When using the argument explicit_hydrogens (according to dgllife documentation) H's should be added to the featurized graph. It seems however, that this argument currently has no effect on the output.

Even if a already explicit SMILES string is passed, such as [H]OC([H])([H])[H], the H's are removed.

How to reproduce the bug

from molfeat.calc.bond import EdgeMatCalculator
from molfeat.calc.atom import AtomCalculator
from molfeat.trans.graph import AdjGraphTransformer

featurizer = AdjGraphTransformer(atom_featurizer=AtomCalculator(),
                                 bond_featurizer=EdgeMatCalculator(),
                                 explicit_hydrogens=True,
                                 self_loop=False,
                                 canonical_atom_order=True,
                                 dtype=torch.float)

featurizer("CO")
# Expected: 6 featurized atoms (4 H's and C, O)
# Actual: 2 featurized atoms

Error messages and logs

No response

Environment

Current environment

``` #- Molfeat version: 0.8.6 #- PyTorch Version: 1.12.0 #- RDKit version: 2022.09.4 ```

Additional context

No response

flogrammer commented 11 months ago

I digged a bit through the code and found that the H's are only added in the process function, which is however never called.

As a workaround I found this to work (manually calling preprocess): featurizer(featurizer.preprocess(["[H]OC([H])([H])[H]"])[0][0])

maclandrol commented 11 months ago

Hi @flogrammer,

Featurizer classes have a preprocess method, with an implementation of the recommended preprocessing expected on a dataset given how thefeaturizer has been initialized. However, as you have noted, this function isn't applied automatically. We made this decision thoughtfully to avoid unwanted behaviors that could arise from modifying user input structures.

Your workaround is correct. And should work without needing to add hydrogens to the SMILES explicitely:

import torch
from molfeat.calc.bond import EdgeMatCalculator
from molfeat.calc.atom import AtomCalculator
from molfeat.trans.graph import AdjGraphTransformer

featurizer = AdjGraphTransformer(atom_featurizer=AtomCalculator(),
                                 bond_featurizer=EdgeMatCalculator(),
                                 explicit_hydrogens=True,
                                 self_loop=False,
                                 canonical_atom_order=True,
                                 dtype=torch.float)

mols = ["CO"]
mols,  _ = featurizer.preprocess(mols)
out = featurizer(mols)
out[0][0]  # torch.Size([6, 6])

Alternatively since molfeat supports RDKit molecule object as input, you can add the hydrogen yourself first.

import datamol as dm
mols = ["CO"]
mols = [dm.to_mol(x, add_hs=True) for x in mols]
mols,  _ = featurizer.preprocess(mols)
out = featurizer(mols) 
out[0][0].shape # torch.Size([6, 6])

In #70 we are adding clarifications. However, if the community prefers automatic preprocessing, we are open to enforcing that in the next release. Your feedback is valuable to us, and we aim to provide the best user experience possible.

flogrammer commented 11 months ago

Great, thanks for clarifying and your fast response!

datamol-io / molfeat