Closed flogrammer closed 11 months ago
I digged a bit through the code and found that the H's are only added in the process function, which is however never called.
As a workaround I found this to work (manually calling preprocess): featurizer(featurizer.preprocess(["[H]OC([H])([H])[H]"])[0][0])
Hi @flogrammer,
Featurizer
classes have a preprocess
method, with an implementation of the recommended preprocessing expected on a dataset given how thefeaturizer
has been initialized. However, as you have noted, this function isn't applied automatically. We made this decision thoughtfully to avoid unwanted behaviors that could arise from modifying user input structures.
Your workaround is correct. And should work without needing to add hydrogens to the SMILES explicitely:
import torch
from molfeat.calc.bond import EdgeMatCalculator
from molfeat.calc.atom import AtomCalculator
from molfeat.trans.graph import AdjGraphTransformer
featurizer = AdjGraphTransformer(atom_featurizer=AtomCalculator(),
bond_featurizer=EdgeMatCalculator(),
explicit_hydrogens=True,
self_loop=False,
canonical_atom_order=True,
dtype=torch.float)
mols = ["CO"]
mols, _ = featurizer.preprocess(mols)
out = featurizer(mols)
out[0][0] # torch.Size([6, 6])
Alternatively since molfeat
supports RDKit
molecule object as input, you can add the hydrogen yourself first.
import datamol as dm
mols = ["CO"]
mols = [dm.to_mol(x, add_hs=True) for x in mols]
mols, _ = featurizer.preprocess(mols)
out = featurizer(mols)
out[0][0].shape # torch.Size([6, 6])
In #70 we are adding clarifications. However, if the community prefers automatic preprocessing, we are open to enforcing that in the next release. Your feedback is valuable to us, and we aim to provide the best user experience possible.
Great, thanks for clarifying and your fast response!
Is there an existing issue for this?
Bug description
When using the argument explicit_hydrogens (according to dgllife documentation) H's should be added to the featurized graph. It seems however, that this argument currently has no effect on the output.
Even if a already explicit SMILES string is passed, such as [H]OC([H])([H])[H], the H's are removed.
How to reproduce the bug
Error messages and logs
No response
Environment
Current environment
``` #- Molfeat version: 0.8.6 #- PyTorch Version: 1.12.0 #- RDKit version: 2022.09.4 ```Additional context
No response