awslabs / dgl-lifesci

Python package for graph neural networks in chemistry and biology
Apache License 2.0
697 stars 144 forks

The default PretrainAtomFeaturizer does not work for the ClinTox dataset. #169

Open shuix007 opened 2 years ago

shuix007 commented 2 years ago

Hi,

I was trying the script in dgl-lifesci/examples/property_prediction/moleculenet for molecular property prediction and got the following error when running the command:

    python classification.py -d ClinTox -mo gin_supervised_masking

Using backend: pytorch
Directory classification_results already exists.
Processing dgl graphs from scratch...
Traceback (most recent call last):
  File "classification.py", line 186, in <module>
    n_jobs=1 if args['num_workers'] == 0 else args['num_workers'])
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/data/clintox.py", line 109, in __init__
    n_jobs=n_jobs)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/data/csv_dataset.py", line 78, in __init__
    load, log_every, init_mask, n_jobs, error_log)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/data/csv_dataset.py", line 139, in _pre_process
    edge_featurizer=edge_featurizer))
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 375, in smiles_to_bigraph
    canonical_atom_order, explicit_hydrogens, num_virtual_nodes)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 276, in mol_to_bigraph
    canonical_atom_order, explicit_hydrogens, num_virtual_nodes)
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/mol_to_graph.py", line 90, in mol_to_graph
    g.ndata.update(node_featurizer(mol))
  File "/export/scratch/Zeren/conda/lib/python3.7/site-packages/dgllife/utils/featurizers.py", line 1293, in __call__
    self._atomic_number_types.index(atom.GetAtomicNum()),
ValueError: 0 is not in list

It seems that some atoms in the ClinTox dataset return 0 from GetAtomicNum(), which is outside the default atomic_number_types of PretrainAtomFeaturizer(). The problem can be worked around by passing node_featurizer=PretrainAtomFeaturizer(atomic_number_types=list(range(119))) when constructing the ClinTox dataset. However, I am not sure what an atomic number of 0 means.
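
To make the failure and the workaround concrete, here is a minimal pure-Python sketch of the lookup that the traceback points at. It does not call dgllife. It assumes the default type list covers atomic numbers 1 to 118 (which is what the error suggests, since 0 is missing), and atom_type_index is a hypothetical helper used only for illustration.

```python
# Assumption: the default type list holds atomic numbers 1..118, so 0 has no slot.
# This mirrors the list lookup seen in the traceback; it is not the library code.
default_types = list(range(1, 119))

# Workaround list from the issue: atomic numbers 0..118.
extended_types = list(range(119))


def atom_type_index(atomic_num, atomic_number_types):
    """Hypothetical helper: position of an atomic number in the type list."""
    return atomic_number_types.index(atomic_num)


# With the assumed default list, atomic number 0 cannot be looked up.
try:
    atom_type_index(0, default_types)
    lookup_failed = False
except ValueError:
    lookup_failed = True

# With the extended list, 0 gets its own index...
wildcard_index = atom_type_index(0, extended_types)

# ...but every other element moves up by one position (carbon, Z = 6).
carbon_default = atom_type_index(6, default_types)
carbon_extended = atom_type_index(6, extended_types)

print(lookup_failed, wildcard_index, carbon_default, carbon_extended)
```

Under these assumptions the extended list removes the ValueError, but it also shifts the index of every real element by one. If the featurizer output feeds a pretrained embedding table built with the default ordering, that shift may be worth checking before relying on the workaround.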

mufeili commented 2 years ago

I remember that a few SMILES strings contain *, which stands for an arbitrary atom and might get assigned an atomic number of 0. See: