a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Format convertor for molecules #288

Closed: Tigerrr07 closed this issue 1 year ago

Tigerrr07 commented 1 year ago

Is your feature request related to a problem? Please describe.
I constructed a molecule graph with the graphein.molecule module as shown below, but I can't convert it to a PyG graph using GraphFormatConvertor.

import graphein.molecule as gm
from graphein.ml import GraphFormatConvertor

graph = gm.construct_graph(smiles="CC(=O)OC1=CC=CC=C1C(=O)O", config=drug_config)
drug_format_convertor = GraphFormatConvertor('nx', 'pyg', verbose='all_info')

Describe the solution you'd like
I would like a way to convert a molecule graph to a PyG graph, similar to the one that exists for proteins.

Additional context
The documentation should also specify the range of every feature, e.g. for degree: "Degree: the degree (0-5) of this atom."

a-r-j commented 1 year ago

Hi @Tigerrr07 I'll check this out. Could you share your drug_config? 😁

Tigerrr07 commented 1 year ago

Hi @Tigerrr07 I'll check this out. Could you share your drug_config? 😁

Yeah, here is my drug_config:

drug_configs = {
    "node_metadata_functions": [gm.atom_type_one_hot,
                                gm.formal_charge,
                                gm.hybridization,
                                gm.is_aromatic,
                                gm.degree,
                                gm.total_num_h,
                                ],
    "edge_metadata_functions": [gm.add_bond_type,
                                gm.bond_is_aromatic,
                                gm.bond_is_in_ring,
                                gm.bond_is_conjugated,
                                gm.bond_stereo
                                ]
}

a-r-j commented 1 year ago

Hi @Tigerrr07

It looks like the behaviour for verbose="all_info" defaults to being protein-specific.

You can try instead with:

import graphein.molecule as gm
from graphein.ml import GraphFormatConvertor

drug_configs = {
    "node_metadata_functions": [gm.atom_type_one_hot,
                                gm.formal_charge,
                                gm.hybridization,
                                gm.is_aromatic,
                                gm.degree,
                                gm.total_num_h,
                                ],
    "edge_metadata_functions": [gm.add_bond_type,
                                gm.bond_is_aromatic,
                                gm.bond_is_in_ring,
                                gm.bond_is_conjugated,
                                gm.bond_stereo
                                ]
}

config = gm.MoleculeGraphConfig(**drug_configs)
# Node attributes to carry over to the PyG Data object
node_columns = ['atomic_num', 'element', 'rdmol_atom', 'coords', 'atom_type_one_hot', 'formal_charge', 'hybridization', 'is_aromatic', 'degree', 'total_num_h']

graph = gm.construct_graph(smiles="CC(=O)OC1=CC=CC=C1C(=O)O", config=config)
drug_format_convertor = GraphFormatConvertor('nx', 'pyg', columns=node_columns)

p = drug_format_convertor(graph)
print(p)

Which outputs:

Data(node_id=[13], atomic_num=[13], element=[13], rdmol_atom=[13], coords=[13], atom_type_one_hot=[13, 11], formal_charge=[13], hybridization=[13], is_aromatic=[13], degree=[13], total_num_h=[13], num_nodes=13)

Tigerrr07 commented 1 year ago

Thank you! It works. I would also like to know the range of discrete features, like degree and total_num_h, so I can make one-hot features for them.

a-r-j commented 1 year ago

Hmm. How big is your dataset?

If you can fit it in memory you can do:

graphs = [graph, graph, graph]  # your list of NetworkX molecule graphs
max_num_h = 0
# Scan every node in every graph for the largest total_num_h value
for g in graphs:
    for n, d in g.nodes(data=True):
        max_num_h = max(max_num_h, d['total_num_h'])
print(max_num_h)

or

import torch
from torch_geometric.data import Batch

# Batch the converted Data objects and take the max over the concatenated attribute
b = Batch.from_data_list([p, p, p])
torch.max(b.total_num_h)

If it won't fit in memory, you can run the same scan incrementally with a buffer. Otherwise, you could set a sane max and clip higher values. For degree, for instance, if you're working with small organic molecules (and using only bonds as edges), it's very unlikely that you'll see a degree > 4.
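
As a minimal sketch of the clip-then-one-hot approach (plain PyTorch; MAX_DEGREE and the example degree values here are placeholders, not part of graphein):

import torch
import torch.nn.functional as F

# Assumed cap on atom degree; anything larger is clipped into the last bucket
MAX_DEGREE = 4

degrees = torch.tensor([1, 2, 4, 6])     # placeholder per-atom degrees (e.g. taken from p.degree)
clipped = degrees.clamp(max=MAX_DEGREE)  # 6 -> 4
degree_one_hot = F.one_hot(clipped, num_classes=MAX_DEGREE + 1).float()
print(degree_one_hot.shape)              # torch.Size([4, 5])

The same pattern applies to total_num_h with whatever maximum you compute above.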

Tigerrr07 commented 1 year ago

Thank you for that! My dataset won't be too large.