a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Number of nodes and shape of node coordinates differ by 1 #98

Closed: diamondspark closed this issue 2 years ago

diamondspark commented 2 years ago

Hi @a-r-j

I've noticed that the number of nodes generated by the library sometimes differs from the number of node coordinates it generates by exactly one. Do you know why this happens? Here is an example:

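# Imports were not shown in the original post; these paths are assumed.
from graphein.ml.conversion import GraphFormatConvertor
from graphein.protein.config import DSSPConfig, ProteinGraphConfig
from graphein.protein.edges.distance import (
    add_aromatic_sulphur_interactions,
    add_cation_pi_interactions,
    add_hydrogen_bond_interactions,
    add_hydrophobic_interactions,
    add_ionic_interactions,
    add_peptide_bonds,
)
from graphein.protein.features.nodes.amino_acid import (
    expasy_protein_scale,
    meiler_embedding,
)
from graphein.protein.graphs import construct_graph

configs = {
    "granularity": "CA",
    "keep_hets": False,
    "insertions": False,
    "verbose": False,
    "dssp_config": DSSPConfig(),
    "node_metadata_functions": [meiler_embedding, expasy_protein_scale],
    "edge_construction_functions": [
        add_peptide_bonds,
        add_hydrogen_bond_interactions,
        add_ionic_interactions,
        add_aromatic_sulphur_interactions,
        add_hydrophobic_interactions,
        add_cation_pi_interactions,
    ],
}
config = ProteinGraphConfig(**configs)
format_convertor = GraphFormatConvertor("nx", "pyg", verbose="gnn", columns=None)

# Build the graph for PDB entry 1c5y and convert it to a PyG data object.
g = construct_graph(config=config, pdb_code="1c5y")
protdata = format_convertor(g)
print(protdata)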

This gives protdata.num_nodes == 256, but protdata.coords[0] has 257 entries.

Shouldn't these 2 be the same? What am I missing? Thank you!

a-r-j commented 2 years ago

Yep, you're right, @diamondspark. There was a bug in how inserted residues are removed from the dataframe. I've pushed a fix to a pending PR. I still need to write some tests, but I hope to get this merged soon!

There should actually be 238 nodes in the graph as we remove the inserted residues.
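
To make the failure mode concrete, here is a minimal sketch of one plausible way such an off-by-one can arise: the insertion filter is applied to the rows that become nodes but not to the rows that supply coordinates. The Atom records and the drop_insertions helper are hypothetical stand-ins for illustration, not graphein's actual dataframe or code.

```python
from collections import namedtuple

# Hypothetical per-residue records (one CA atom each); not graphein's real schema.
Atom = namedtuple("Atom", "chain resname resnum insertion xyz")

atoms = [
    Atom("A", "SER", 36, "", (0.0, 0.0, 0.0)),
    Atom("A", "GLY", 37, "", (1.0, 0.0, 0.0)),
    Atom("A", "ALA", 37, "A", (2.0, 0.0, 0.0)),  # inserted residue 37A
    Atom("A", "LYS", 38, "", (3.0, 0.0, 0.0)),
]


def drop_insertions(rows):
    """Keep only rows whose insertion code is empty."""
    return [row for row in rows if not row.insertion]


# Inconsistent: nodes come from the filtered rows, coordinates from all rows.
buggy_nodes = [f"{a.chain}:{a.resname}:{a.resnum}" for a in drop_insertions(atoms)]
buggy_coords = [a.xyz for a in atoms]
print(len(buggy_nodes), len(buggy_coords))  # 3 4

# Consistent: filter once, then derive both nodes and coordinates from the result.
filtered = drop_insertions(atoms)
nodes = [f"{a.chain}:{a.resname}:{a.resnum}" for a in filtered]
coords = [a.xyz for a in filtered]
print(len(nodes), len(coords))  # 3 3
```

Filtering once and deriving every per-node quantity from the same filtered table keeps the counts in agreement by construction.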

diamondspark commented 2 years ago

Thank you for looking into this. The fix seems to work only partially. I have the following follow-up concern:

config = ProteinGraphConfig(**configs)
format_convertor = GraphFormatConvertor(
    'nx', 'pyg',
    verbose='all_info',
    columns=['edge_index', 'meiler', 'coords', 'expasy', 'node_id', 'name', 'dist_mat', 'num_nodes'],
)
g = construct_graph(config=config, pdb_code='6OGE')
protdata = format_convertor(g)

This yields different lengths for the per-node features: meiler and expasy have 1483 entries, while node_id and num_nodes report 1487.

Data(edge_index=[2, 2472], node_id=[1487], coords=[1], meiler=[1483], expasy=[1483], name=[1], dist_mat=[1], num_nodes=1487)
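
A check like this can be automated. The sketch below is a hypothetical helper (not part of graphein) that reports every per-node attribute whose length disagrees with num_nodes. It uses a SimpleNamespace stand-in with the lengths from the output above so that it runs without graphein or PyTorch; the assumption is that a real converted object exposes the same attribute names and supports len() on them.

```python
from types import SimpleNamespace


def per_node_length_mismatches(data, fields, num_nodes):
    """Return {field: length} for every field whose length differs from num_nodes."""
    mismatches = {}
    for name in fields:
        length = len(getattr(data, name))
        if length != num_nodes:
            mismatches[name] = length
    return mismatches


# Stand-in object mirroring the lengths reported in the output above.
data = SimpleNamespace(
    node_id=[None] * 1487,
    meiler=[None] * 1483,
    expasy=[None] * 1483,
    num_nodes=1487,
)

bad = per_node_length_mismatches(data, ["node_id", "meiler", "expasy"], data.num_nodes)
print(bad)  # {'meiler': 1483, 'expasy': 1483}
```

Attributes such as coords and dist_mat are left out here because the output shows them wrapped in a length-1 container, so their outer length is not a node count.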



@a-r-j Can you please look into this again? Thank you!

a-r-j commented 2 years ago

Hi, @diamondspark. I'm working on this. It's a tricky problem resulting from insertions and alternate locations (alt_locs) in the PDB files. These aren't always consistently represented in the file, so it's hard to come up with a robust approach that catches all the corner cases.

While I'm figuring it out, you can try pre-processing the PDBs with the excellent PDB Tools.
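
As a rough illustration of what such pre-processing does, here is a plain-Python sketch that works on fixed-width PDB lines: it keeps a single alternate location per atom and drops residues that carry an insertion code. This is a stand-in written for this thread, not PDB Tools itself; the clean_pdb_lines and atom_line names are hypothetical, and keeping only altloc "A" is a deliberate simplification.

```python
def atom_line(serial, name, altloc, resname, chain, resseq, icode, x, y, z):
    """Build a minimal fixed-width ATOM record (columns up to the coordinates)."""
    return "ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f" % (
        serial, name, altloc, resname, chain, resseq, icode, x, y, z
    )


def clean_pdb_lines(lines, keep_altlocs=(" ", "A")):
    """Keep one altloc per atom and drop residues with an insertion code."""
    cleaned = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            altloc = line[16]  # PDB column 17: alternate location indicator
            icode = line[26]   # PDB column 27: insertion code
            if altloc not in keep_altlocs or icode != " ":
                continue
            line = line[:16] + " " + line[17:]  # blank out the altloc label
        cleaned.append(line)
    return cleaned


pdb_lines = [
    atom_line(1, " CA", " ", "SER", "A", 36, " ", 0.0, 0.0, 0.0),
    atom_line(2, " CA", "A", "GLY", "A", 37, " ", 1.0, 0.0, 0.0),  # altloc A: kept
    atom_line(3, " CA", "B", "GLY", "A", 37, " ", 1.1, 0.0, 0.0),  # altloc B: dropped
    atom_line(4, " CA", " ", "ALA", "A", 37, "A", 2.0, 0.0, 0.0),  # residue 37A: dropped
    atom_line(5, " CA", " ", "LYS", "A", 38, " ", 3.0, 0.0, 0.0),
]

cleaned = clean_pdb_lines(pdb_lines)
print([int(line[6:11]) for line in cleaned])  # [1, 2, 5]
```

Dropping insertion-coded residues is only one option; renumbering them so that every residue keeps a unique number is another, and which is appropriate depends on whether the inserted residues matter for the downstream graph.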

a-r-j commented 2 years ago

Hi @diamondspark, I believe this is resolved in 1.1.1. Try it out and do let me know if not!