a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License

Cannot pickle after calling graphein.protein.subgraphs.extract_subgraph_from_chains() #135

Closed johnnytam100 closed 2 years ago

johnnytam100 commented 2 years ago

Hi Arian, I am trying the graphein.protein.subgraphs.extract_subgraph_from_chains() you suggested.

After that, I want to pickle the graph.

Before that, pickling the original graph worked fine:

# config
new_funcs = {"keep_hets": False,
             "edge_construction_functions": [add_peptide_bonds,
                                              add_hydrogen_bond_interactions,
                                              add_disulfide_interactions,
                                              add_ionic_interactions,
                                              add_aromatic_interactions,
                                              add_aromatic_sulphur_interactions,
                                              add_cation_pi_interactions],
            }

config = ProteinGraphConfig(**new_funcs)

# construct graph
g = construct_graph(config=config, pdb_code='2vvi', chain_selection="all")

# Dump graph
with open("2vvi.p", 'wb') as f:
    pickle.dump(g, f)

However, after calling graphein.protein.subgraphs.extract_subgraph_from_chains(), pickling stopped working for both the original graph and the subgraph.

# config
new_funcs = {"keep_hets": False,
             "edge_construction_functions": [add_peptide_bonds,
                                              add_hydrogen_bond_interactions,
                                              add_disulfide_interactions,
                                              add_ionic_interactions,
                                              add_aromatic_interactions,
                                              add_aromatic_sulphur_interactions,
                                              add_cation_pi_interactions],
            }

config = ProteinGraphConfig(**new_funcs)

# construct graph
g = construct_graph(config=config, pdb_code='2vvi', chain_selection="all")
s_g = extract_subgraph_from_chains(g, "A")

# Dump graph
with open("2vvi.p", 'wb') as f:
    pickle.dump(g, f)
DEBUG:graphein.protein.graphs:Deprotonating protein. This removes H atoms from the pdb_df dataframe
DEBUG:graphein.protein.graphs:Detected 877 total nodes
INFO:graphein.protein.edges.distance:Found 411 hbond interactions.
INFO:graphein.protein.edges.distance:Found 42 hbond interactions.
INFO:graphein.protein.edges.distance:Found 12 disulfide interactions.
INFO:graphein.protein.edges.distance:Found 955 ionic interactions.
INFO:graphein.protein.edges.distance:Found: 180 aromatic-aromatic interactions
DEBUG:graphein.protein.subgraphs:Found 217 nodes in the chain subgraph.
DEBUG:graphein.protein.subgraphs:Creating subgraph from nodes: ['A:ALA:69', 'A:ALA:160', 'A:ASP:77', 'A:GLY:31', 'A:LYS:138', 'A:ALA:127', 'A:TYR:147', 'A:ILE:157', 'A:LYS:185', 'A:PHE:23', 'A:VAL:148', 'A:ASN:116', 'A:LEU:162', 'A:HIS:217', 'A:GLU:197', 'A:ALA:214', 'A:PHE:79', 'A:SER:39', 'A:ILE:100', 'A:GLN:38', 'A:LEU:163', 'A:PHE:95', 'A:GLU:212', 'A:ASP:55', 'A:TRP:89', 'A:MET:8', 'A:SER:218', 'A:SER:2', 'A:GLY:86', 'A:TYR:211', 'A:LYS:145', 'A:HIS:121', 'A:LYS:5', 'A:LYS:37', 'A:GLY:98', 'A:ASN:11', 'A:LEU:186', 'A:GLY:151', 'A:PHE:114', 'A:PRO:130', 'A:GLU:140', 'A:CYS:171', 'A:PHE:34', 'A:LYS:9', 'A:MET:14', 'A:ALA:3', 'A:PRO:33', 'A:ASN:105', 'A:ASP:26', 'A:ARG:149', 'A:LYS:203', 'A:ILE:56', 'A:PHE:61', 'A:GLU:35', 'A:LEU:137', 'A:TYR:177', 'A:GLU:96', 'A:ARG:170', 'A:LEU:220', 'A:GLY:122', 'A:PRO:72', 'A:ASN:124', 'A:VAL:152', 'A:HIS:201', 'A:THR:30', 'A:LEU:161', 'A:VAL:215', 'A:PHE:83', 'A:LEU:210', 'A:VAL:18', 'A:THR:58', 'A:ASP:112', 'A:ASN:206', 'A:GLU:15', 'A:ASN:128', 'A:ASP:73', 'A:PHE:120', 'A:THR:113', 'A:TYR:189', 'A:PRO:187', 'A:TYR:87', 'A:GLY:99', 'A:TYR:115', 'A:GLN:81', 'A:PRO:84', 'A:ARG:13', 'A:ASP:41', 'A:PRO:51', 'A:HIS:194', 'A:ASP:97', 'A:HIS:190', 'A:ALA:60', 'A:ILE:107', 'A:MET:146', 'A:LEU:12', 'A:LEU:199', 'A:ASP:156', 'A:TYR:205', 'A:CYS:195', 'A:MET:159', 'A:PHE:52', 'A:ASN:17', 'A:GLY:27', 'A:ARG:66', 'A:HIS:213', 'A:ILE:196', 'A:VAL:67', 'A:ASN:65', 'A:ASP:172', 'A:GLY:20', 'A:LYS:182', 'A:GLU:181', 'A:ASP:150', 'A:THR:59', 'A:GLU:90', 'A:GLU:144', 'A:PHE:54', 'A:PHE:125', 'A:TYR:71', 'A:ALA:167', 'A:LYS:32', 'A:GLU:43', 'A:SER:88', 'A:GLY:111', 'A:HIS:21', 'A:ASP:106', 'A:GLY:36', 'A:MET:132', 'A:ASP:202', 'A:ILE:102', 'A:VAL:192', 'A:ARG:119', 'A:VAL:131', 'A:THR:94', 'A:TRP:139', 'A:GLY:16', 'A:SER:92', 'A:TYR:169', 'A:GLY:165', 'A:MET:109', 'A:THR:108', 'A:ASP:204', 'A:GLY:48', 'A:TYR:78', 'A:ARG:174', 'A:GLU:46', 'A:VAL:184', 'A:SER:82', 'A:GLY:183', 'A:PRO:126', 'A:HIS:168', 'A:LYS:134', 'A:SER:142', 'A:ARG:104', 
'A:LEU:93', 'A:ARG:91', 'A:ASP:7', 'A:VAL:24', 'A:LYS:180', 'A:THR:143', 'A:GLY:219', 'A:ALA:53', 'A:THR:136', 'A:PRO:221', 'A:LYS:207', 'A:GLU:70', 'A:LYS:80', 'A:THR:158', 'A:HIS:22', 'A:LEU:191', 'A:LEU:153', 'A:SER:173', 'A:VAL:118', 'A:SER:200', 'A:ILE:10', 'A:CYS:101', 'A:MET:40', 'A:ASP:28', 'A:GLN:133', 'A:LYS:209', 'A:GLY:129', 'A:ALA:216', 'A:ALA:179', 'A:LEU:50', 'A:GLY:188', 'A:LYS:85', 'A:GLN:76', 'A:THR:176', 'A:VAL:44', 'A:PHE:68', 'A:LYS:135', 'A:GLY:155', 'A:LEU:42', 'A:ALA:103', 'A:THR:154', 'A:ILE:4', 'A:VAL:123', 'A:LYS:45', 'A:ILE:75', 'A:GLU:164', 'A:ILE:198', 'A:PRO:49', 'A:VAL:208', 'A:LYS:117', 'A:ASN:19', 'A:GLY:47', 'A:ILE:25', 'A:ASN:166', 'A:LYS:178', 'A:THR:175', 'A:PRO:6', 'A:GLU:110', 'A:PRO:141', 'A:LEU:57', 'A:GLY:29', 'A:ASP:193', 'A:HIS:74'].
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-34-e37cc3cc13f7>](https://localhost:8080/#) in <module>()
     18 # Dump graph
     19 with open(pdb + ".p", 'wb') as f:
---> 20     pickle.dump(g, f)

TypeError: can't pickle generator objects

Do you know why graphein.protein.subgraphs.extract_subgraph_from_chains() changed the behavior of pickle?

Thanks again!

a-r-j commented 2 years ago

Hmm, I think I see what’s going on.

Could you try changing some of the parameters in extract_subgraph_from_chains, particularly update_coords=False?

johnnytam100 commented 2 years ago

Hi Arian, thanks again for the suggestions! With update_coords=False, the pickling succeeded. However, the pickled subgraphs seem to cause problems when fed into the machine learning model. I got this error:

RuntimeError: The expanded size of the tensor (1574) must match the existing size (904) at non-singleton dimension 0. Target sizes: [1574, 8]. Tensor sizes: [904, 1]

May I know what update_coords=False does?

a-r-j commented 2 years ago

Perfect!

So, update_coords recomputes the coordinate array in g.graph["coords"]. Since we are taking a subgraph, we need to remove some of the coordinates from the original. The problem is this line, which is a generator expression (generator objects can't be pickled, which is why the pickling fails). It's an easy fix to switch this to a list comprehension and I will do this today.
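To illustrate the failure mode, here is a minimal sketch that uses only the standard library. It is not graphein's code: the plain dicts stand in for the graph-level attribute dict (g.graph), and the names coords_by_node and can_pickle are hypothetical. It shows that a dict holding a generator cannot be pickled, while the same data held in a list can.

```python
import pickle

# Hypothetical stand-in data: node id -> coordinates. Not real graphein output.
coords_by_node = {"A:ALA:3": (1.0, 2.0, 3.0), "A:ILE:4": (4.0, 5.0, 6.0)}

# Stand-ins for g.graph: one stores a generator, the other a list.
attrs_with_generator = {"coords": (xyz for xyz in coords_by_node.values())}
attrs_with_list = {"coords": [xyz for xyz in coords_by_node.values()]}


def can_pickle(obj):
    """Return True if pickle.dumps succeeds, False if it raises TypeError."""
    try:
        pickle.dumps(obj)
    except TypeError:
        return False
    return True


generator_ok = can_pickle(attrs_with_generator)  # generator objects are not picklable
list_ok = can_pickle(attrs_with_list)            # a list of tuples is picklable

# The list-based version round-trips unchanged.
restored = pickle.loads(pickle.dumps(attrs_with_list))
print(generator_ok, list_ok)
```

This would also explain why pickling the original graph failed: if the generator ends up stored on an object that g references, pickle.dump(g, f) hits it while walking the graph's attributes.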

To fix your problem in the meantime, I think you can do something like:

g.graph["coords"] = np.array([
    d["coords"] for d in g.nodes(data=True)]
)

Run this after loading the pickle.

johnnytam100 commented 2 years ago

I see! Thanks for the workaround! Hmmm, and I got the following error:

graph = pickle.load(f)
graph.graph["coords"] = np.array([d["coords"] for d in graph.nodes(data=True)])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-11-459a4f297166>](https://localhost:8080/#) in <module>()
     32     with open(path, 'rb') as f:  # notice the r instead of w
     33         graph = pickle.load(f)
---> 34         graph.graph["coords"] = np.array([d["coords"] for d in graph.nodes(data=True)])
     35         graph_list.append(graph)
     36 

[<ipython-input-11-459a4f297166>](https://localhost:8080/#) in <listcomp>(.0)
     32     with open(path, 'rb') as f:  # notice the r instead of w
     33         graph = pickle.load(f)
---> 34         graph.graph["coords"] = np.array([d["coords"] for d in graph.nodes(data=True)])
     35         graph_list.append(graph)
     36 

TypeError: tuple indices must be integers or slices, not str

Do you know how to deal with it? Thanks again!!!

a-r-j commented 2 years ago

Yep @johnnytam100, I made a typo. It should be:

g.graph["coords"] = np.array([
    d["coords"] for _, d in g.nodes(data=True)]
)
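For anyone wondering why the first version failed: g.nodes(data=True) yields (node, attribute dict) pairs, so each item has to be unpacked before it is indexed by key. The sketch below uses a plain list of tuples as a stand-in for that node view (an assumption made so the example runs without networkx; the node ids and coordinates are made up).

```python
# Stand-in for what g.nodes(data=True) yields: (node_id, attribute_dict) pairs.
node_view = [
    ("A:ALA:3", {"coords": (1.0, 2.0, 3.0)}),
    ("A:ILE:4", {"coords": (4.0, 5.0, 6.0)}),
]

# Without unpacking, d is the whole tuple, so d["coords"] raises TypeError.
try:
    [d["coords"] for d in node_view]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Unpacking into (_, d) makes d the attribute dict, so the lookup works.
coords = [d["coords"] for _, d in node_view]

print(error_message)
print(coords)
```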

Just running the tests to get these fixes merged in. 1.2.1 should be available tonight :)

a-r-j commented 2 years ago

Should be good to go! 🚀 pip install graphein==1.2.1

johnnytam100 commented 2 years ago

Thank you @a-r-j ! It is fine now with pip install graphein==1.2.1! :)

a-r-j commented 2 years ago

Perfect! Don’t hesitate to open another issue if you find something else that’s not behaving as expected. It’s incredibly helpful :)