DeepGraphLearning / GearNet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)
MIT License

Inference on PDB file by conversion into torchdrug.data.PackedProtein or torchdrug.data.Protein #55

Closed: gtamer2 closed this issue 7 months ago

gtamer2 commented 8 months ago

Hello,

I am getting errors that are blocking me from running GearNet inference on an input PDB file.

First, I loaded a PDB file into a torchdrug.data.Protein structure. Second, I followed the GearNet graph construction laid out in TorchProtein tutorial 3 (Structure-based Protein Property Prediction) and encapsulated the graph construction logic in a function.
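For reference, my graph construction model is essentially the one from that tutorial (reproduced roughly from memory, so the exact parameters may differ):

from torchdrug import layers
from torchdrug.layers import geometry

# GearNet-style graph construction per TorchProtein tutorial 3:
# alpha-carbon nodes plus sequential, spatial, and KNN edges
graph_construction_model = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],
    edge_layers=[geometry.SequentialEdge(max_distance=2),
                 geometry.SpatialEdge(radius=10.0, min_distance=5),
                 geometry.KNNEdge(k=10, min_distance=5)],
    edge_feature="gearnet")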

However, when running these two steps:

protein_graph = torchdrug.data.Protein.from_pdb(path_to_pdb_file)
gearnet_protein_graph = graph_construction_model(protein_graph)

I get the following error:

  File "<path>/torchdrug/layers/geometry/function.py", line 171, in forward
    is_node_in = graph.atom2residue >= (graph.num_cum_residues - graph.num_residues)[graph.atom2graph] - i
AttributeError: 'Protein' object has no attribute 'num_cum_residues'

I studied the source code and found that num_cum_residues is a property of torchdrug.data.PackedProtein but not of torchdrug.data.Protein.
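A quick check confirms this (illustrative snippet):

from torchdrug import data

protein = data.Protein.from_pdb(path_to_pdb_file)
hasattr(protein, "num_cum_residues")  # False, hence the AttributeError above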

So, third, I attempted to convert the Protein into a PackedProtein, with the following code:

protein_graph = Protein.from_pdb(path_to_pdb_file)
num_edges = protein_graph.edge_list.shape[0]
num_residues = protein_graph.residue_type.shape[0]

packed_protein_graph = PackedProtein(edge_list=protein_graph.edge_list,
                                     atom_type=protein_graph.atom_type,
                                     bond_type=protein_graph.bond_type,
                                     residue_type=protein_graph.residue_type,
                                     view=protein_graph.view,
                                     num_edges=[num_edges],
                                     num_residues=torch.tensor(num_residues))

gearnet_protein_graph = self.graph_construction_model(packed_protein_graph)
print("gearnet protein graph: {}".format(gearnet_protein_graph))
return gearnet_protein_graph

However, now I get a different error: ValueError: Expect node attribute `atom_type` to have shape (16344, *), but found torch.Size([16448]). (16344 is, I assume, a node count inferred from the edge_list, since I never passed num_nodes explicitly, while 16448 is the actual length of atom_type.)
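If I read torchdrug.data.Graph correctly, when num_nodes is not supplied it falls back to inferring the node count from the largest index in edge_list, roughly:

# my paraphrase of torchdrug's fallback, not the exact source:
num_node = edge_list[:, :2].max().item() + 1
# atoms that appear in no edge are missed, so this can disagree with len(atom_type)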

Is this the right approach to run inference with GearNet? I downloaded the PDB files directly from https://www.wwpdb.org, so I'd like to think the issue is not in the input data. Thank you in advance for any guidance.

Example PDB files that can't be processed:

gtamer2 commented 8 months ago

I have studied how the pretrain/downstream scripts initialize a dataset, as here: https://github.com/DeepGraphLearning/GearNet/blob/780809836c87c1028b312241215e856d9b0634b2/script/pretrain.py#L65, but from reading the TorchDrug source code, that method is specific to TorchDrug-registered datasets.
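If I had to go that route, it seems I would need to register my own dataset class, something like this untested sketch (MyPDBDataset is a made-up name, and I'm assuming data.ProteinDataset.load_pdbs works the way the TorchDrug docs describe):

from torchdrug import data
from torchdrug.core import Registry as R

@R.register("datasets.MyPDBDataset")  # hypothetical registration key
class MyPDBDataset(data.ProteinDataset):
    def __init__(self, pdb_files, transform=None, **kwargs):
        # parse local PDB files into Protein objects
        self.load_pdbs(pdb_files, transform=transform, **kwargs)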

gtamer2 commented 8 months ago

Is the solution to load the PDB files as HDF5 files, as gearnet/dataset.py does here: https://github.com/DeepGraphLearning/GearNet/blob/780809836c87c1028b312241215e856d9b0634b2/gearnet/dataset.py#L44, and to pass the GearNet graph transformation in as a parameter here: https://github.com/DeepGraphLearning/GearNet/blob/780809836c87c1028b312241215e856d9b0634b2/gearnet/dataset.py#L94?

When I try this, I get OSError: Unable to open file (file signature not found), and I'm not sure how to convert a PDB file to HDF5 format.
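My guess is that this OSError just means the input is not HDF5 at all; h5py raises it for any file that lacks the HDF5 signature, e.g. (illustrative, some_structure.pdb is a placeholder):

import h5py

# a raw text PDB file has no HDF5 signature, so opening it fails:
h5py.File("some_structure.pdb", "r")
# OSError: Unable to open file (file signature not found)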

mpedraza98 commented 8 months ago

I have tried with

_protein = data.Protein.pack([protein])   
protein_ = graph_construction_model(_protein)

as described in the tutorials, and had no issues at all.
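End to end, with the tutorial's construction model, that amounts to roughly the following (untested sketch; path_to_pdb_file is a placeholder):

from torchdrug import data

protein = data.Protein.from_pdb(path_to_pdb_file)
# pack() wraps the Protein in a batch-of-one PackedProtein, which carries
# the batch attributes (num_cum_residues etc.) that GraphConstruction expects
packed_protein = data.Protein.pack([protein])
gearnet_graph = graph_construction_model(packed_protein)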

gtamer2 commented 7 months ago

This fixed it for me. Not sure why I missed that option. Thanks!