DeepGraphLearning / GearNet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)
MIT License

Problems using the FC dataset #61

Open Yangqy-16 opened 3 months ago

Yangqy-16 commented 3 months ago

Hello! Thank you for your great work on TorchDrug, GearNet, and ESM-GearNet! Sorry to bother you. I'm trying to extract feature embeddings with GearNet (as discussed in several earlier issues) on the EC, GO, and FC datasets (as provided at https://zenodo.org/records/7593591). Unlike EC and GO, where proteins are provided in PDB format, the proteins in FC are in HDF5 format, so I use your Fold3D class in GearNet (https://github.com/DeepGraphLearning/GearNet/blob/main/gearnet/dataset.py) to preprocess the data. However, when I pass the Protein class into the GearNet network following the TorchDrug instructions, I get the following errors when running on GPU:

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

and then

RuntimeError: Error building extension 'torch_ext':
...

...           ...site-packages/torchdrug/utils/extension/torch_ext.cpp:1:
/usr/include/features.h:424:12: fatal error: sys/cdefs.h: No such file or directory
  424 | #  include <sys/cdefs.h>
      |            ^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.

When running on CPU, I encountered:

NotImplementedError: Could not run 'aten::view' with arguments from the 'SparseCPU' backend

I searched for the cause of these errors online but couldn't resolve them, since they appear to be environment-related. What puzzles me is that I don't hit any of these problems when I use Protein.from_pdb() directly on EC and GO, yet I do on FC, where your Fold3D class also produces data.Protein instances.
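For reference on the CPU error: "Could not run 'aten::view' with arguments from the 'SparseCPU' backend" means a `view` was attempted on a sparse tensor, which PyTorch does not implement. A minimal, generic illustration of the failure mode and the usual workaround (densifying before reshaping) — this is a sketch of the error mechanism in plain PyTorch, not a confirmed fix for Fold3D:

```python
import torch

sparse = torch.eye(4).to_sparse()  # stand-in for a sparse feature tensor
try:
    reshaped = sparse.view(2, 8)  # raises: 'aten::view' not implemented for SparseCPU
except (NotImplementedError, RuntimeError):
    reshaped = sparse.to_dense().view(2, 8)  # densify first, then reshape works
```

Whether the FC pipeline produces such a sparse feature tensor where the EC/GO pipeline does not would need to be checked in the Fold3D preprocessing.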

For reference, my code is as follows:

...
# graph
graph_construction_model = layers.GraphConstruction(node_layers=[geometry.AlphaCarbonNode()], 
                                                    edge_layers=[geometry.SpatialEdge(radius=10.0, min_distance=5),
                                                                 geometry.KNNEdge(k=10, min_distance=5),
                                                                 geometry.SequentialEdge(max_distance=2)],
                                                    edge_feature="gearnet")

# model
gearnet_edge = models.GearNet(input_dim=21, hidden_dims=[512, 512, 512, 512, 512, 512],
                              num_relation=7, edge_input_dim=59, num_angle_bin=8,
                              batch_norm=True, concat_hidden=True, short_cut=True, readout="sum")
pthfile = 'models/mc_gearnet_edge.pth'
net = torch.load(pthfile, map_location=torch.device(device))
#print('torch succesfully load model')
gearnet_edge.load_state_dict(net)
gearnet_edge.eval()
print('successfully load gearnet')

def get_subdataset_rep(pdbs: list, proteins: list, subroot: str):
    for idx in range(0, len(pdbs), bs):  # iterate over batches of size bs
        pdb_batch = pdbs[idx : idx + bs]  # slices clamp automatically, no min() needed
        protein_batch = proteins[idx : idx + bs]
        # protein
        _protein = data.Protein.pack(protein_batch)
        _protein.view = "residue"
        print(_protein)
        final_protein = graph_construction_model(_protein)
        print(final_protein)

        # output
        with torch.no_grad():
            output = gearnet_edge(final_protein, final_protein.node_feature.float(), all_loss=None, metric=None)
        print(output['graph_feature'].shape, output['node_feature'].shape)

        counter = 0
        for i in range(len(final_protein.num_residues)):  # i: protein/graph id in this batch (avoid shadowing the outer idx)
            this_graph_feature = output['graph_feature'][i]
            this_node_feature = output['node_feature'][counter : counter + final_protein.num_residues[i], :]
            print(this_graph_feature.shape, this_node_feature.shape)
            torch.save((this_graph_feature, this_node_feature), f"{subroot}/{os.path.splitext(pdb_batch[i])[0].split('/')[-1]}.pt")
            counter += final_protein.num_residues[i]

        break

# get representations
if args.task not in ['FC', 'fc']:
    for root in roots:
        pdbs = [os.path.join(root, i) for i in os.listdir(root)]

        proteins = []
        for pdb_file in pdbs:
            try:
                protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
                protein.view = "residue"
                proteins.append(protein)
            except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                error_fn = os.path.basename(root) + '_' if args.task in ['EC', 'ec', 'GO', 'go'] else ''
                with open(f"{error_path}/{args.task}_{error_fn}error.txt", "a") as f:
                    f.write(os.path.splitext(pdb_file)[0].split('/')[-1] + '\n')

            if len(proteins) == bs:  # for debug
                break

        subroot = os.path.join(output_dir, root.split('/')[-1]) if args.task in ['EC', 'ec', 'GO', 'go'] else output_dir
        get_subdataset_rep(pdbs, proteins, subroot)

        break
else:
    transform = transforms.Compose([transforms.ProteinView(view='residue')])
    dataset = Fold3D(root, transform=transform)  #, atom_feature=None, bond_feature=None

    split_sets = dataset.split()  # train_set, valid_set, test_fold_set
    print('There are', len(split_sets), 'sets in total.')

    for split_set in split_sets:
        print(split_set.indices)
        this_slice = slice(list(split_set.indices)[0], (list(split_set.indices)[-1] + 1))
        this_pdbs, this_datas = dataset.pdb_files[this_slice], dataset.data[this_slice]
        #for fn, protein in zip(this_pdbs, this_datas):
        #    print(fn, protein)
        #    break
        get_subdataset_rep(this_pdbs, this_datas, os.path.join(output_dir, this_pdbs[0].split('/')[0]))
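One caveat in the FC branch above: building a `slice` from the first and last index assumes each split's indices are contiguous. Indexing by the split's actual indices avoids that assumption. A minimal sketch, with hypothetical plain lists standing in for `dataset.pdb_files` and `dataset.data`:

```python
# Hypothetical stand-ins for dataset.pdb_files / dataset.data and one split's indices
pdb_files = ["train/a.hdf5", "train/b.hdf5", "valid/c.hdf5", "train/d.hdf5"]
proteins = ["prot_a", "prot_b", "prot_c", "prot_d"]
indices = [0, 1, 3]  # indices of one split; not necessarily contiguous

# Select exactly the members of this split, whatever their positions
this_pdbs = [pdb_files[i] for i in indices]
this_datas = [proteins[i] for i in indices]
```

If the splits really are stored contiguously, both approaches agree; this form just doesn't depend on it.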

Is there any way to solve this problem, or is my understanding of TorchDrug wrong? I'm sincerely looking forward to your help. Thank you very much!

Oxer11 commented 3 months ago

Hi, I don't think this is a dataset-specific problem. It seems that you failed to build the torch extension in TorchDrug. Could you check this?

Yangqy-16 commented 3 months ago

Hi! Thank you for your reply! I checked my torch_extension setup based on https://github.com/DeepGraphLearning/torchdrug/issues/8 and https://github.com/DeepGraphLearning/torchdrug/issues/238. I'm sure that torch_ext.cpp is correctly located under torchdrug/utils/extension, and I tried deleting the torch_extensions folder under /home/your_user_name/.cache, but it didn't work.
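A note on the GPU-side build error: `fatal error: sys/cdefs.h: No such file or directory` comes from the C toolchain, not from TorchDrug itself — the glibc development headers that `/usr/include/features.h` pulls in are not visible to the compiler (on Debian/Ubuntu they are provided by libc6-dev; in conda environments this can also come from a mismatched gcc package). A small stdlib sketch to check whether the header exists before rebuilding the extension — the search roots are typical defaults, not guaranteed for every system:

```python
import glob
import os

def find_cdefs(roots=("/usr/include", "/usr/include/*")):
    """Return any locations of sys/cdefs.h visible under the given roots."""
    hits = []
    for root in roots:
        hits += glob.glob(os.path.join(root, "sys", "cdefs.h"))
    return hits

# An empty result suggests the glibc development headers are missing
print(find_cdefs() or "sys/cdefs.h not found")
```

If the header is genuinely absent, clearing the torch_extensions cache cannot help; the headers need to be installed first, and only then is a rebuild worth retrying.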