Inquiry on Utilizing Your Trained Model for Molecule Generation in Protein Pockets

gibhdyw commented 2 months ago

Hello, thank you for sharing your excellent work! Could you please guide me on how to use your trained model to generate molecules for my own protein pockets?

Best regards

Layne-Huang commented 2 months ago

Thanks for your interest. Please refer to our readme for the molecule generation for customized proteins.

gibhdyw commented 2 months ago

Thanks for your interest. Please refer to our readme for the molecule generation for customized proteins.

Sorry, I couldn't find your pre-trained model on Zenodo. Could you please provide detailed instructions on how to use your pre-trained model to generate molecules for my protein pocket?

Layne-Huang commented 2 months ago

The 500.pt is the pretrained model.

gibhdyw commented 2 months ago

The 500.pt is the pretrained model.

Thank you very much for your reply. I have another question: if I want to generate new molecules using the protein I provided, what is the appropriate size for the given pocket in angstroms? Is there any difference between the effects of inputting the entire PDB and inputting just a part of the pocket?

gibhdyw commented 2 months ago

The 500.pt is the pretrained model.

When I use my PDB file following your tutorial, it fails to generate any new ligands every time. If I increase num_samples to more than 10, it becomes very slow. How can I solve this problem? Additionally, related to the question above: can a protein downloaded directly from RCSB and preprocessed with other software be used by the model to generate new molecules, or do I need to provide the pocket specifically? Does the pocket need to contain the original ligand? I look forward to your reply.

The terminal output is as follows:

(DDL) PS D:\Master\code\PMDM-main> python -u sample_for_pdb.py --ckpt "D:\Master\code\PMDM-main\500.pt" --pdb_path "D:\Master\code\PMDM-main\pockers.pdb" --num_atom 55 --save_sdf True --num_samples 2 --sampling_type generalized
Entropy of n_nodes: H[N] -1.3862943649291992
[2024-04-25 10:07:17,496::test::INFO] Namespace(pdb_path='D:\Master\code\PMDM-main\pockers.pdb', sdf_path=None, num_atom=55, build_method='reconstruct', config=None, cuda=True, ckpt='D:\Master\code\PMDM-main\500.pt', save_sdf=True, num_samples=2, batch_size=10, resume=None, tag='', clip=1000.0, n_steps=1000, global_start_sigma=inf, w_global_pos=1.0, w_local_pos=1.0, w_global_node=1.0, w_local_node=1.0, savedir=None, sampling_type='generalized', eta=1.0)
[2024-04-25 10:07:17,497::test::INFO] {'model': {'type': 'diffusion', 'network': 'MDM_full_pocket_coor_shared', 'hidden_dim': 128, 'protein_hidden_dim': 128, 'num_convs': 3, 'num_convs_local': 3, 'protein_num_convs': 2, 'cutoff': 3.0, 'g_cutoff': 6.0, 'encoder_cutoff': 6.0, 'time_emb': True, 'atom_num_emb': False, 'mlp_act': 'relu', 'beta_schedule': 'sigmoid', 'beta_start': 1e-07, 'beta_end': 0.002, 'num_diffusion_timesteps': 1000, 'edge_order': 3, 'edge_encoder': 'mlp', 'smooth_conv': False, 'num_layer': 9, 'feats_dim': 5, 'soft_edge': True, 'norm_coors': True, 'm_dim': 128, 'context': 'None', 'vae_context': False, 'num_atom': 10, 'protein_feature_dim': 31}, 'train': {'seed': 2021, 'batch_size': 16, 'val_freq': 250, 'max_iters': 500, 'max_grad_norm': 10.0, 'num_workers': 4, 'anneal_power': 2.0, 'optimizer': {'type': 'adam', 'lr': 0.001, 'weight_decay': 0.0, 'beta1': 0.95, 'beta2': 0.999}, 'scheduler': {'type': 'plateau', 'factor': 0.6, 'patience': 10, 'min_lr': 1e-06}, 'transform': {'mask': {'type': 'mixed', 'min_ratio': 0.0, 'max_ratio': 1.2, 'min_num_masked': 1, 'min_num_unmasked': 0, 'p_random': 0.5, 'p_bfs': 0.25, 'p_invbfs': 0.25}, 'contrastive': {'num_real': 50, 'num_fake': 50, 'pos_real_std': 0.05, 'pos_fake_std': 2.0}}}, 'dataset': {'name': 'crossdock', 'type': 'pl', 'path': './data/crossdocked_pocket10', 'split': './data/split_by_name.pt'}}
[2024-04-25 10:07:17,498::test::INFO] Loading crossdock data...
Entropy of n_nodes: H[N] -3.543935775756836
[2024-04-25 10:07:17,500::test::INFO] Loading data...
[2024-04-25 10:07:17,642::test::INFO] Building model...
[2024-04-25 10:07:17,643::test::INFO] MDM_full_pocket_coor_shared {'type': 'diffusion', 'network': 'MDM_full_pocket_coor_shared', 'hidden_dim': 128, 'protein_hidden_dim': 128, 'num_convs': 3, 'num_convs_local': 3, 'protein_num_convs': 2, 'cutoff': 3.0, 'g_cutoff': 6.0, 'encoder_cutoff': 6.0, 'time_emb': True, 'atom_num_emb': False, 'mlp_act': 'relu', 'beta_schedule': 'sigmoid', 'beta_start': 1e-07, 'beta_end': 0.002, 'num_diffusion_timesteps': 1000, 'edge_order': 3, 'edge_encoder': 'mlp', 'smooth_conv': False, 'num_layer': 9, 'feats_dim': 5, 'soft_edge': True, 'norm_coors': True, 'm_dim': 128, 'context': 'None', 'vae_context': False, 'num_atom': 10, 'protein_feature_dim': 31}
Entropy of n_nodes: H[N] -3.543935775756836
100%|██████████| 1/1 [00:00<00:00, 98.59it/s]
0%| | 0/1 [00:00<?, ?it/s1 sample: 1000it [08:18, 2.01it/s]
Invalid,continue08:18, 2.56it/s]
Invalid,continue
100%|██████████| 1/1 [08:18<00:00, 498.51s/it]
[2024-04-25 10:15:36,493::test::INFO] valid:0
[2024-04-25 10:15:36,493::test::INFO] stable:0
(DDL) PS D:\Master\code\PMDM-main>

Layne-Huang commented 2 months ago

Hi, you can use split_pocket_ligand.py to extract the pocket; the cutoff is usually set to 20 Å or smaller. If you input the whole protein, the main effect is on the efficiency of molecule generation, since PMDM has to consider more conditional information. You can run some experiments to see whether PMDM generates higher-quality molecules when it is given more of the protein around the pocket.
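To make the idea of a distance cutoff concrete, here is a minimal standard-library sketch of cutting a pocket out of a protein-ligand complex. It is not the repository's split_pocket_ligand.py; the function names `parse_atoms` and `extract_pocket`, the default cutoff of 10 Å, and the assumption that every HETATM record belongs to the ligand (no waters or ions) are all illustrative choices.

```python
import math


def parse_atoms(pdb_text, record):
    """Collect (residue_key, xyz, line) for PDB lines that start with `record`."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(record):
            # Fixed-width PDB columns: chain ID in column 22,
            # residue number in 23-26, x/y/z in 31-38, 39-46, 47-54.
            key = (line[21], line[22:26].strip())
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((key, xyz, line))
    return atoms


def extract_pocket(pdb_text, cutoff=10.0):
    """Return ATOM lines of every residue with an atom within `cutoff` Å of the ligand.

    Assumption: all HETATM records are ligand atoms; strip waters and ions first.
    """
    protein = parse_atoms(pdb_text, "ATOM")
    ligand = parse_atoms(pdb_text, "HETATM")
    near = set()
    for key, xyz, _ in protein:
        if any(math.dist(xyz, lig_xyz) <= cutoff for _, lig_xyz, _ in ligand):
            near.add(key)
    # Keep whole residues so the pocket does not contain fragments of residues.
    return [line for key, _, line in protein if key in near]
```

Under these assumptions, calling `extract_pocket(text, cutoff=10.0)` on the contents of a complex file returns the pocket's ATOM lines, which can be written to a new PDB file. A larger cutoff keeps more residues and therefore gives the model more conditional information to process.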

Layne-Huang commented 2 months ago

For these problems:

1) Please try decreasing the num_atom argument, because most small molecules contain no more than 30 atoms. If you want to generate large molecules, PMDM has to make many attempts to produce a few valid ones. Generation should be faster if you:
a. use a GPU
b. use a small batch size
c. decrease the size of the protein

2) You can simply input your PDB file to PMDM to generate molecules, although it is better to provide only the pocket. If you want to apply it to lead optimization or linker generation, please provide the reference ligand.
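The suggestion about num_atom can be illustrated with a small helper that assembles the sampling command. This is only a sketch: the helper `build_sampling_command` and the file name pocket.pdb are hypothetical, the flags are the ones that appear in the command posted earlier in this thread, and `--batch_size` is an assumed spelling inferred from `batch_size=10` in the logged Namespace.

```python
import shlex


def build_sampling_command(ckpt, pdb_path, num_atom=30, num_samples=2, batch_size=None):
    """Assemble the argument list for sample_for_pdb.py with a smaller num_atom."""
    cmd = [
        "python", "-u", "sample_for_pdb.py",
        "--ckpt", ckpt,
        "--pdb_path", pdb_path,
        "--num_atom", str(num_atom),
        "--save_sdf", "True",
        "--num_samples", str(num_samples),
        "--sampling_type", "generalized",
    ]
    if batch_size is not None:
        # Assumption: flag name inferred from batch_size=10 in the logged Namespace.
        cmd += ["--batch_size", str(batch_size)]
    return cmd


# Print a copy-pasteable command line; pocket.pdb is a placeholder file name.
print(shlex.join(build_sampling_command("500.pt", "pocket.pdb", num_atom=30)))
```

Compared with the original run, the only change is `--num_atom 30` instead of 55. The returned list can also be passed to `subprocess.run` from the repository root.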

Layne-Huang / PMDM

Inquiry on Utilizing Your Trained Model for Molecule Generation in Protein Pockets #19