carbonsilicon-ai / CarsiDock

Official repo of "CarsiDock: a deep learning paradigm for accurate protein–ligand docking and screening based on large-scale pre-training" proposed by CarbonSilicon AI.
http://dx.doi.org/10.1039/D3SC05552C
Apache License 2.0
65 stars, 6 forks

Fail to reproduce Posebuster benchmark result #11

Open simmed00 opened 2 weeks ago

simmed00 commented 2 weeks ago

I followed the instructions for run_docking_inference.py, supplying the PDB file from the PoseBusters dataset, the ligand SDF file from the PoseBusters dataset, and a txt file of SMILES converted from the ligand. I ran all 428 complexes in PoseBusters, generating 1 conformation from the SMILES and saving 10 docking poses per conformation. I checked the output poses from CarsiDock with the following code:

    from posebusters import PoseBusters

    # root, subject and carsidock_root are assumed to be set earlier for each complex
    true_file = root + '/' + subject + '/' + subject + '_ligand.sdf'    # crystal ligand
    cond_file = root + '/' + subject + '/' + subject + '_protein.pdb'   # receptor
    test_file = carsidock_root + '/' + subject + '_protein_0.sdf'       # CarsiDock output
    buster = PoseBusters(config="redock")
    try:
        df = buster.bust([test_file], true_file, cond_file, full_report=True)
        print(df)
        df.to_csv(root + '/' + subject + '/carsidock' + '.csv')
    except Exception as e:
        print(subject, e)

The final result is: 187 (43%) pass < 2 Å RMSD and PB-valid, 337 (78%) pass < 2 Å RMSD and PB-valid. This is quite a gap compared to the reported performance. Is there anything I may have done wrong in the above process? Thank you very much for your help.
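For reference, docking benchmarks of this kind are commonly summarised with two separate rates: the fraction of complexes under the RMSD cutoff, and the fraction under the cutoff that also pass every validity check. The following is a minimal sketch of that tally, assuming the per-complex results have already been collected into simple records. The record layout (`rmsd`, `checks`) and the function name are hypothetical and are not part of PoseBusters or CarsiDock.

```python
from typing import Iterable, Mapping


def success_rates(results: Iterable[Mapping], rmsd_cutoff: float = 2.0) -> dict:
    """Tally two success criteria over per-complex results.

    Each record is a hypothetical dict: {"rmsd": float, "checks": [bool, ...]},
    where "checks" holds the outcome of every validity test for that pose.
    """
    total = 0
    rmsd_ok = 0          # RMSD below the cutoff, validity ignored
    rmsd_and_valid = 0   # RMSD below the cutoff AND every check passed
    for record in results:
        total += 1
        if record["rmsd"] < rmsd_cutoff:
            rmsd_ok += 1
            if all(record["checks"]):
                rmsd_and_valid += 1

    def pct(count: int) -> float:
        # Guard against an empty result set.
        return 100.0 * count / total if total else 0.0

    return {
        "total": total,
        "rmsd_ok": rmsd_ok,
        "rmsd_and_valid": rmsd_and_valid,
        "rmsd_ok_pct": pct(rmsd_ok),
        "rmsd_and_valid_pct": pct(rmsd_and_valid),
    }


if __name__ == "__main__":
    # Toy records for illustration only, not benchmark data.
    demo = [
        {"rmsd": 1.0, "checks": [True, True]},
        {"rmsd": 1.5, "checks": [True, False]},
        {"rmsd": 3.0, "checks": [True, True]},
    ]
    print(success_rates(demo))
```

With 428 complexes, counts of 187 and 337 correspond to about 43.7% and 78.7% respectively.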

gitabtion commented 2 weeks ago

Thanks for your attention to CarsiDock! Please provide the docking script and arguments.

simmed00 commented 2 weeks ago

The command is built as:

    'python run_docking_inference.py --pdb_file ' + pdb_file + ' --sdf_file ' + sdf_file + ' --smiles_file ' + smiles_file + ' --output_dir /root/CarsiDock/outputs/posebusters --cuda_convert'

The PDB file is taken directly from the PoseBusters benchmark set, and so is the SDF file. smiles_file contains the SMILES converted from the SDF using RDKit.

Other default settings in run_docking_inference.py:

    parser.add_argument('--pdb_file', default="example_data/4YKQ_hsp90_40_water.pdb", help='crystal protein .pdb file.')
    parser.add_argument('--sdf_file', default='example_data/4YKQ_hsp90_40.sdf', help='crystal ligand .sdf file, we need this to get the pocket.')
    parser.add_argument('--smiles_file', default=None, help='smiles file to docking, txt file with One smiles per line. You dont need to provide it when redocking.')
    parser.add_argument('--output_dir', default='outputs/conformer')
    parser.add_argument('--num_conformer', default=1, help='number of initial conformer, resulting in num_conformer * num_conformer docking conformations.')
    parser.add_argument('--ckpt_path', default='checkpoints/carsidock_230731.ckpt')
    parser.add_argument('--num_threads', default=1, help='recommend 1')
    parser.add_argument('--cuda_convert', action='store_true', help='use cuda to accelerate distance matrix to coordinate.')
    parser.add_argument('--cuda_device_index', default=0, type=int, help="gpu device index")

gitabtion commented 1 week ago

The recommended setting for --num_conformer is 10. If that is too slow, set it to at least 3.
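A minimal sketch of how this setting could be passed when scripting the benchmark run, building the command as an argument list rather than by string concatenation so that paths with spaces survive intact. The flags are the ones shown in this thread. The helper name, the example directory, the complex ID and the SMILES file name are hypothetical.

```python
import shlex
from pathlib import Path


def build_docking_command(pdb_file, sdf_file, smiles_file, output_dir,
                          num_conformer=10, cuda_convert=True):
    """Return the argument list for one run_docking_inference.py call.

    Hypothetical helper; only the flag names come from the script's parser.
    """
    cmd = [
        "python", "run_docking_inference.py",
        "--pdb_file", str(pdb_file),
        "--sdf_file", str(sdf_file),
        "--output_dir", str(output_dir),
        "--num_conformer", str(num_conformer),
    ]
    if smiles_file is not None:
        # Per the help text, the SMILES file is not needed when redocking.
        cmd += ["--smiles_file", str(smiles_file)]
    if cuda_convert:
        cmd.append("--cuda_convert")
    return cmd


if __name__ == "__main__":
    root = Path("posebusters_benchmark_set")  # hypothetical location
    subject = "EXAMPLE_ID"                    # hypothetical complex ID
    cmd = build_docking_command(
        pdb_file=root / subject / f"{subject}_protein.pdb",
        sdf_file=root / subject / f"{subject}_ligand.sdf",
        smiles_file=root / subject / f"{subject}_smiles.txt",  # hypothetical name
        output_dir="outputs/posebusters",
    )
    # Print the command for inspection; to execute it, pass the list to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run` avoids shell quoting issues. Lowering `num_conformer` to 3 trades accuracy for speed, as noted above.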

simmed00 commented 1 week ago

Thanks a lot for the suggestion. One more thing: I notice the output is ranked by RMSD. Is this RMSD calculated between the prediction and the GT pose, or is it predicted by the model itself? If no GT pose is available, how should one pick a pose from a large number of predictions?