HannesStark / EquiBind

EquiBind: geometric deep learning for fast predictions of the 3D structure in which a small molecule binds to a protein

Support for inference on multiple ligands in sdf and smi formats #37

Closed · amfaber closed 2 years ago

amfaber commented 2 years ago

The main parts of these suggested changes are datasets/multiple_ligands.py and multiligand_inference.py.

datasets/multiple_ligands.py implements a PyTorch Dataset that loads ligands from a given .sdf or .smi file; combined with a DataLoader, it batches the data for better GPU utilization.
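As a rough illustration, here is a minimal sketch of such a dataset. The class name and details are hypothetical; the actual file does more work, in particular building the ligand graph for each molecule:

```python
from rdkit import Chem
from torch.utils.data import Dataset

class LigandDataset(Dataset):
    """Hypothetical sketch of a Dataset over an .sdf or .smi ligand file.
    The real datasets/multiple_ligands.py also constructs the DGL ligand
    graph before handing the sample to the DataLoader."""

    def __init__(self, ligand_file):
        if ligand_file.endswith(".sdf"):
            # sanitize=False lets broken records surface as None instead of raising
            self.supplier = Chem.SDMolSupplier(ligand_file, sanitize=False)
        else:  # .smi: one SMILES per line
            self.supplier = Chem.SmilesMolSupplier(ligand_file, titleLine=False)

    def __len__(self):
        return len(self.supplier)

    def __getitem__(self, idx):
        mol = self.supplier[idx]  # None if RDKit could not parse the record
        return idx, mol
```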

multiligand_inference.py uses this dataloader to perform inference on a given .sdf or .smi file, writing results as the inference runs, as a safeguard against losing work if the process crashes or is interrupted.

Suggested usage is

python multiligand_inference.py -o path/to/output_directory -r path/to/receptor.pdb -l path/to/ligands.sdf

This runs EquiBind on every ligand in ligands.sdf against the protein in receptor.pdb. The output is 3 files in output_directory with the following names and contents:

failed.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference failed in a way that was caught and handled.

success.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference succeeded.

output.sdf - contains the conformers produced by EquiBind in .sdf format.
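As an illustration of the write-as-you-go behavior with these three files, here is a hedged sketch; run_inference stands in for the actual inference loop and is not the PR's real function name:

```python
from rdkit import Chem

# Hypothetical sketch: append and flush after every ligand so that a crash
# or interruption loses at most the work currently in flight.
writer = Chem.SDWriter("output_directory/output.sdf")
with open("output_directory/success.txt", "a") as success, \
     open("output_directory/failed.txt", "a") as failed:
    for idx, name, mol in run_inference():  # placeholder for the real loop
        if mol is not None:
            writer.write(mol)
            success.write(f"{idx} {name}\n")
        else:
            failed.write(f"{idx} {name}\n")
        writer.flush()
        success.flush()
        failed.flush()
writer.close()
```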

Alongside these, a number of options are provided. A few of interest are:

--no_skip: By default, the script looks for failed.txt and success.txt in output_directory, skips every ligand whose index is listed in those files (considering them previously completed work), and appends any further results to the files already present (the skip logic is sketched after these options). --no_skip turns this behavior off and overwrites the three files in output_directory if they are already present.

--batch_size: Controls the batch size for sending the receptor and ligand graphs to the GPU. Be aware that due to how batching of graphs works, a large batch size can take up a lot of GPU memory.

--n_workers_data_load: Controls the number of workers spawned by the PyTorch DataLoader. These are responsible for the preprocessing of each batch, namely generating the ligand graph for each ligand.
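To make the skip and batching behavior concrete, here is a hedged sketch of how these pieces could fit together. It assumes each dataset sample is an (index, DGL graph) pair and reuses the hypothetical LigandDataset name from above; the actual implementation in the PR differs in detail:

```python
import os

import dgl
from torch.utils.data import DataLoader, Subset

def finished_indices(output_directory):
    # Default skip behavior: indices already listed in failed.txt or
    # success.txt count as previously completed work.
    done = set()
    for fname in ("failed.txt", "success.txt"):
        path = os.path.join(output_directory, fname)
        if os.path.exists(path):
            with open(path) as f:
                done.update(int(line.split()[0]) for line in f if line.strip())
    return done

def collate(samples):
    # Merge the per-ligand DGL graphs into one batched graph so the GPU
    # runs the whole batch in a single forward pass.
    indices, graphs = zip(*samples)
    return list(indices), dgl.batch(graphs)

dataset = LigandDataset("path/to/ligands.sdf")  # hypothetical, see above
todo = [i for i in range(len(dataset))
        if i not in finished_indices("output_directory")]
loader = DataLoader(Subset(dataset, todo), batch_size=8, num_workers=4,
                    collate_fn=collate)
```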

HannesStark commented 2 years ago

Thank you very much! I merged the commits with cherry-pick!

amfaber commented 2 years ago

Awesome that you could use the work! 🚀 I see that you didn't include the failsafe to restore lig_graph.ndata["feats"] in case of an assertion error, so I'll post my reasoning here for completeness.

During my testing I found that without it, the data in lig_graph.ndata["feats"] is changed in place by the EquiBind model, meaning the same lig_graph can't be run twice: the dimensions of lig_graph.ndata["feats"] have changed after the first run. This becomes a problem when the inputs are batched and the lig_graph contains many ligands, any of which might fail. In the case of a failure, the script unbatches the graph and runs each ligand individually, which is only possible if the lig_graph is intact from the first attempt.

If you see any problems with the implementation, I would be very happy to know :)
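For completeness, a minimal sketch of the failsafe I described; the function name and call signature are placeholders, not EquiBind's actual API:

```python
def run_batch_with_failsafe(model, lig_graphs, rec_graphs):
    # Hypothetical sketch: the forward pass mutates lig_graphs.ndata["feats"]
    # in place, so snapshot the tensor before running the model.
    feats_before = lig_graphs.ndata["feats"].detach().clone()
    try:
        return model(lig_graphs, rec_graphs)
    except AssertionError:
        # Restore the features so the still-intact batched graph can be
        # split with dgl.unbatch and each ligand retried individually.
        lig_graphs.ndata["feats"] = feats_before
        raise
```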

HannesStark commented 2 years ago

So is this necessary for your code to work?

amfaber commented 2 years ago

Only for proper error handling in the case that some of the ligands in the .sdf file fail to run through EquiBind for whatever reason.