# of batches

Evert-Homan commented 1 year ago

Hi,

I started a job on a small druggable protein using the -n 50 option. How many batches will be generated? The job has been running for 4 h and reached almost 500 batches.

BW/Evert

LIYUESEN commented 1 year ago

A generative model's output needs to balance creativity and accuracy, much like how we need to find novel yet effective solutions in drug design. The two parameters that control the creativity and accuracy of a generative model are top_k and top_p. As of July 11, 2023, we have made top_k and top_p user-definable input parameters. By default, top_k is set to 5 and top_p to 0.6. This means that the model will select from the top 5 most probable predictions, and it will stop considering additional predictions once the cumulative probability exceeds 0.6.

However, in certain situations, such as protein design for drugs, this default setting might lean towards accuracy, leading to insufficient creativity which could limit the generation of novel drugs. To resolve this issue, you could consider increasing the values of top_p or top_k. For instance, you could set top_p to 0.8, allowing the model to have more options during the generation process, potentially enhancing its creativity. You can adjust this parameter by adding --top_p 0.8 in your command line.
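To make the effect of these two parameters concrete, here is a minimal sketch of top-k followed by top-p (nucleus) filtering on a single next-token distribution. It is an illustration only, not DrugGPT's actual sampling code: the function name `filter_top_k_top_p` and the example probabilities are hypothetical, and the sketch assumes the distribution is renormalized after the top-k cut before the cumulative threshold is applied.

```python
def filter_top_k_top_p(probs, top_k=5, top_p=0.6):
    """Return {token: renormalized probability} left after top-k, then top-p filtering.

    Hypothetical helper for illustration; not taken from drug_generator.py.
    """
    # Step 1 (top-k): keep only the top_k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Renormalize so the surviving probabilities sum to 1 (an assumption of this sketch).
    k_total = sum(p for _, p in ranked)
    ranked = [(token, p / k_total) for token, p in ranked]

    # Step 2 (top-p): keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize again so the kept tokens form a proper distribution to sample from.
    p_total = sum(p for _, p in kept)
    return {token: p / p_total for token, p in kept}


# Made-up next-token distribution over a few SMILES tokens.
next_token_probs = {"C": 0.40, "c": 0.25, "N": 0.12, "O": 0.10, "(": 0.08, "=": 0.05}

default_choice = filter_top_k_top_p(next_token_probs, top_k=5, top_p=0.6)
wider_choice = filter_top_k_top_p(next_token_probs, top_k=7, top_p=0.8)

print(sorted(default_choice))  # candidate tokens under the defaults
print(sorted(wider_choice))    # candidate tokens with looser settings
```

With the defaults, only the two most probable tokens in this toy distribution survive; raising top_p to 0.8 lets four tokens through, which is the sense in which larger values give the model "more options" at each step.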

Yuesen Li

LIYUESEN commented 1 year ago

Hi,

If the drugs generated from this model prove to be effective, I would be very interested to hear from you.

Yuesen Li

Evert-Homan commented 1 year ago

Hi,

For the record: your program does not generate drugs, but potential drug-like ligands at best. In my vocabulary a drug is an approved substance.

BW/Evert

LIYUESEN commented 1 year ago

Hi Evert,

Thanks for your clarification. I completely agree with your point. Indeed, this model generates potential drug-like ligands, not "drugs" in the strictest sense. The term "drug" is more precisely used to describe substances that have passed rigorous testing and received medical use approval. In our paper, we indeed refer to the output of the model as potential ligands, with the aspiration that these potential ligands can be further validated and possibly become actual drugs, although this is a highly challenging process.

Yuesen Li

Evert-Homan commented 1 year ago

Glad you're on board with this. I see this terminology being misused all too often in compchem papers, where people claim to identify 'binders' or 'drugs' in silico without actually testing them experimentally, not even in the simplest binding assay (e.g. DSF) or biochemical assay (e.g. enzyme inhibition).

Evert-Homan commented 1 year ago

Hi,

FYI, the job finished after almost exactly 10 h and 1263 batches. This is on an RTX5000 GPU, which has 3072 CUDA cores. 49 structures were generated. At first glance many of them appear to contain substructures of known ligands for the target I fed the algorithm, which is maybe not surprising.

What will you get for targets for which there are no known ligands in the BindingDB?

BW/Evert

LIYUESEN commented 1 year ago

Hi Evert,

Thanks for using DrugGPT, and I appreciate your detailed feedback. The phenomenon you observed, where generated structures exist in BindingDB, is indeed expected. The model is trained on these data, and the presence of original structures indicates that the model has learned from them. This is largely because the SMILES representation of drugs is more strict compared to natural language.

If you notice that the model repeatedly generates known structures, it might be because the parameters chosen for the model lead it to favor accuracy over novelty during generation. I suspect that this might be because there are a large number of known ligands for the protein you are studying. Hence, returning known ligands is a way for the model to ensure accuracy. To increase the novelty of the model, you can adjust the top_k and top_p parameters on the command line. For example, you can try increasing them to --top_k 7 --top_p 0.8. If the effect is still not obvious, you might need to further increase the values of top_k and top_p. At the same time, you could also try to increase the value of -n, say from -n 50 to -n 500.

As for how to handle existing generated ligands, I am currently developing a post-processing tool that can put these ligands into a separate folder. You can look forward to my next update.
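Until that tool is available, the idea of separating already-known ligands from new ones can be sketched in a few lines. This is only an assumption about what such post-processing might look like, not the planned tool: `split_known_and_novel` is a hypothetical name, the example SMILES are made up, and the comparison is a plain string match, so in practice both sets would first need to be converted to the same canonical SMILES form by a cheminformatics toolkit.

```python
def split_known_and_novel(generated, known):
    """Partition generated SMILES into (known, novel) lists, preserving order.

    Hypothetical helper. Uses exact string comparison, so it only works
    if both inputs are already in the same canonical SMILES form.
    """
    known_set = set(known)
    seen = set()
    already_known, novel = [], []
    for smiles in generated:
        if smiles in seen:
            continue  # skip duplicates produced in different batches
        seen.add(smiles)
        if smiles in known_set:
            already_known.append(smiles)
        else:
            novel.append(smiles)
    return already_known, novel


# Made-up example data: two training-set ligands and four generated strings.
known_ligands = ["CC(=O)Oc1ccccc1C(=O)O", "CCO"]
generated_ligands = ["CCO", "CCN", "CC(=O)Oc1ccccc1C(=O)O", "CCN"]

already_known, novel = split_known_and_novel(generated_ligands, known_ligands)
print(already_known)
print(novel)
```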

Best, Yuesen Li

LIYUESEN commented 1 year ago

Hi Evert,

I have an interesting piece of news to share with you. Through my testing of DrugGPT, I've discovered that it may be capable of "old molecule, new use" predictions. Some molecules it predicted do exist in BindingDB; however, they are not listed there as ligands for the protein target I'm working on but are associated with other proteins.

Best, Yuesen Li

cjvsimoes commented 1 year ago

Hi Yuesen,

I have launched a test run on a protein sequence of around 350 aa residues (FASTA format) and provided a SMILES string of a putative ligand. I also used the -n 50 option. How would you explain the fact that drug_generator.py has been running for more than 5 days, has gone through 22k batches, and has written no file to the ligand_output folder?

Thanks, Carlos

LIYUESEN commented 1 year ago

Hi Carlos,

Firstly, this might be due to the top_k and top_p parameters. I have adjusted their default values in the code based on my own usage and feedback from other users, so you may need to re-download drug_generator.py and rerun your calculation.

Secondly, you are not expected to provide a complete putative ligand's SMILES, but rather a SMILES "prompt". This prompt typically represents a part of the SMILES of the putative ligand, starting from the beginning and ending at some point in the middle. If you provide a complete SMILES of the putative ligand, DrugGPT might consider it as a complete drug, and then it will output the end token and stop generating.
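As a rough illustration of what such a prompt looks like, the sketch below cuts a full SMILES string down to a leading fragment. It is a simplified, hypothetical helper (`make_smiles_prompt` is not part of DrugGPT): it only makes sure the cut does not fall inside a bracket atom such as `[NH3+]`, a two-letter element (`Cl`, `Br`), or a two-digit ring label (`%12`), and it assumes an unfinished ring or branch at the end of the prompt is acceptable.

```python
import re

# Simplified SMILES tokenizer: bracket atoms, Br/Cl, %NN ring labels, else single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def make_smiles_prompt(smiles, fraction=0.5):
    """Return the leading `fraction` of the SMILES tokens as a prompt string.

    Hypothetical helper for illustration only.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    # Keep at least one token so the prompt is never empty.
    n_keep = max(1, int(len(tokens) * fraction))
    return "".join(tokens[:n_keep])


full_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used only as an example
prompt = make_smiles_prompt(full_smiles, fraction=0.5)
print(prompt)
```

The resulting string starts at the beginning of the ligand and stops partway through, so the model still has something left to generate instead of immediately emitting the end token.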

If these suggestions cannot solve your problem, please provide your fasta file and the SMILES prompt to me, and I will analyze and deal with the issue as soon as possible.

Best, Yuesen Li

LIYUESEN / druggpt

# of batches #11