LIYUESEN / druggpt

DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins
GNU General Public License v3.0

The attention mask and the pad token id were not set #12

Open fabiotrovato opened 1 year ago

fabiotrovato commented 1 year ago

Hi,

I am running DrugGPT on a linux cluster, using the -p option and -n 100. The calculation succeeds, however, I get messages like this one:

=====Batch 1===== Generating ligand SMILES ... The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:50256 for open-end generation.

I am not sure I understand this message or how to provide the attention mask. Some additional explanation in the README, or a short documentation section, would probably help.

Thank you, Fabio

LIYUESEN commented 1 year ago

Hi Fabio,

I've already addressed this issue in a previous discussion on GitHub. You can find detailed information through the following link: https://github.com/LIYUESEN/druggpt/issues/10

Best, Yuesen Li

fabiotrovato commented 1 year ago

Hi Yuesen, thanks for your prompt reply. I have read the thread, but I am not sure I understood it. When you mentioned "mask out filled values in the sequence" and "your input sequences have the same length and you have not filled any sequences", which sequence or sequences are you referring to?

How might not having the attention mask influence my results (one protein sequence only)? Note that for my protein there are no known ligands that bind to it. After the druggpt job finished, some generated molecules were very large and not very drug-like. Some of them look like peptides.

LIYUESEN commented 1 year ago

The code consistently generates sequences starting with "generated", resulting in uniform lengths, so we do not perform any padding operation on the generated sequences.
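To see why the warning is harmless when every sequence in a batch has the same length, here is a minimal pure-Python sketch (an illustration, not DrugGPT's actual code) of how an attention mask is built from a padded batch. The mask marks real tokens with 1 and padding with 0; with uniform lengths no padding is ever added, so the mask is all ones and omitting it cannot change the output.

```python
# Illustrative sketch, not DrugGPT's actual code: build attention masks for
# a batch of token-id sequences. GPT-2 has no dedicated pad token, so the
# eos token id (50256) is commonly reused for padding, which is exactly
# what the warning message says it is doing.
PAD_TOKEN_ID = 50256

def pad_batch(sequences):
    """Pad sequences to the longest length; return (padded ids, attention masks)."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [PAD_TOKEN_ID] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks

# Uniform-length batch: no padding is added and every mask entry is 1,
# so the attention mask carries no information here.
ids, masks = pad_batch([[5, 6, 7], [8, 9, 10]])
```

With ragged batches the mask would matter; that is the case the transformers warning is guarding against.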

LIYUESEN commented 1 year ago

The generation of peptide-like sequences has also been observed in my own testing. This might be due to the presence of peptides in the dataset used for training. For now, you can choose to ignore the peptides produced. In the next update of DrugGPT's training code, you will have the opportunity to clean the dataset and start training from scratch to address this issue.
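For readers who want to ignore the peptide-like outputs in the meantime, here is a crude heuristic sketch (my own assumption, not part of DrugGPT): peptide-like SMILES strings tend to contain many repeated amide motifs, which often appear textually as the substring "C(=O)N". Counting that motif gives a quick first-pass filter; a proper filter would use RDKit substructure matching rather than raw string counts.

```python
# Crude illustrative heuristic (not part of DrugGPT): prune SMILES strings
# that contain many textual amide motifs, which is typical of peptide-like
# chains. String counting misses alternative SMILES spellings of the amide
# bond; RDKit substructure search would be the robust approach.

def amide_motif_count(smiles: str) -> int:
    """Count occurrences of the textual amide motif in a SMILES string."""
    return smiles.count("C(=O)N")

def prune_peptide_like(smiles_list, max_amides=2):
    """Keep molecules with at most `max_amides` amide motifs."""
    return [s for s in smiles_list if amide_motif_count(s) <= max_amides]

candidates = [
    "CC(=O)Nc1ccc(O)cc1",                         # single amide, kept
    "NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)O",  # peptide-like chain, dropped
]
kept = prune_peptide_like(candidates)
```

The threshold of two amide motifs is arbitrary and should be tuned against the molecules you actually see.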

Best, Yuesen Li

fabiotrovato commented 1 year ago

Thanks! Given that I obtain such a heterogeneous set of ligands, how credible would you say the results are? I get peptide-like molecules, and the sizes are also highly variable. I can prune the peptide-like sequences, but there might be other features that make the remaining molecules unrealistic. How do I decide which ones to retain? Is there a score that druggpt can provide to rank the generated molecules?

LIYUESEN commented 1 year ago

Hi Fabio,

This is a great question. Evaluation of generative models is indeed a new issue. Here are some relevant articles on the topic:

https://doi.org/10.1021/acs.jcim.2c01355
https://doi.org/10.3389/fphar.2020.565644

In my opinion, the generated small molecules can be evaluated with software such as AutoDock Vina for molecular docking and scoring. I actually evaluate and score the compounds I generate using AutoDock Vina, and then rank them based on those scores.

You should also be aware that some of the generated small molecules may already exist. For this issue, please refer to https://github.com/LIYUESEN/druggpt/issues/11

Best, Yuesen Li
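The ranking step described above can be sketched as follows (filenames and affinities are made up for illustration). AutoDock Vina reports binding affinity in kcal/mol, where a more negative value means a stronger predicted binding, so ranking is simply an ascending sort of the scores.

```python
# Illustrative sketch: rank generated ligands by AutoDock Vina affinity.
# The ligand names and scores below are hypothetical; in practice each
# affinity would be read from the corresponding Vina run's output.

def rank_by_vina_score(scores: dict) -> list:
    """Return (ligand, affinity) pairs, best (most negative) first."""
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical affinities (kcal/mol) collected from separate docking runs.
affinities = {
    "ligand_003.pdbqt": -7.9,
    "ligand_001.pdbqt": -6.2,
    "ligand_002.pdbqt": -9.1,
}
ranked = rank_by_vina_score(affinities)
# ranked[0] is the strongest predicted binder (ligand_002.pdbqt here)
```

Docking scores are a coarse filter, not ground truth; they are best used to prioritize candidates for closer inspection.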

fabiotrovato commented 1 year ago

Thanks for suggesting the two papers!