bpmunson / polygon

POLYGON VAE For de novo Polypharmacology
MIT License

An error in train_ligand_binding_model #6

Open yanbosmu opened 1 month ago

yanbosmu commented 1 month ago

Training the VAE was successful, but an error occurred when running train_ligand_binding_model.

polygon train_ligand_binding_model --uniprot_id Q9Y572O --binding_db_path /home/yanbosmu/Bioinfo/polygon/data/output.csv --output_path /home/yanbosmu/Bioinfo/polygon/data/Q9Y572_ligand_binding.pkl

Traceback (most recent call last):
  File "/home/yanbosmu/mambaforge/bin/polygon", line 8, in <module>
    sys.exit(main())
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/polygon/run.py", line 849, in main
    r = train_ligand_binding_model_main(args)
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/polygon/run.py", line 810, in train_ligand_binding_model_main
    train_ligand_binding_model( args.uniprot_id,
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/polygon/utils/train_ligand_binding_model.py", line 17, in train_ligand_binding_model
    binddb = pd.read_csv(binding_db_path, sep="\t",header=0,low_memory=False,error_bad_lines=False)
TypeError: read_csv() got an unexpected keyword argument 'error_bad_lines'

GPT said this is because pandas 2.2 no longer has the error_bad_lines argument, so I deleted it in "train_ligand_binding_model.py". But then I got a new error, listed below.
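For reference, error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; its replacement is the on_bad_lines keyword. A version-tolerant read could look like the following sketch (the helper name is illustrative, not POLYGON's actual code):

```python
import pandas as pd

def read_binding_db(binding_db_path):
    """Read a BindingDB dump, skipping malformed rows on any pandas version."""
    try:
        # pandas >= 1.3: 'on_bad_lines' replaces the removed 'error_bad_lines'
        return pd.read_csv(binding_db_path, sep="\t", header=0,
                           low_memory=False, on_bad_lines="skip")
    except TypeError:
        # pandas < 1.3 only understands the old keyword
        return pd.read_csv(binding_db_path, sep="\t", header=0,
                           low_memory=False, error_bad_lines=False)
```

This avoids pinning an old pandas just for one keyword.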

polygon train_ligand_binding_model --uniprot_id Q9Y572O --binding_db_path /home/yanbosmu/Bioinfo/polygon/data/output.csv --output_path /home/yanbosmu/Bioinfo/polygon/data/Q9Y572_ligand_binding.pkl

Traceback (most recent call last):
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'UniProt (SwissProt) Primary ID of Target Chain'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yanbosmu/mambaforge/bin/polygon", line 8, in <module>
    sys.exit(main())
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/polygon/run.py", line 849, in main
    r = train_ligand_binding_model_main(args)
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/polygon/run.py", line 810, in train_ligand_binding_model_main
    train_ligand_binding_model( args.uniprot_id,
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/polygon/utils/train_ligand_binding_model.py", line 20, in train_ligand_binding_model
    d = binddb[binddb['UniProt (SwissProt) Primary ID of Target Chain']==target_unit_pro_id]
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err

Any ideas or solutions for this? Is it because I am using the newest pandas version?

Feriolet commented 1 month ago

Hi

I can't test your command line right now, but I think there are several reasons why yours is not working:

  1. For your first problem, as you mentioned, your pandas version may be too new for polygon's script, which targets Python 3.8 (the version I'm using right now).

  2. For your second problem, the KeyError may be caused by your UniProt ID. Indeed, I tried to find your UniProt ID (Q9Y572O) online, but it does not exist. You can try double-checking your UniProt ID.
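To tell these two causes apart (wrong column header vs. absent UniProt ID), a quick diagnostic can report whether the expected column exists and how many rows match the ID. A sketch, assuming the tab-separated BindingDB dump and pandas >= 1.3 (the helper name is hypothetical):

```python
import pandas as pd

COL = "UniProt (SwissProt) Primary ID of Target Chain"

def diagnose(binding_db_path, uniprot_id, sep="\t"):
    """Report whether the target column exists and how many rows match the ID."""
    df = pd.read_csv(binding_db_path, sep=sep, header=0,
                     low_memory=False, on_bad_lines="skip")
    if COL not in df.columns:
        # A missing column usually means the wrong separator or a converted file
        return f"column missing; file has: {list(df.columns)[:5]}..."
    n = (df[COL] == uniprot_id).sum()
    return f"{n} rows match {uniprot_id}"
```

Zero matching rows with the column present would point at the ID; a missing column would point at the file format.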

bpmunson commented 1 month ago

As Feriolet mentioned, the uniprot ID "Q9Y572O" does not seem to be valid. What protein target are you attempting to train a model for?

This issue does highlight that POLYGON should be more graceful when invalid IDs are used.
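A sketch of what a more graceful guard could look like (hypothetical code, not the actual POLYGON implementation): validate the filtered frame before fitting and fail with an actionable message.

```python
import pandas as pd

def select_target_rows(binddb, uniprot_id):
    """Filter BindingDB rows for one target, with explicit failure modes."""
    col = "UniProt (SwissProt) Primary ID of Target Chain"
    if col not in binddb.columns:
        raise ValueError(
            f"Expected column {col!r} not found; is the input the raw "
            "tab-separated BindingDB dump (not a converted CSV)?")
    d = binddb[binddb[col] == uniprot_id]
    if d.empty:
        raise ValueError(
            f"No BindingDB entries for UniProt ID {uniprot_id!r}; "
            "check the ID for typos (e.g. 'Q9Y572O' vs 'Q9Y572').")
    return d
```

Either failure then surfaces as a one-line message instead of a pandas KeyError deep in a traceback.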

Best, Brenton

yanbosmu commented 1 month ago

Thank you for your advice. I used the correct UniProt ID "Q9Y572", reinstalled pandas 1.2.0, and also changed to Python 3.9, so the TypeError: read_csv() got an unexpected keyword argument 'error_bad_lines' no longer occurs. But I still get:

  File "/home/yanbosmu/mambaforge/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'UniProt (SwissProt) Primary ID of Target Chain'

I believe something is wrong with my BindingDB file. I downloaded the TSV file from the BindingDB website and then converted it into a CSV file.

Can you share the CSV file used in the tutorial? That would help me find the cause. Thank you so much!

yanbosmu commented 1 month ago

(/home/yanbosmu/your_path/polygonfinal) 20:15:44yanbosmu@Yanbosmu-PC:~/Bioinfo/polygonfinal/polygon$ polygon train_ligand_binding_model --uniprot_id Q9Y572 --binding_db_path /home/yanbosmu/Bioinfo/polygon/data/outputxx.csv --output_path /home/yanbosmu/Bioinfo/polygon/data/Q9Y572_ligand_binding.pkl

Traceback (most recent call last):
  File "/home/yanbosmu/your_path/polygonfinal/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'UniProt (SwissProt) Primary ID of Target Chain'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/yanbosmu/your_path/polygonfinal/bin/polygon", line 8, in <module>
    sys.exit(main())
  File "/home/yanbosmu/your_path/polygonfinal/lib/python3.9/site-packages/polygon/run.py", line 849, in main
    r = train_ligand_binding_model_main(args)
  File "/home/yanbosmu/your_path/polygonfinal/lib/python3.9/site-packages/polygon/run.py", line 810, in train_ligand_binding_model_main
    train_ligand_binding_model( args.uniprot_id,
  File "/home/yanbosmu/your_path/polygonfinal/lib/python3.9/site-packages/polygon/utils/train_ligand_binding_model.py", line 20, in train_ligand_binding_model
    d = binddb[binddb['UniProt (SwissProt) Primary ID of Target Chain']==target_unit_pro_id]
  File "/home/yanbosmu/your_path/polygonfinal/lib/python3.9/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/yanbosmu/your_path/polygonfinal/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'UniProt (SwissProt) Primary ID of Target Chain'

Feriolet commented 1 month ago

Can I ask which BindingDB file you downloaded? You do not need to convert the TSV to CSV, because the script splits the file on tabs (i.e., it expects TSV).
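A minimal reproduction of why a converted CSV triggers the KeyError: parsing comma-separated text with sep="\t" leaves the entire header line as a single column name, so the expected column is never found.

```python
import io
import pandas as pd

# Comma-separated content, as produced by a TSV-to-CSV conversion
csv_text = ("UniProt (SwissProt) Primary ID of Target Chain,Ligand SMILES\n"
            "Q9Y572,CCO\n")

# Reading with a tab separator collapses every row into one column
df = pd.read_csv(io.StringIO(csv_text), sep="\t", header=0)
print(df.columns.tolist())
# → ['UniProt (SwissProt) Primary ID of Target Chain,Ligand SMILES']
```

Looking up 'UniProt (SwissProt) Primary ID of Target Chain' in that one-column frame raises exactly the KeyError shown above.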

I used the BindingDB_All_202407.tsv file in the BindingDB website.

If you still have the KeyError, it may mean that the BindingDB file does not contain a strong ligand that binds to your Q9Y572 protein.

However, that would be odd, because searching the BindingDB website does find ligands binding to that UniProt ID, filtered by IC50 <= 1000 nM (sorry for the bad image): [screenshot of BindingDB search results]

yanbosmu commented 1 month ago


Thank you so much! Yes, it was because of the BindingDB file: I had downloaded only the Q9Y572-related ligands from the website. After using the BindingDB_All_202407.tsv file you mentioned, it works fine!

Thank you!

DM0815 commented 1 month ago


Excuse me, I used another protein ID to generate a model. The code ran OK, but it did not generate any pkl file. Did you run into this problem? During the process, it prints warnings like: "Skipping line 2874651: expected 194 fields, saw 266; Skipping line 2874652: expected 194 fields, saw 266". Do you have any suggestions? Thanks.

yanbosmu commented 1 month ago

Just ignore those warnings. I also saw those errors but still got the pkl files.

Feriolet commented 1 month ago

Yep, I also ignored those warnings and still got the pkl result.
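For context, those "Skipping line N: expected X fields, saw Y" messages come from pandas' bad-line handling: rows in the large BindingDB dump with extra tab characters are dropped, and the rest of the file loads normally. A small reproduction (assuming pandas >= 1.3):

```python
import io
import pandas as pd

# Third data row has an extra field, like the malformed BindingDB rows
tsv_text = "a\tb\n1\t2\n3\t4\t5\n6\t7\n"

# on_bad_lines="warn" emits the 'Skipping line ...' message and drops the row
df = pd.read_csv(io.StringIO(tsv_text), sep="\t", on_bad_lines="warn")
print(len(df))  # the remaining well-formed rows parse normally
```

So the warnings are cosmetic as long as the rows you care about are well-formed.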

DM0815 commented 1 month ago


@Feriolet Dear all, I solved that problem by revising the script to use my protein ID. But in the last step, using the chemical embedding to design polypharmacology compounds, I met another error:

  File "/POLYGON/lib/python3.9/site-packages/torch/nn/utils/rnn.py", line 482, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
  File "/POLYGON/lib/python3.9/site-packages/torch/nn/utils/rnn.py", line 397, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences.

I was wondering whether you have had the same problem and how you solved it. Thanks.

Feriolet commented 1 month ago

Can you send the full error message? The log you sent is only from the torch package, not the polygon package. The error indicates that no torch tensor was passed to the pack_sequence function. Please see the documentation here: https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_sequence.html

The error can be reproduced by the following:

>>> import torch
>>> from torch.nn.utils.rnn import pack_sequence
>>> pack_sequence(torch.tensor([]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/miniforge3/envs/envname/lib/python3.9/site-packages/torch/nn/utils/rnn.py", line 484, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
  File "/Users/user/miniforge3/envs/envname/lib/python3.9/site-packages/torch/nn/utils/rnn.py", line 398, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences
DM0815 commented 1 month ago


Yes, I'm very confused. The full error message is as follows:

2024-08-17 16:22:55,376 [DEBUG ] Making scoring function, fpscores.pkl.gz fpscores.pkl.gz
[16:22:55] Explicit valence for atom # 0 N, 4, is greater than permitted
/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/polygon/utils/utils.py:82: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model.load_state_dict(torch.load(model_definition, map_location="cpu"))
Traceback (most recent call last):
  File "/home/dm/anaconda3/envs/py3.9/bin/polygon", line 8, in <module>
    sys.exit(main())
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/polygon/run.py", line 841, in main
    generate_main(args)
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/polygon/run.py", line 658, in generate_main
    scoring_function = build_scoring_function(
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/polygon/utils/utils.py", line 293, in build_scoring_function
    scorers[name] = LatentDistance( smiles_targets=smiles_targets,
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/polygon/utils/custom_scoring_fcn.py", line 154, in __init__
    self.z_targets = self.model.encode(self.x_targets)
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/polygon/vae/vae_model.py", line 179, in encode
    z, kl_loss, mu = self.forward_encoder(x, return_mu=True)
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/polygon/vae/vae_model.py", line 225, in forward_encoder
    x = nn.utils.rnn.pack_sequence(x)
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/utils/rnn.py", line 482, in pack_sequence
    return pack_padded_sequence(pad_sequence(sequences), lengths, enforce_sorted=enforce_sorted)
  File "/home/dm/anaconda3/envs/py3.9/lib/python3.9/site-packages/torch/nn/utils/rnn.py", line 397, in pad_sequence
    return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: received an empty list of sequences

I'm not sure which step went wrong. In scoring_definition.csv, should the pkl file and smi file be matched? I think my files are matched correctly. Could the pre-trained model cause this error?
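The "received an empty list of sequences" error means the list of target SMILES handed to the encoder was empty. One way to rule out the simplest cause before a long run is a pre-flight check that the target .smi file actually contains SMILES (hypothetical helper; molecules RDKit rejects, like the "Explicit valence" one in the log above, would still need separate filtering):

```python
def check_smi_file(path):
    """Fail early if the .smi file is empty or contains only blank lines."""
    with open(path) as f:
        # .smi convention: SMILES first, optional name after whitespace
        smiles = [line.split()[0] for line in f if line.strip()]
    if not smiles:
        raise ValueError(f"{path} contains no SMILES; the encoder would "
                         "receive an empty list of sequences")
    return smiles
```

Running this on each .smi file referenced in scoring_definition.csv would catch an empty or mismatched file before the VAE encoding step.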

Feriolet commented 4 weeks ago

Yes, the paths you put in scoring_definition.csv should match the corresponding target of interest. I am not using POLYGON anymore, so I can't try to reproduce your error. My wild guess is that there is no potent ligand available in the BindingDB .tsv file. Can you check on the BindingDB website whether a potent ligand actually exists for your target?