Barabasi-Lab / AI-Bind

Interpretable AI pipeline improving binding predictions for novel protein targets and ligands
MIT License

prediction results #5

Open MaLab-xmh opened 2 weeks ago

MaLab-xmh commented 2 weeks ago

# Expected to have 'InChiKey', 'SMILE', and 'target_aa_code'
nodes_df = pd.read_csv('some_csv_file_path')

# Example entries
# nodes_df['InChiKey'] = ['HUMNYLRZRPPJDN-UHFFFAOYSA-N']
# nodes_df['SMILE'] = ['C1=CC=C(C=C1)C=O']
# nodes_df['target_aa_code'] = sars_targets['Sequence'].tolist()[0]

unseen_nodes_example_5fold_average = vecnet_object.get_fold_averaged_prediction_results(
    model_name = None,
    version_number = None,
    model_paths = [],
    optimal_validation_model = None,
    test_sets = [targets_test[1].dropna()],
    get_drug_embed = True,
    get_target_embed = True,
    drug_filter_list = [],
    target_filter_list = [],
    return_dataframes = True)

Hello, I ran into a problem at the prediction step when running your model. After reading my data into nodes_df, I changed test_sets = [targets_test[1].dropna()] in the code above to test_sets = [nodes_df.dropna()]. Why are the values in the result variable unseen_nodes_example_5fold_average the same? unseen_nodes_example_5fold_average stores the prediction results, right?

MaLab-xmh commented 2 weeks ago

My data is as follows:

InChiKey,SMILE,target_aa_code
NBYCDVVSYOMFMS-VMPREFPWSA-N,COc1cc(cc(c1OC)OC)C(C(=O)N2CCCC[C@H]2C(=O)OC@@HCCCc4cccnc4)(F)F,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
MWZOULASPWUGJJ-NFBUACBFSA-N,CCC@HC@@HC(=O)O)NC(=O)C@HCC(=O)NO,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
NNZDBCPMOOEFTE-UHFFFAOYSA-N,CC(C)CN1c2c(c(n(n2)Cc3cccc4c3cccc4)c5ccncc5)C(=O)N(C1=O)C,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
KKTYZYHUPKXLPL-RIQJEONASA-N,Cc1cccc(c1OCC(=O)NC@@HC@@H(C)C)O)C,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
SJWOFBVBNFLWLP-UHFFFAOYSA-N,c1ccc(cc1)C2(CCCC2)CN,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
OEVYDSSAPNIURZ-AEFFLSMTSA-N,c1ccc2c(c1)CC@HC(=O)N,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
DCJGHBWTJFHQCR-UEHMVRIRSA-N,CN(C)c1cccc(c1)CNCC@HCc4ccccc4)Cc5ccccc5)O,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
XVOYSCVBGLVSOL-UWTATZPHSA-N,C(C@HN)S(=O)(=O)O,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
NSZDJRLPCLOQAM-UHFFFAOYSA-N,c1ccn(c1)CCOc2ccc3c(c2)cc[nH]3,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
LGXVKMDGSIWEHL-UHFFFAOYSA-N,Cc1c(c2ccc(cc2o1)Oc3ccnc4c3ccc(c4)OCCN5CCOCC5)C(=O)NC,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
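Before handing a file like this to the model, it can help to confirm that the header carries the three column names the notebook comment asks for and that no row has an empty field. The following is a minimal sketch using only the standard library; the helper name `check_columns` is hypothetical and not part of AI-Bind, and the target sequence in the sample is shortened purely for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("InChiKey", "SMILE", "target_aa_code")

def check_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return (missing_columns, incomplete_row_numbers) for CSV text.

    Row numbers are 1-based and count data rows only (the header is excluded).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in required if name not in header]
    present = [name for name in required if name in header]
    incomplete = []
    for row_number, row in enumerate(reader, start=1):
        # Flag the row if any required column that exists in the header is empty.
        if any(not (row.get(name) or "").strip() for name in present):
            incomplete.append(row_number)
    return missing, incomplete

# Illustrative sample: the second data row has an empty SMILE field.
sample = (
    "InChiKey,SMILE,target_aa_code\n"
    "SJWOFBVBNFLWLP-UHFFFAOYSA-N,c1ccc(cc1)C2(CCCC2)CN,MGVQVETISPGDG\n"
    "NSZDJRLPCLOQAM-UHFFFAOYSA-N,,MGVQVETISPGDG\n"
)
print(check_columns(sample))
```

An empty first list means all three columns are present; the second list names the data rows that would be dropped by a call such as `nodes_df.dropna()` once the empty fields are read as missing values.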

ChatterjeeAyan commented 2 days ago

Hello. Thank you for your interest in our work. I am sorry for the delayed response. I have been occupied with multiple academic responsibilities.

I ran your data through AI-Bind, and it seems that the SMILES strings in your data trigger an rdkit error (attached image).

rdkit_error_AI_Bind

I would recommend the following (in order):

(i) Check the formatting of the chemical SMILES and their compatibility with rdkit==2022.9.5 (the version used by AI-Bind).

(ii) Try to match the chemical SMILES against a standard chemical database such as PubChem. (I couldn't find these SMILES on PubChem.)

(iii) If your SMILES are compatible with a newer version of rdkit, try retraining VecNet with the new rdkit using this notebook: https://github.com/Barabasi-Lab/AI-Bind/blob/main/VecNet/VecNet-Unseen_Nodes.ipynb.

Hope this helps!
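Step (i) can be sketched as a small filter that collects the strings a parser rejects. This is an illustration under stated assumptions, not AI-Bind code: the helper name `find_unparseable` is hypothetical, and it relies on RDKit's `Chem.MolFromSmiles` returning `None` for a string it cannot parse. The parser is passed in as an argument so the filtering logic can be tried without RDKit installed.

```python
def find_unparseable(smiles_list, parse=None):
    """Return the SMILES strings for which `parse` returns None."""
    if parse is None:
        # Imported lazily so the helper itself does not require RDKit at import time.
        from rdkit import Chem
        parse = Chem.MolFromSmiles
    return [smiles for smiles in smiles_list if parse(smiles) is None]
```

With RDKit installed, `find_unparseable(nodes_df['SMILE'].tolist())` would list the entries to inspect, and `rdkit.__version__` shows which version is doing the parsing.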

Acario commented 2 days ago

Thanks for looking into this issue Ayan.

Looking at Ayan's screenshot, I would like to provide additional direction:

Your SMILES look like modified isomeric SMILES. Isomeric SMILES encode the stereochemistry of the molecule by denoting the orientation around each stereocenter. The description of the orientation is contained within brackets, [ ], on the stereocenter atom. For example, in the first SMILES of the screenshot:

COc1cc(cc(c1OC)OC)C(C(=O)N2CCCC[C@H]2C(=O)OC@@HCCCC4cccnc4)(F)F

The two stereocenters are [C@H] and C@@H. The first is in the correct format, with the stereocenter description inside brackets, while the second is missing its bracket container. It looks like some preparation step in your pipeline is stripping the necessary brackets from the SMILES, which is why rdkit cannot read them.
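The missing-bracket pattern described here can be detected without any chemistry toolkit by scanning for `@` characters that fall outside a `[...]` atom block. The sketch below is a simple heuristic rather than a full SMILES validator, and the function name is hypothetical.

```python
def unbracketed_chirality_positions(smiles):
    """Return indices of '@' characters that sit outside any [...] atom block."""
    positions = []
    inside_bracket = False
    for index, char in enumerate(smiles):
        if char == "[":
            inside_bracket = True
        elif char == "]":
            inside_bracket = False
        elif char == "@" and not inside_bracket:
            positions.append(index)
    return positions

# The first SMILES discussed above: one bracketed stereocenter, one bare one.
example = "COc1cc(cc(c1OC)OC)C(C(=O)N2CCCC[C@H]2C(=O)OC@@HCCCC4cccnc4)(F)F"
print(unbracketed_chirality_positions(example))
```

An empty list means every chirality tag is inside brackets; any returned index points at a tag that lost its brackets somewhere upstream.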

MaLab-xmh commented 1 day ago

Hello, thank you for your reply. It is likely that special characters in the SMILES caused me to paste incorrect strings earlier. I apologize for not checking carefully. I can now use my data to test your model.

However, I am still encountering a problem: when I change the input data, the output remains unchanged. I've attached my data and the script I'm using to pass it in, and I would appreciate it if you could check whether I am passing the input data to the model correctly.

Additionally, could you let me know where I can find or how I can view the output results from the model? I am unsure if the results are being properly generated or stored.

Thank you again for your time and assistance. I look forward to your guidance on these issues.

import sys
sys.path.append('/home/malab21/disk3/xumh/model/AI-Bind-1.1/')
import importlib
from matplotlib.pyplot import figure
from AIBind.import_modules import *
from AIBind import AIBind

importlib.reload(AIBind)

str(subprocess.check_output('nvidia-smi', shell = True)).split('\n')
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Read in drugs and targets dataframes to pass to AIBind after changing column names
with open('data/chemicals_01_w_embed.pkl', 'rb') as file:
    drugs = pkl.load(file)

with open('data/amino_01_w_embed.pkl', 'rb') as file:
    targets = pkl.load(file)

# Ensure correct column names
drugs = drugs.rename(columns = {'Label' : 'InChiKey'})
targets = targets.rename(columns = {'Label' : 'target_aa_code'})

targets_test = []
targets_validation = []
edges_test = []
edges_validation = []
train_sets = []

for run_number in tqdm(range(5)):
    targets_test.append(pd.read_csv('data/test_unseennodes' + str(run_number) + '.csv'))
    edges_test.append(pd.read_csv('data/test_unseenedges' + str(run_number) + '.csv'))
    targets_validation.append(pd.read_csv('data/validation_unseennodes' + str(run_number) + '.csv'))
    edges_validation.append(pd.read_csv('data/validation_unseenedges' + str(run_number) + '.csv'))
    train_sets.append(pd.read_csv('data/train' + str(run_number) + '.csv'))

vecnet_object = AIBind.AIBind(interactions_location = 'data/Network_Derived_Negatives.csv',
                              interactions = None,
                              interaction_y_name = 'Y',

                              absolute_negatives_location = None,
                              absolute_negatives = None,

                              drugs_location = None,
                              drugs_dataframe = drugs,
                              drug_inchi_name = 'InChiKey',
                              drug_smile_name = 'SMILE',

                              targets_location = None,
                              targets_dataframe = targets,
                              target_seq_name = 'target_aa_code',

                              mol2vec_location = None,
                              mol2vec_model = None,

                              protvec_location = None,
                              protvec_model = None,

                              nodes_test = targets_test,
                              nodes_validation = targets_validation,

                              edges_test = edges_test,
                              edges_validation = edges_validation,

                              model_out_dir = 'data/',

                              debug = False)

with open('data/VecNet_unseen_nodes.pickle', 'rb') as file:
    vecnet_object = pkl.load(file)

vecnet_object.mol2vec_location = 'data/model_300dim.pkl'
vecnet_object.protvec_location = 'data/protVec_100d_3grams.csv'

vecnet_object.protvec_model = pd.read_csv('data/protVec_100d_3grams.csv', delimiter = '\t')
vecnet_object.mol2vec_model = word2vec.Word2Vec.load('data/model_300dim.pkl')

# Unzip the models into the right folders
# Can run directly in shell too
# try:
#     subprocess.check_output('mkdir data/vecnet-final/; cd data/vecnet-final; unzip ../vecnet-final.zip', shell = True)
# except:
#     None

# Update model paths
for _model, _path in vecnet_object.model_out_dir.items():
    vecnet_object.model_out_dir[_model] = 'data/' + _path.split('/')[-2] + '/'

vecnet_object.drugs_dataframe = drugs
vecnet_object.targets_dataframe = targets
vecnet_object.get_protvec_embeddings()
vecnet_object.get_mol2vec_embeddings()

# Expected to have 'InChiKey', 'SMILE', and 'target_aa_code'
nodes_df = pd.read_csv('data/testAI.csv')

# Example entries
# nodes_df['InChiKey'] = ['HUMNYLRZRPPJDN-UHFFFAOYSA-N']
# nodes_df['SMILE'] = ['C1=CC=C(C=C1)C=O']
# nodes_df['target_aa_code'] = sars_targets['Sequence'].tolist()[0]

unseen_nodes_example_5fold_average = vecnet_object.get_fold_averaged_prediction_results(
    model_name = None,
    version_number = None,
    model_paths = [],
    optimal_validation_model = None,
    test_sets = [nodes_df],
    get_drug_embed = True,
    get_target_embed = True,
    drug_filter_list = [],
    target_filter_list = [],
    return_dataframes = True)

print(targets_test[1])

print(unseen_nodes_example_5fold_average)

index = [0]

result_df = pd.DataFrame(list(unseen_nodes_example_5fold_average.items()),index=index)

result_df.columns = ['Value']

result_df.to_csv('data/myResults.csv')


MaLab-xmh commented 1 day ago

Hello, this is the result of running my script. The molecule-protein binding probabilities in the output do not correspond to my input data. Did I fail to pass my data to the model correctly?
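One way to test whether predictions correspond to the submitted rows is to compare identifiers on both sides. The sketch below assumes, hypothetically, that the prediction output can be reduced to a list of InChiKey values; the actual structure returned by `get_fold_averaged_prediction_results` is not described in this thread, so the extraction step would need adapting. The function name `compare_keys` is illustrative only.

```python
def compare_keys(input_keys, output_keys):
    """Summarise the overlap between identifiers sent in and identifiers returned."""
    sent, returned = set(input_keys), set(output_keys)
    return {
        "missing_from_output": sorted(sent - returned),
        "unexpected_in_output": sorted(returned - sent),
    }

# Illustrative values: two keys from the posted data, one key from the notebook example.
report = compare_keys(
    ["SJWOFBVBNFLWLP-UHFFFAOYSA-N", "NSZDJRLPCLOQAM-UHFFFAOYSA-N"],
    ["HUMNYLRZRPPJDN-UHFFFAOYSA-N"],
)
print(report)
```

If every submitted key is missing and only unfamiliar keys come back, the model is most likely still scoring a previously loaded test set rather than the new dataframe.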
