Open MaLab-xmh opened 2 weeks ago
My data is as follows:
InChiKey,SMILE,target_aa_code
NBYCDVVSYOMFMS-VMPREFPWSA-N,COc1cc(cc(c1OC)OC)C(C(=O)N2CCCC[C@H]2C(=O)OC@@HCCCc4cccnc4)(F)F,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
MWZOULASPWUGJJ-NFBUACBFSA-N,CCC@HC@@HC(=O)O)NC(=O)C@HCC(=O)NO,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
NNZDBCPMOOEFTE-UHFFFAOYSA-N,CC(C)CN1c2c(c(n(n2)Cc3cccc4c3cccc4)c5ccncc5)C(=O)N(C1=O)C,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
KKTYZYHUPKXLPL-RIQJEONASA-N,Cc1cccc(c1OCC(=O)NC@@HC@@H(C)C)O)C,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
SJWOFBVBNFLWLP-UHFFFAOYSA-N,c1ccc(cc1)C2(CCCC2)CN,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
OEVYDSSAPNIURZ-AEFFLSMTSA-N,c1ccc2c(c1)CC@HC(=O)N,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
DCJGHBWTJFHQCR-UEHMVRIRSA-N,CN(C)c1cccc(c1)CNCC@HCc4ccccc4)Cc5ccccc5)O,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
XVOYSCVBGLVSOL-UWTATZPHSA-N,C(C@HN)S(=O)(=O)O,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
NSZDJRLPCLOQAM-UHFFFAOYSA-N,c1ccn(c1)CCOc2ccc3c(c2)cc[nH]3,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
LGXVKMDGSIWEHL-UHFFFAOYSA-N,Cc1c(c2ccc(cc2o1)Oc3ccnc4c3ccc(c4)OCCN5CCOCC5)C(=O)NC,MGVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE
Hello. Thank you for your interest in our work. I am sorry for the delayed response. I have been occupied with multiple academic responsibilities.
I tried your data on AI-Bind, and it seems that the SMILES strings in your data trigger an rdkit error (attached image).
I would recommend the following (in order):
(i) Check the formatting of the chemical SMILES and their compatibility with rdkit==2022.9.5 (the version used by AI-Bind).
(ii) Try to match the chemical SMILES against a standard chemical database such as PubChem. (I couldn't find these SMILES on PubChem.)
(iii) If your SMILES are compatible with a newer version of rdkit, try retraining VecNet with that rdkit version using this notebook: https://github.com/Barabasi-Lab/AI-Bind/blob/main/VecNet/VecNet-Unseen_Nodes.ipynb.
Hope this helps!
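Recommendation (i) can be checked mechanically before running the model. The following is a minimal sketch, not part of AI-Bind: the helper name find_unparseable and its parse parameter are hypothetical. It relies on RDKit's Chem.MolFromSmiles, which returns None for a string it cannot parse.

```python
from typing import Callable, Iterable, List, Optional


def find_unparseable(smiles_list: Iterable[str],
                     parse: Optional[Callable[[str], object]] = None) -> List[str]:
    """Return the SMILES strings that the parser rejects (parse(s) is None)."""
    if parse is None:
        # Imported lazily so the helper can be exercised without RDKit installed.
        from rdkit import Chem
        parse = Chem.MolFromSmiles
    return [s for s in smiles_list if parse(s) is None]
```

With the CSV loaded into a DataFrame, calling find_unparseable(nodes_df['SMILE']) would list the rows to fix, assuming the column is named SMILE as in the data above.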
Thanks for looking into this issue, Ayan.
Looking at Ayan's screenshot, I would like to provide additional direction:
Your SMILES look like modified isomeric SMILES. Isomeric SMILES encode the stereochemistry of the molecule by denoting the orientation around its stereocenters. The description of the orientation is contained within brackets, [ ], on the stereocenter atom. For example, in the first SMILES of the screenshot:
COc1cc(cc(c1OC)OC)C(C(=O)N2CCCC[C@H]2C(=O)OC@@HCCCC4cccnc4)(F)F
The two stereocenters are [C@H] and C@@H. The first one is in the correct format, as the stereocenter description is within brackets, while the second one is missing its bracket container. It looks like some preparation step in your pipeline is stripping the necessary brackets from the SMILES, which is why rdkit cannot read them.
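The missing-bracket pattern described above can be detected without rdkit. This is a minimal sketch under one assumption: that a chirality mark (@ or @@) is only valid inside a bracket atom. The function name unbracketed_chirality is hypothetical.

```python
from typing import List


def unbracketed_chirality(smiles: str) -> List[int]:
    """Return the positions of '@' characters that sit outside [...] brackets."""
    positions = []
    depth = 0  # number of currently open '[' brackets
    for i, ch in enumerate(smiles):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        elif ch == "@" and depth == 0:
            positions.append(i)
    return positions


# Two strings from this thread: one well-formed bracket atom, one stripped.
for s in ["C[C@H](N)C(=O)O", "c1ccc2c(c1)CC@HC(=O)N"]:
    print(s, unbracketed_chirality(s))
```

An empty list means every chirality mark is bracketed. A non-empty list points at the characters where brackets were lost, but it cannot reconstruct the original atom, so the SMILES should be re-exported from the source rather than patched.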
Hello, thank you for your reply. The SMILES I pasted earlier were incorrect, probably because special characters in them were altered when I pasted them. I apologize for not checking carefully. I can now use my data to test your model.
However, I am still encountering a problem: when I change the input data, the output remains unchanged. I have attached my data and the script I use to pass the data to the model, and I would appreciate it if you could check whether I am passing the input data correctly.
Additionally, could you let me know where I can find or how I can view the output results from the model? I am unsure if the results are being properly generated or stored.
Thank you again for your time and assistance. I look forward to your guidance on these issues.
import sys
sys.path.append('/home/malab21/disk3/xumh/model/AI-Bind-1.1/')
import importlib
from matplotlib.pyplot import figure
from AIBind.import_modules import *
from AIBind import AIBind
importlib.reload(AIBind)
str(subprocess.check_output('nvidia-smi', shell = True)).split('\n')
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
# Load the precomputed drug and target embeddings
with open('data/chemicals_01_w_embed.pkl', 'rb') as file:
    drugs = pkl.load(file)
with open('data/amino_01_w_embed.pkl', 'rb') as file:
    targets = pkl.load(file)
drugs = drugs.rename(columns = {'Label' : 'InChiKey'})
targets = targets.rename(columns = {'Label' : 'target_aa_code'})
targets_test = []
targets_validation = []
edges_test = []
edges_validation = []
train_sets = []
# Read the five train/validation/test splits
for run_number in tqdm(range(5)):
    targets_test.append(pd.read_csv('data/test_unseennodes' + str(run_number) + '.csv'))
    edges_test.append(pd.read_csv('data/test_unseenedges' + str(run_number) + '.csv'))
    targets_validation.append(pd.read_csv('data/validation_unseennodes' + str(run_number) + '.csv'))
    edges_validation.append(pd.read_csv('data/validation_unseenedges' + str(run_number) + '.csv'))
    train_sets.append(pd.read_csv('data/train' + str(run_number) + '.csv'))
vecnet_object = AIBind.AIBind(interactions_location = 'data/Network_Derived_Negatives.csv', interactions = None, interaction_y_name = 'Y',
absolute_negatives_location = None, absolute_negatives = None,
drugs_location = None, drugs_dataframe = drugs, drug_inchi_name = 'InChiKey', drug_smile_name = 'SMILE',
targets_location = None, targets_dataframe = targets, target_seq_name = 'target_aa_code',
mol2vec_location = None, mol2vec_model = None,
protvec_location = None, protvec_model = None,
nodes_test = targets_test, nodes_validation = targets_validation,
edges_test = edges_test, edges_validation = edges_validation,
model_out_dir = 'data/',
debug = False)
# Load the pretrained VecNet object (this replaces the object constructed above)
with open('data/VecNet_unseen_nodes.pickle', 'rb') as file:
    vecnet_object = pkl.load(file)
vecnet_object.mol2vec_location = 'data/model_300dim.pkl'
vecnet_object.protvec_location = 'data/protVec_100d_3grams.csv'
vecnet_object.protvec_model = pd.read_csv('data/protVec_100d_3grams.csv', delimiter = '\t')
vecnet_object.mol2vec_model = word2vec.Word2Vec.load('data/model_300dim.pkl')
# subprocess.check_output('mkdir data/vecnet-final/; cd data/vecnet-final; unzip ../vecnet-final.zip', shell = True)
# None
# Point each saved model path at the local data/ directory
for _model, _path in vecnet_object.model_out_dir.items():
    vecnet_object.model_out_dir[_model] = 'data/' + _path.split('/')[-2] + '/'
vecnet_object.drugs_dataframe = drugs
vecnet_object.targets_dataframe = targets
vecnet_object.get_protvec_embeddings()
vecnet_object.get_mol2vec_embeddings()
nodes_df = pd.read_csv('data/testAI.csv')
unseen_nodes_example_5fold_average = vecnet_object.get_fold_averaged_prediction_results(
    model_name = None,
    version_number = None,
    model_paths = [],
    optimal_validation_model = None,
    test_sets = [nodes_df],
    get_drug_embed = True,
    get_target_embed = True,
    drug_filter_list = [],
    target_filter_list = [],
    return_dataframes = True
)
print(targets_test[1])
In reply to Ayan Chatterjee's message of 2024-11-21 02:51, "Re: [Barabasi-Lab/AI-Bind] prediction results (Issue #5)".
Hello, this is the result of running my script. The binding probabilities in the output are not for the molecule-protein pairs in my input data. Did I pass my data to the model incorrectly?
# Expected to have 'InChiKey', 'SMILE', and 'target_aa_code'
nodes_df = pd.read_csv('some_csv_file_path')
# Example entries
nodes_df['InChiKey'] = ['HUMNYLRZRPPJDN-UHFFFAOYSA-N']
nodes_df['SMILE'] = ['C1=CC=C(C=C1)C=O']
nodes_df['target_aa_code'] = sars_targets['Sequence'].tolist()[0]
unseen_nodes_example_5fold_average = vecnet_object.get_fold_averaged_prediction_results(model_name = None, version_number = None, model_paths = [], optimal_validation_model = None, test_sets = [targets_test[1].dropna()], get_drug_embed = True, get_target_embed = True, drug_filter_list = [], target_filter_list = [], return_dataframes = True )
Hello, I encountered a problem in the prediction step when running your model. After reading my data into nodes_df, I changed test_sets = [targets_test[1].dropna()] in the code above to test_sets = [nodes_df.dropna()]. Why are the values of the result variable unseen_nodes_example_5fold_average the same as before? Does unseen_nodes_example_5fold_average store the prediction results?
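The thread does not state what get_fold_averaged_prediction_results returns, so the following is only a sketch for inspecting the returned object, whatever its type. describe_result is a hypothetical helper, not part of AI-Bind. Note also that the script posted earlier in this thread ends with print(targets_test[1]), which prints one of the original test sets rather than the value returned by the prediction call. That printed output would stay the same regardless of what is passed as test_sets, which may explain the unchanged result.

```python
def describe_result(result, max_items: int = 5) -> str:
    """Return a short text summary of an object of unknown type."""
    lines = [f"type: {type(result).__name__}"]
    # pandas DataFrames and NumPy arrays expose .shape; DataFrames also expose .columns.
    if hasattr(result, "shape"):
        lines.append(f"shape: {result.shape}")
    if hasattr(result, "columns"):
        lines.append(f"columns: {list(result.columns)}")
    if isinstance(result, dict):
        lines.append(f"keys: {list(result)[:max_items]}")
    elif isinstance(result, (list, tuple)):
        lines.append(f"length: {len(result)}")
        # Show the type of the first few elements, e.g. one entry per fold or test set.
        for i, item in enumerate(result[:max_items]):
            lines.append(f"  [{i}] {type(item).__name__}")
    return "\n".join(lines)
```

Calling print(describe_result(unseen_nodes_example_5fold_average)) once with each input and comparing the two summaries would show whether the returned predictions actually differ.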