Open coderabbittank opened 10 months ago
you can refer to this.
Another question: when I use unimol_tools to fine-tune on the tox21 dataset, my code looks like this:
clf = MolTrain(task='multilabel_classification',
               data_type='molecule',
               batch_size=16,
               metrics='auc',
               split='random',
               epochs=20,
               learning_rate=2e-5,
               )
pred = clf.fit(data = './train_data.csv')
clf = MolPredict(load_model='/home/zhuyifeng/gitcode/Uni-Mol-main/unimol_tools/unimol_tools/exp')
res = clf.predict(data = './test_data.csv')
After data processing my training set is as follows:
SMILES,TARGET1,TARGET2,TARGET3,TARGET4,TARGET5,TARGET6,TARGET7,TARGET8,TARGET9,TARGET10,TARGET11,TARGET12
O=C([O-])COc1ccc(Cl)cc1Cl,0,0,0,0,0,0,0,0,0,0,0,0
ClCC(Cl)CCl,0,0,0,0,1,0,0,0,0,0,0,0
Nc1ccn([C@@H]2OC@HC@@HC2(F)F)c(=O)n1,0,0,0,0,0,0,,0,0,0,0,1
CO,0,0,0,0,0,0,0,0,0,0,0,0
FC1(F)C(F)(F)C(F)(F)C2(F)C(F)(C1(F)F)C(F)(F)C(F)(F)C1(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C12F,0,0,0,0,,0,0,0,0,0,0,0
CC1(C)S[C@@H]2C@HC(=O)N2[C@H]1C(=O)OCOC(=O)[C@@H]1N2C(=O)C[C@H]2S(=O)(=O)C1(C)C,0,,0,0,0,0,,,0,,0,
CC@@HC@@Hc1ccccc1,0,0,0,0,0,0,0,0,0,0,0,0
...
My test set is as follows:
SMILES,TARGET1,TARGET2,TARGET3,TARGET4,TARGET5,TARGET6,TARGET7,TARGET8,TARGET9,TARGET10,TARGET11,TARGET12
O=C(O)CNC(=O)c1ccc(N+[O-])cc1,0,0,0,0,1,0,0,0,0,0,0,0
COC(=O)c1ccccc1C(=O)OC,0,0,0,0,0,0,0,0,0,0,0,0
COc1cc2nc(N3CCN(C(=O)c4ccco4)CC3)nc(N)c2cc1OC,0,,,,0,0,0,0,,0,0,
Cc1cccc(N(C)C(=S)Oc2ccc3c(c2)C2CCC3C2)c1,,,,,,,,,,0,,
CCN(CC)C(=O)[C@]1(c2ccccc2)C[C@@H]1CN,0,0,0,0,0,0,0,,0,,0,0
SCCSCCS,0,0,0,0,0,0,0,0,0,0,0,0
NC@@Hc(I)c1)C(=O)O,0,0,0,,0,0,1,0,0,0,0,0
CCOC(=O)CC#N,0,0,0,0,0,0,0,0,0,0,0,0
Cc1ccc(Nc2nccc(N(C)c3ccc4c(C)n(C)nc4c3)n2)cc1S(N)(=O)=O,0,0,,,,0,,,,0,,1
c1cc(C(c2ccc(OCC3CO3)cc2)c2ccc(OCC3CO3)cc2)ccc1OCC1CO1,0,,0,,,0,,1,,1,,0
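Because many label cells in these files are empty, it may help to count the values that actually occur in each target column before training. The following is a minimal sketch using only the Python standard library. `label_summary` is a hypothetical helper name, and the three-column sample is made up to mirror the layout of the files above, not taken from Tox21.

```python
import csv
import io
from collections import Counter

def label_summary(csv_text):
    """Count the values seen in each TARGET column; empty cells are counted as 'missing'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    summary = {}
    for row in reader:
        for name, value in row.items():
            if name == "SMILES":
                continue
            # Empty strings (and absent trailing fields) are treated as missing labels.
            key = value if value else "missing"
            summary.setdefault(name, Counter())[key] += 1
    return summary

# Made-up sample in the same layout as train_data.csv (illustration only).
sample = """SMILES,TARGET1,TARGET2,TARGET3
CO,0,0,0
ClCC(Cl)CCl,0,,1
SCCSCCS,,,0
"""

summary = label_summary(sample)
for name, counts in summary.items():
    print(name, dict(counts))
```

Any value other than `0`, `1`, or `missing` in such a summary would be worth investigating before fitting a multilabel classifier.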
1. During fine-tuning, the loss takes a very small negative value, and an error is reported: ValueError: multi_class must be in ('ovo', 'ovr'). How can I solve this?
2. Before this I fine-tuned on BBBP, BACE, and ClinTox. The 'auroc' values from my runs are 0.737, 0.85, and 0.868, compared with 0.729, 0.857, and 0.919 in the Uni-Mol paper. The first two are close to the paper's results, but the last one differs considerably. Is this because of the hyperparameters?
3. With the SIDER dataset, generating molecular conformations during training is very slow and seems to hang.
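On the first question: the text of that ValueError matches what scikit-learn's `roc_auc_score` raises when it decides the target is multiclass and `multi_class` is left at its default, so one thing worth checking is whether missing labels reach the metric as a third value. A common approach for sparsely labeled multilabel sets is to score each task only on the rows where its label is present. The sketch below illustrates that idea in plain Python. `binary_auc` and `masked_mean_auc` are hypothetical helpers, not part of unimol_tools, and `None` is assumed to mark a missing label.

```python
def binary_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def masked_mean_auc(y_true, y_score):
    """Average the per-task AUC, ignoring rows whose label for that task is None."""
    n_tasks = len(y_true[0])
    aucs = []
    for t in range(n_tasks):
        kept = [(ys[t], ss[t]) for ys, ss in zip(y_true, y_score) if ys[t] is not None]
        auc = binary_auc([y for y, _ in kept], [s for _, s in kept])
        if auc is not None:  # skip tasks where the AUC is undefined
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else None

# Made-up labels and scores for two tasks; None marks a missing label.
y_true = [[0, 1], [1, None], [0, 0], [1, 1]]
y_score = [[0.2, 0.9], [0.7, 0.5], [0.4, 0.3], [0.3, 0.8]]
print(masked_mean_auc(y_true, y_score))
```

The pairwise loop is quadratic in the number of samples, which is fine for a sanity check but not for large evaluation sets.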
Here is the code to fine-tune the regression task:
from unimol_tools import MolTrain, MolPredict
from scaffold import load_data,scaffold_split,split_data
import pandas as pd
import csv
train_data_full = pd.read_csv('/gitcode/Uni-Mol-main/unimol_tools/unimol_tools/MoleculeNet/freesolv.csv')
label_count = train_data_full.head(1).shape[1]
target_columns = ["TARGET{}".format(i) for i in range(1, label_count)]
train_data_full.columns = ["SMILES"] + target_columns
train_data_full.to_csv("./mol_train_full.csv", index=False)
data = load_data("./mol_train_full.csv")
train_data, val_data, test_data = split_data(data,'random',[0.8,0.1,0.1],42)
csv_file1 = 'train_data.csv'
csv_columns = ["SMILES"] + target_columns
csv_file2 = 'test_data.csv'
smile_list1 = train_data.smile() + val_data.smile()
label_list1 = train_data.label() + val_data.label()
smile_list2 = test_data.smile()
label_list2 = test_data.label()
with open(csv_file1, 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=csv_columns)
    writer.writeheader()
    for smile, label in zip(smile_list1, label_list1):
        row = {'SMILES': smile}
        row.update({'TARGET{}'.format(i+1): l if l is not None else None for i, l in enumerate(label)})
        writer.writerow(row)
with open(csv_file2, 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=csv_columns)
    writer.writeheader()
    for smile, label in zip(smile_list2, label_list2):
        row = {'SMILES': smile}
        row.update({'TARGET{}'.format(i+1): l if l is not None else None for i, l in enumerate(label)})
        writer.writerow(row)
print(len(train_data)+len(val_data))
print(len(test_data))
clf = MolTrain(task='regression',
               data_type='molecule',
               batch_size=16,
               metrics='mse',
               split='random',
               epochs=20,
               learning_rate=5e-5,
               )
pred = clf.fit(data = './train_data.csv')
# currently support data with smiles based csv/txt file, and
# custom dict of {'atoms':[['C','C'],['C','H','O']], 'coordinates':[coordinates_1,coordinates_2]}
clf = MolPredict(load_model='/Uni-Mol-main/unimol_tools/unimol_tools/exp')
res = clf.predict(data = './test_data.csv')
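The two nearly identical `with` blocks in the script above could be folded into one helper. This is only a sketch: `write_split` is a hypothetical name, the SMILES and label values below are made up for illustration, and an in-memory buffer stands in for the real output files.

```python
import csv
import io

def write_split(file, smiles, labels, columns):
    """Write one SMILES/label split as CSV; None labels become empty cells."""
    writer = csv.DictWriter(file, fieldnames=columns)
    writer.writeheader()
    for smile, label in zip(smiles, labels):
        row = {"SMILES": smile}
        row.update({f"TARGET{i + 1}": value for i, value in enumerate(label)})
        writer.writerow(row)

# Illustrative data only; in the script above these would come from split_data().
columns = ["SMILES", "TARGET1"]
buffer = io.StringIO()
write_split(buffer, ["CO", "CCO"], [[-5.1], [None]], columns)
print(buffer.getvalue())
```

With real files, the same function would be called once per split, passing a handle opened with `open(path, 'w', newline='')`.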
I noticed that conformer.py in the data folder of unimol_tools generates the 3D molecular conformations, but it seems to generate only one conformation per molecule, whereas the Uni-Mol paper says that 11 conformations are generated for each molecule. Where is the relevant code? If it exists, please point me to it. Thank you.