deepmodeling / Uni-Mol

Official Repository for the Uni-Mol Series Methods

code problem #188

Open coderabbittank opened 10 months ago

coderabbittank commented 10 months ago

I noticed that conformer.py in the data folder of unimol_tools generates 3D molecular conformations, but it seems only one conformation is generated per molecule, whereas the Uni-Mol paper says 11 conformations are generated for each molecule. Where is the relevant code? If it exists, please point it out to me. Thank you.

ZhouGengmo commented 9 months ago

you can refer to this.
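(For context, here is a minimal RDKit sketch of generating several conformers per molecule, e.g. the 11 discussed above; the helper name gen_conformers is illustrative and this is not necessarily the exact routine in conformer.py.)

from rdkit import Chem
from rdkit.Chem import AllChem

def gen_conformers(smiles, num_confs=11, seed=42):
    """Embed multiple 3D conformers for one SMILES with RDKit's ETKDG."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    cids = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, params=params)
    # Optional: relax each embedded conformer with MMFF
    AllChem.MMFFOptimizeMoleculeConfs(mol)
    return mol, list(cids)

mol, cids = gen_conformers('O=C([O-])COc1ccc(Cl)cc1Cl')
print(len(cids))  # up to 11 conformer ids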

coderabbittank commented 9 months ago

Another question: when I use unimol_tools to fine-tune the Tox21 dataset, my code looks like this:

clf = MolTrain(task='multilabel_classification',
               data_type='molecule',
               batch_size=16,
               metrics='auc',
               split='random',
               epochs=20,
               learning_rate=2e-5,
               )
pred = clf.fit(data='./train_data.csv')

clf = MolPredict(load_model='/home/zhuyifeng/gitcode/Uni-Mol-main/unimol_tools/unimol_tools/exp')
res = clf.predict(data='./test_data.csv')

After data processing my training set is as follows:

SMILES,TARGET1,TARGET2,TARGET3,TARGET4,TARGET5,TARGET6,TARGET7,TARGET8,TARGET9,TARGET10,TARGET11,TARGET12
O=C([O-])COc1ccc(Cl)cc1Cl,0,0,0,0,0,0,0,0,0,0,0,0
ClCC(Cl)CCl,0,0,0,0,1,0,0,0,0,0,0,0
Nc1ccn([C@@H]2OC@HC@@HC2(F)F)c(=O)n1,0,0,0,0,0,0,,0,0,0,0,1
CO,0,0,0,0,0,0,0,0,0,0,0,0
FC1(F)C(F)(F)C(F)(F)C2(F)C(F)(C1(F)F)C(F)(F)C(F)(F)C1(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C12F,0,0,0,0,,0,0,0,0,0,0,0
CC1(C)S[C@@H]2C@HC(=O)N2[C@H]1C(=O)OCOC(=O)[C@@H]1N2C(=O)C[C@H]2S(=O)(=O)C1(C)C,0,,0,0,0,0,,,0,,0,
CC@@HC@@Hc1ccccc1,0,0,0,0,0,0,0,0,0,0,0,0
...

My test set is as follows:

SMILES,TARGET1,TARGET2,TARGET3,TARGET4,TARGET5,TARGET6,TARGET7,TARGET8,TARGET9,TARGET10,TARGET11,TARGET12
O=C(O)CNC(=O)c1ccc(N+[O-])cc1,0,0,0,0,1,0,0,0,0,0,0,0
COC(=O)c1ccccc1C(=O)OC,0,0,0,0,0,0,0,0,0,0,0,0
COc1cc2nc(N3CCN(C(=O)c4ccco4)CC3)nc(N)c2cc1OC,0,,,,0,0,0,0,,0,0,
Cc1cccc(N(C)C(=S)Oc2ccc3c(c2)C2CCC3C2)c1,,,,,,,,,,0,,
CCN(CC)C(=O)[C@]1(c2ccccc2)C[C@@H]1CN,0,0,0,0,0,0,0,,0,,0,0
SCCSCCS,0,0,0,0,0,0,0,0,0,0,0,0
NC@@Hc(I)c1)C(=O)O,0,0,0,,0,0,1,0,0,0,0,0
CCOC(=O)CC#N,0,0,0,0,0,0,0,0,0,0,0,0
Cc1ccc(Nc2nccc(N(C)c3ccc4c(C)n(C)nc4c3)n2)cc1S(N)(=O)=O,0,0,,,,0,,,,0,,1
c1cc(C(c2ccc(OCC3CO3)cc2)c2ccc(OCC3CO3)cc2)ccc1OCC1CO1,0,,0,,,0,,1,,1,,0

1. During fine-tuning, the training loss sometimes becomes a very small negative number, and an error is reported: ValueError: multi_class must be in ('ovo', 'ovr'). How can I solve this? (See the sketch after this list.)

2. Before this I fine-tuned BBBP, BACE, and ClinTox. The 'auroc' values reported by the code are 0.737, 0.85, and 0.868, which differ from the 0.729, 0.857, and 0.919 reported in the Uni-Mol paper. The first two classification tasks are almost the same as the paper's results, but the last one shows a large gap. Is this because of the hyperparameters?

3. When using the SIDER dataset, generating molecular conformations during training is very slow and appears to hang.

4. unimol_tools performs 5-fold cross-validation on the training set, which differs from the standard 0.8/0.1/0.1 benchmark protocol. The standard practice is to divide the dataset 0.8/0.1/0.1 into training, validation, and test sets, train on the training set, select on the validation set, and then evaluate on the test set; 5-fold cross-validation therefore affects the results. What I did was: after splitting the original dataset 0.8/0.1/0.1, I passed the 0.8 training split and the 0.1 validation split together as the input to MolTrain, and then tested with MolPredict on the 0.1 test split. This approach may not be correct, and I would appreciate any suggestions or the correct approach.
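Regarding question 1: the message ValueError: multi_class must be in ('ovo', 'ovr') is what sklearn's roc_auc_score raises when a label column it receives does not look strictly binary (for example, when missing cells end up encoded as a third value alongside 0 and 1), so the empty Tox21 labels are a plausible culprit. As a quick diagnostic outside unimol_tools, a per-target AUC that masks missing entries can be computed roughly like this (the helper name masked_multilabel_auc is my own):

import numpy as np
from sklearn.metrics import roc_auc_score

def masked_multilabel_auc(y_true, y_prob):
    """Average per-target AUC, skipping NaN labels and single-class columns."""
    aucs = []
    for k in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, k])
        # roc_auc_score needs both classes present in the masked column
        if mask.any() and len(np.unique(y_true[mask, k])) == 2:
            aucs.append(roc_auc_score(y_true[mask, k], y_prob[mask, k]))
    return float(np.mean(aucs))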

Here is the code to fine-tune the regression task:

from unimol_tools import MolTrain, MolPredict
from scaffold import load_data, scaffold_split, split_data
import pandas as pd
import csv

train_data_full = pd.read_csv('/gitcode/Uni-Mol-main/unimol_tools/unimol_tools/MoleculeNet/freesolv.csv')
label_count = train_data_full.shape[1]  # SMILES column + target columns

# Rename columns to the SMILES / TARGETi layout expected by unimol_tools
target_columns = ["TARGET{}".format(i) for i in range(1, label_count)]
train_data_full.columns = ["SMILES"] + target_columns
train_data_full.to_csv("./mol_train_full.csv", index=False)

data = load_data("./mol_train_full.csv")

# 0.8 / 0.1 / 0.1 random split with a fixed seed
train_data, val_data, test_data = split_data(data, 'random', [0.8, 0.1, 0.1], 42)

csv_file1 = 'train_data.csv'
csv_file2 = 'test_data.csv'
csv_columns = ["SMILES"] + target_columns

# Train and validation splits are merged and handed to MolTrain together;
# the held-out 0.1 test split goes to MolPredict below.
smile_list1 = train_data.smile() + val_data.smile()
label_list1 = train_data.label() + val_data.label()

smile_list2 = test_data.smile()
label_list2 = test_data.label()


def write_csv(path, smiles, labels):
    """Write one split as a SMILES + TARGETi CSV; missing labels become empty cells."""
    with open(path, 'w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=csv_columns)
        writer.writeheader()
        for smile, label in zip(smiles, labels):
            row = {'SMILES': smile}
            row.update({'TARGET{}'.format(i + 1): l for i, l in enumerate(label)})
            writer.writerow(row)


write_csv(csv_file1, smile_list1, label_list1)
write_csv(csv_file2, smile_list2, label_list2)

print(len(train_data) + len(val_data))
print(len(test_data))

clf = MolTrain(task='regression',
               data_type='molecule',
               batch_size=16,
               metrics='mse',
               split='random',
               epochs=20,
               learning_rate=5e-5,
               )
pred = clf.fit(data='./train_data.csv')
# currently supports SMILES-based csv/txt files, and a
# custom dict of {'atoms': [['C','C'], ['C','H','O']], 'coordinates': [coordinates_1, coordinates_2]}

clf = MolPredict(load_model='/Uni-Mol-main/unimol_tools/unimol_tools/exp')
res = clf.predict(data='./test_data.csv')
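To score the held-out split, the predictions can then be compared with the labels written to test_data.csv; a minimal sketch, assuming res comes back row-aligned with the CSV:

import numpy as np
import pandas as pd

test_df = pd.read_csv('./test_data.csv')
y_true = test_df[target_columns].to_numpy(dtype=float)
y_pred = np.asarray(res).reshape(y_true.shape)  # assumes row-aligned predictions

rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
print('test RMSE:', rmse)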