deepmodeling / Uni-Mol

Official Repository for the Uni-Mol Series Methods
MIT License

About an error in multi-task regression #199

Open yxnyu opened 6 months ago

yxnyu commented 6 months ago

Hi,

Thanks for your wonderful work. I would like to ask a question about an error in multi-task regression in Uni-Mol.

My label file (X = SMILES, Y = targets) has a header like:

```
SMILES,TARGET_0,TARGET_1,TARGET_2,TARGET_3,TARGET_4,TARGET_5,TARGET_6,TARGET_7,TARGET_8,TARGET_9,TARGET_10,TARGET_11,TARGET_12,TARGET_13,TARGET_14,TARGET_15,TARGET_16,TARGET_17,TARGET_18,TARGET_19,TARGET_20,TARGET_21,TARGET_22,TARGET_23,TARGET_24,TARGET_25,TARGET_26,TARGET_27,TARGET_28,TARGET_29,TARGET_30,TARGET_31,TARGET_32,TARGET_33,TARGET_34,TARGET_35,TARGET_36,TARGET_37,TARGET_38,TARGET_39,TARGET_40,TARGET_41,TARGET_42,TARGET_43,TARGET_44,TARGET_45,TARGET_46,TARGET_47,TARGET_48,TARGET_49,TARGET_50,TARGET_51,TARGET_52,TARGET_53,TARGET_54,TARGET_55,TARGET_56,TARGET_57,TARGET_58,TARGET_59,TARGET_60,TARGET_61,TARGET_62,TARGET_63,TARGET_64,TARGET_65,TARGET_66,TARGET_67,TARGET_68,TARGET_69,TARGET_70,TARGET_71,TARGET_72,TARGET_73,TARGET_74,TARGET_75,TARGET_76,TARGET_77,TARGET_78,TARGET_79,TARGET_80,TARGET_81,TARGET_82,TARGET_83,TARGET_84,TARGET_85,TARGET_86,TARGET_87,TARGET_88,TARGET_89,TARGET90,TA
```

There are 140k SMILES and 700+ labels. However, when I run it on Bohrium, Uni-Mol reports:

[screenshot]

but when I check my dataset:

[screenshot]

I also tried converting my dataset to float16 and running Uni-Mol, but the result is the same.
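
Incidentally, a float16 cast can itself introduce infinities, since float16 overflows above 65504; a one-line check (generic numpy, offered as an aside rather than anything Uni-Mol does):

```python
import numpy as np

# float16's largest finite value is 65504; anything bigger becomes inf.
print(np.isinf(np.float16(70000.0)))  # True
```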

I carefully checked the output: the training is fine, while the validation is not.

[screenshot]

The full output is:

```
2023-12-25 18:01:23 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
... (the same message repeated 19 times, 18:01:23 to 18:01:27) ...
2023-12-25 18:01:32 | unimol/data/conformer.py | 62 | INFO | Uni-Mol(QSAR) | Start generating conformers...
129817it [09:26, 228.99it/s]
2023-12-25 18:11:00 | unimol/data/conformer.py | 66 | INFO | Uni-Mol(QSAR) | Failed to generate conformers for 0.00% of molecules.
2023-12-25 18:11:00 | unimol/data/conformer.py | 68 | INFO | Uni-Mol(QSAR) | Failed to generate 3d conformers for 8.93% of molecules.
2023-12-25 18:11:00 | unimol/train.py | 88 | INFO | Uni-Mol(QSAR) | Output directory already exists: ./uv
2023-12-25 18:11:00 | unimol/train.py | 89 | INFO | Uni-Mol(QSAR) | Warning: Overwrite output directory: ./uv
2023-12-25 18:11:01 | unimol/models/unimol.py | 116 | INFO | Uni-Mol(QSAR) | Loading pretrained weights from /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/weights/mol_pre_all_h_220816.pt
2023-12-25 18:11:01 | unimol/models/nnmodel.py | 103 | INFO | Uni-Mol(QSAR) | start training Uni-Mol:unimolv1
val: 100%|██████████| 51/51 [00:14<00:00, 4.01it/s, Epoch=Epoch 1/20, loss=1.0418]
```

```
ValueError                                Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 clf.fit('/personal/updated_combined_data_uv2_float16.csv')

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/train.py:56, in MolTrain.fit(self, data)
     54 self.trainer = Trainer(save_path=self.save_path, **self.config)
     55 self.model = NNModel(self.data, self.trainer, **self.config)
---> 56 self.model.run()
     57 scalar = self.data['target_scaler']
     58 y_pred = self.model.cv['pred']

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/models/nnmodel.py:120, in NNModel.run(self)
    117 if fold > 0:
    118     # need to initalize model for next fold training
    119     self.model = self._init_model(**self.model_params)
--> 120 _y_pred = self.trainer.fit_predict(
    121     self.model, traindataset, validdataset, self.loss_func, self.activation_fn, self.save_path, fold, self.target_scaler)
    122 y_pred[te_idx] = _y_pred
    124 if 'multiclass_cnt' in self.data:

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/tasks/trainer.py:157, in Trainer.fit_predict(self, model, train_dataset, valid_dataset, loss_func, activation_fn, dump_dir, fold, target_scaler, feature_name)
    154 batch_bar.close()
    155 total_trn_loss = np.mean(trn_loss)
--> 157 y_preds, val_loss, metric_score = self.predict(
    158     model, valid_dataset, loss_func, activation_fn, dump_dir, fold, target_scaler, epoch, load_model=False, feature_name=feature_name)
    159 end_time = time.time()
    160 total_val_loss = np.mean(val_loss)

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/tasks/trainer.py:254, in Trainer.predict(self, model, dataset, loss_func, activation_fn, dump_dir, fold, target_scaler, epoch, load_model, feature_name)
    252     inverse_y_preds = target_scaler.inverse_transform(y_preds)
    253     inverse_y_truths = target_scaler.inverse_transform(y_truths)
--> 254     metric_score = self.metrics.cal_metric(
    255         inverse_y_truths, inverse_y_preds, label_cnt=label_cnt) if not load_model else None
    256 else:
    257     metric_score = self.metrics.cal_metric(
    258         y_truths, y_preds, label_cnt=label_cnt) if not load_model else None

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:197, in Metrics.cal_metric(self, label, predict, nan_value, threshold, label_cnt)
    195 def cal_metric(self, label, predict, nan_value=-1.0, threshold=0.5, label_cnt=None):
    196     if self.task in ['regression', 'multilabel_regression']:
--> 197         return self.cal_reg_metric(label, predict, nan_value)
    198     elif self.task in ['classification', 'multilabel_classification']:
    199         return self.cal_classification_metric(label, predict, nan_value)

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:175, in Metrics.cal_reg_metric(self, label, predict, nan_value)
    172 metric, _, _ = metric_value
    173 def nan_metric(label, predict): return cal_nan_metric(
    174     label, predict, nan_value, metric)
--> 175 res_dict[metric_type] = nan_metric(label, predict)
    177 return res_dict

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:173, in Metrics.cal_reg_metric.<locals>.nan_metric(label, predict)
--> 173 def nan_metric(label, predict): return cal_nan_metric(
    174     label, predict, nan_value, metric)

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:49, in cal_nan_metric(y_true, y_pred, nan_value, metric_func)
     47 _mask = mask[:, i]
     48 if not (~_mask).all():
---> 49     result.append(metric_func(
     50         y_true[:, i][_mask], y_pred[:, i][_mask]))
     51 return np.mean(result)

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:63, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     61 extra_args = len(args) - len(all_args)
     62 if extra_args <= 0:
---> 63     return f(*args, **kwargs)
     65 # extra_args > 0
     66 args_msg = ['{}={}'.format(name, arg)
     67             for name, arg in zip(kwonly_args[:extra_args],
     68                                  args[-extra_args:])]

File /opt/conda/lib/python3.8/site-packages/sklearn/metrics/_regression.py:335, in mean_squared_error(y_true, y_pred, sample_weight, multioutput, squared)
    274 @_deprecate_positional_args
    275 def mean_squared_error(y_true, y_pred, *,
    276                        sample_weight=None,
    277                        multioutput='uniform_average', squared=True):
    278     """Mean squared error regression loss.
    279
    280     Read more in the :ref:`User Guide <mean_squared_error>`.
    (...)
    333     0.825...
    334     """
--> 335     y_type, y_true, y_pred, multioutput = _check_reg_targets(
    336         y_true, y_pred, multioutput)
    337     check_consistent_length(y_true, y_pred, sample_weight)
    338     output_errors = np.average((y_true - y_pred) ** 2, axis=0,
    339                                weights=sample_weight)

File /opt/conda/lib/python3.8/site-packages/sklearn/metrics/_regression.py:89, in _check_reg_targets(y_true, y_pred, multioutput, dtype)
     55 """Check that y_true and y_pred belong to the same regression task.
     56
     57 Parameters
    (...)
     86     the dtype argument passed to check_array.
     87 """
     88 check_consistent_length(y_true, y_pred)
---> 89 y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
     90 y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
     92 if y_true.ndim == 1:

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:63, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
     61 extra_args = len(args) - len(all_args)
     62 if extra_args <= 0:
---> 63     return f(*args, **kwargs)
     65 # extra_args > 0
     66 args_msg = ['{}={}'.format(name, arg)
     67             for name, arg in zip(kwonly_args[:extra_args],
     68                                  args[-extra_args:])]

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:720, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    716     raise ValueError("Found array with dim %d. %s expected <= 2."
    717                      % (array.ndim, estimator_name))
    719 if force_all_finite:
--> 720     _assert_all_finite(array,
    721                        allow_nan=force_all_finite == 'allow-nan')
    723 if ensure_min_samples > 0:
    724     n_samples = _num_samples(array)

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:103, in _assert_all_finite(X, allow_nan, msg_dtype)
    100 if (allow_nan and np.isinf(X).any() or
    101         not allow_nan and not np.isfinite(X).all()):
    102     type_err = 'infinity' if allow_nan else 'NaN, infinity'
--> 103     raise ValueError(
    104         msg_err.format
    105         (type_err,
    106          msg_dtype if msg_dtype is not None else X.dtype)
    107     )
    108 # for object dtype data, we only check for NaNs (GH-13254)
    109 elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
```

I am wondering what I can do. I have carefully cleaned and checked my dataset and made sure it is fine, but I do not know whether Uni-Mol can handle such a big dataset.
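
As a quick sanity check, a generic pandas/numpy sketch (not Uni-Mol code; the column names follow the header above) to confirm the raw labels are still finite after a float32 cast:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('/personal/updated_combined_data_uv2_float16.csv')
targets = df.drop(columns=['SMILES']).to_numpy(dtype=np.float32)

# NaNs are expected in a sparse multi-task label matrix,
# but infinities (e.g. from a low-precision cast) are not.
print('inf cells:', int(np.isinf(targets).sum()))
print('all-NaN columns:', int(np.isnan(targets).all(axis=0).sum()))
```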

KaiChen-lr commented 6 months ago

I had the same problem.

Naplessss commented 6 months ago

Hi, you can use the latest image (unimol-qsar:v0.5); the multi-task regression bugs should be fixed in the latest version.

yxnyu commented 6 months ago

> Hi, you can use the latest image (unimol-qsar:v0.5); the multi-task regression bugs should be fixed in the latest version.

Thank you for your reply. I just tried it, but it still fails. I used several methods to investigate this problem. I found that when I set all my values to zero, Uni-Mol can run the validation. I also tried a truncated version of my dataset with 4,000 molecules, but that fails as well. I found that when Bohrium shows the following,

```
2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
... (the same message repeated 7 times) ...
```

[screenshots]

the validation does not work. I believe all my data is float32, so maybe the problem is in the datascaler? I have no idea how to solve it.

[screenshot]

yxnyu commented 6 months ago

Here is a CSV link to the truncated version: https://drive.google.com/file/d/1HmEPKFl6Vn5r6AY9_OzYTrrXR9CA2lXX/view?usp=drive_link

HongshuaiWang1 commented 6 months ago

After our testing, it has been confirmed that v0.5 has removed the power transformer. Please check the image you selected again. In addition, we tested this data and found that it is very sparse, so we recommend changing the targetscaler to 'none' to avoid value overflow during the std standardization process.
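
For intuition, a minimal numpy sketch (an illustration, not Uni-Mol's datascaler code) of how standard scaling can overflow on sparse targets: a column whose labelled values are (nearly) constant has std close to 0, and (x - mean) / std then yields inf/NaN, which is exactly what sklearn's finiteness check rejects later:

```python
import numpy as np

# One sparse target column: every labelled entry happens to be the same value.
y = np.zeros((100_000, 1), dtype=np.float32)

mean, std = y.mean(axis=0), y.std(axis=0)   # std == 0.0
with np.errstate(invalid='ignore'):
    scaled = (y - mean) / std               # 0 / 0 -> NaN for every row
print(np.isfinite(scaled).all())            # False
```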

yxnyu commented 6 months ago

> After our testing, it has been confirmed that v0.5 has removed the power transformer. Please check the image you selected again. In addition, we tested this data and found that it is very sparse, so we recommend changing the targetscaler to 'none' to avoid value overflow during the std standardization process.

Thanks for the information. I also think the sparsity has an impact on Uni-Mol. However, I just tried setting the targetscaler to 'none', as shown:

[screenshots]

And the result seems the same? If you have tested it successfully, would you share some tips or settings?

I confirmed that v0.5 was successfully selected.

yxnyu commented 6 months ago

Thanks! I used `target_normalize='none'` and the training is OK!

HongshuaiWang1 commented 6 months ago

> After our testing, it has been confirmed that v0.5 has removed the power transformer. Please check the image you selected again. In addition, we tested this data and found that it is very sparse, so we recommend changing the targetscaler to 'none' to avoid value overflow during the std standardization process.
>
> Thanks for the information. I also think the sparsity has an impact on Uni-Mol. However, I just tried setting the targetscaler to 'none' (see the screenshots above), and the result seems the same? If you have tested it successfully, would you share some tips or settings?
>
> I confirmed that v0.5 was successfully selected.

Sorry, the actual name of the targetscaler parameter in the interface is `target_normalize`; you can change it to 'none'. I think it will work:

```python
reg = MolTrain(
    ......,
    target_normalize='none',
    ......,
)
```
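
For completeness, a minimal end-to-end sketch of the suggested fix; the import path, the `fit` call, and the `'multilabel_regression'` task name follow the traceback above, while passing `task=` to the constructor is an assumption:

```python
from unimol import MolTrain  # package layout as in the traceback above

reg = MolTrain(
    task='multilabel_regression',  # multi-target regression, per the metrics code
    target_normalize='none',       # skip target scaling on sparse labels
)
reg.fit('/personal/updated_combined_data_uv2_float16.csv')
```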

We are also trying to develop a feature that automatically handles overflow values, to cover this situation better.

Naplessss commented 6 months ago

BTW, it's possible to enhance your data preprocessing by incorporating domain expertise; this may involve manual normalization of the target variable, anomaly detection, and other specialized techniques. Uni-Mol has some automatic preprocessing strategies, but they do not cover every case. One concrete option is sketched below.
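
A possible manual preprocessing step, sketched in pandas (the file name is a placeholder, the column prefix follows the header above, and the percentile cutoffs are arbitrary): winsorize each target column so a handful of extreme values cannot overflow the scaler or the loss:

```python
import pandas as pd

df = pd.read_csv('train.csv')  # placeholder path
target_cols = [c for c in df.columns if c.startswith('TARGET_')]

# Clip each target to its 1st-99th percentile; quantile() ignores NaNs
# and clip() leaves them untouched, so sparsity is preserved.
for c in target_cols:
    lo, hi = df[c].quantile([0.01, 0.99])
    df[c] = df[c].clip(lo, hi)

df.to_csv('train_clipped.csv', index=False)
```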

yxnyu commented 6 months ago

Thanks! And for such a big dataset, I am wondering whether the Uni-Mol tool has multi-GPU parameters?