ahmedmalaa / AutoPrognosis

Codebase for "AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization", ICML 2018.

R runtime error: dim(X) must have a positive length #6

Closed mojgan-ph closed 4 years ago

mojgan-ph commented 4 years ago

Hi Ahmed,

I have managed to install and run AutoPrognosis on the sample data that you used for the tutorial, but it gives me an error on my own dataset. The error I get follows. Do you have any suggestions?

I would also like to know if your UCLA email address is still valid. I sent you an email to that address around a month ago. Have you seen it?

Best, Mojgan

```
RRuntimeError                             Traceback (most recent call last)
<ipython-input> in <module>
      6                              acquisition_type=acquisition_type)
      7
----> 8 AP_mdl.fit(X_, Y_)

~/.../AutoPrognosis-master/alg/autoprognosis/model.py in fit(self, X, Y)
    461         rval_model = self.get_model(self.domains_[u], u, self.compons_, x_next[u])
    462
--> 463         y_next_, modb_, eva_prp = self.evaluate_CV_objective(X.copy(), Y.copy(), rval_model)
    464
    465         eva_prp['iter'] = current_iter

~/.../AutoPrognosis-master/alg/autoprognosis/model.py in evaluate_CV_objective(self, X_in, Y_in, modraw_)
    343         #mod_back.fit(X_in, Y_in)
    344
--> 345         rval_eva = evaluate_clf(X_in.copy(), Y_in.copy(), copy.deepcopy(modraw_), n_folds = self.CV)
    346         logger.info('CV_objective:{}'.format(rval_eva))
    347         f = -1*rval_eva[0][0]

~/.../AutoPrognosis-master/alg/autoprognosis/model.py in evaluate_clf(X, Y, model_input, n_folds, visualize)
    936         if is_pred_proba:
    937             logger.info('+fit {} {}'.format(X_train.shape, list(set(np.ravel(Y_train)))))
--> 938             model.fit(X_train, Y_train)
    939             preds = model.predict(X_test)
    940             nnan = sum(np.ravel(np.isnan(preds)))

~/.../AutoPrognosis-master/alg/autoprognosis/pipelines/basePipeline.py in fit(self, X, Y, **kwargs)
    108         if hasattr(self.model_list[u], 'fit_transform'):  # This should be just a transform
    109
--> 110             X_temp = np.array(self.model_list[u].fit_transform(X_temp)).copy()
    111
    112         else:

~/.../AutoPrognosis-master/alg/autoprognosis/models/imputers.py in fit_transform(self, X)
    294     def fit_transform(self, X):
    295
--> 296         return self.model.fit(X)
    297
    298     def get_hyperparameter_space(self):

~/.../AutoPrognosis-master/alg/autoprognosis/models/imputers.py in fit(self, X)
    240         self.init_r_sytem()
    241
--> 242         r(r_command)
    243         X = r.X
    244

~/.../lib/python3.7/site-packages/rpy2/robjects/__init__.py in __call__(self, string)
    387     def __call__(self, string):
    388         p = _rparse(text=StrSexpVector((string,)))
--> 389         res = self.eval(p)
    390         return conversion.rpy2py(res)
    391

~/.../lib/python3.7/site-packages/rpy2/robjects/functions.py in __call__(self, *args, **kwargs)
    190             kwargs[r_k] = v
    191         return (super(SignatureTranslatedFunction, self)
--> 192                 .__call__(*args, **kwargs))
    193
    194

~/.../lib/python3.7/site-packages/rpy2/robjects/functions.py in __call__(self, *args, **kwargs)
    119             else:
    120                 new_kwargs[k] = conversion.py2rpy(v)
--> 121         res = super(Function, self).__call__(*new_args, **new_kwargs)
    122         res = conversion.rpy2py(res)
    123         return res

~/.../lib/python3.7/site-packages/rpy2/rinterface_lib/conversion.py in _(*args, **kwargs)
     26 def _cdata_res_to_rinterface(function):
     27     def _(*args, **kwargs):
---> 28         cdata = function(*args, **kwargs)
     29         # TODO: test cdata is of the expected CType
     30         return _cdata_to_rinterface(cdata)

~/.../lib/python3.7/site-packages/rpy2/rinterface.py in __call__(self, *args, **kwargs)
    783                     error_occured))
    784         if error_occured[0]:
--> 785             raise embedded.RRuntimeError(_rinterface._geterrmessage())
    786         return res
    787

RRuntimeError: Error in apply(is.na(xmis), 2, sum) : dim(X) must have a positive length
Calls: -> -> missForest -> apply
```
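For context, the final R error comes from missForest's first step, `apply(is.na(xmis), 2, sum)`, which requires its input to have a `dim` attribute (i.e. to be a matrix or data frame); a plain R vector has none, hence "dim(X) must have a positive length". A rough numpy analogue of the same dimensionality failure (illustrative only, not AutoPrognosis code):

```python
import numpy as np

# A 2-D array supports per-column reductions, just like R's apply(..., 2, sum):
x_2d = np.array([[1.0, np.nan], [2.0, 3.0]])
print(np.isnan(x_2d).sum(axis=0))  # per-column NaN counts: [0 1]

# A 1-D array (the analogue of an R vector with NULL dim) has no second axis,
# so the same column-wise reduction fails:
x_1d = np.array([1.0, np.nan])
try:
    np.isnan(x_1d).sum(axis=1)
except IndexError as e:  # numpy's AxisError subclasses IndexError
    print("failed:", e)
```

This suggests that somewhere in the Python-to-R hand-off the data arrived as a vector rather than a data frame.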
ahmedmalaa commented 4 years ago

Hi mojgan-ph,

Can you please let me know what is the data type and dimensions for the input variable X_?

Thanks.

mojgan-ph commented 4 years ago

It is a pandas data frame of shape (412989, 17). The info for this data frame is as follows:

```
Int64Index: 412989 entries, 1754275 to 5058762
Data columns (total 17 columns):
 0   age-high-bp-diagnosed            412989 non-null  float64
 1   average-dias-0                   412989 non-null  float64
 2   average-sys-0                    412989 non-null  float64
 3   average-pulse-0                  412989 non-null  float64
 4   history-of-diabetes              412989 non-null  bool
 5   gender                           412989 non-null  int64
 6   age-0                            412989 non-null  float64
 7   hypertention-medication-0        412989 non-null  bool
 8   mother-smoker                    412989 non-null  float64
 9   smoker                           412989 non-null  bool
 10  ex-smoker                        412989 non-null  bool
 11  non-smoker                       412989 non-null  bool
 12  amount-combined                  412989 non-null  float64
 13  ex-penalty                       412989 non-null  float64
 14  average-BMI-0                    412989 non-null  float64
 15  diff-age-and-agehighbpdiagnosed  412989 non-null  float64
 16  diff-blood-pressures             412989 non-null  float64
dtypes: bool(5), float64(11), int64(1)
memory usage: 42.9 MB
```

ahmedmalaa commented 4 years ago

Thanks Mojgan. I think this issue is caused by the R wrapper on top of the missForest algorithm. I recommend you do imputation externally using any imputer (e.g. MICE) and then apply AP while turning the imputation option off. I will further investigate this bug and fix it in the next update.
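A minimal sketch of the suggested workaround (external imputation before fitting). This uses simple column-wise mean imputation as a placeholder; a MICE-style imputer, e.g. scikit-learn's `IterativeImputer` or the R `mice` package, would be closer to the suggestion. The frame `X_` and its column names below are purely illustrative:

```python
import numpy as np
import pandas as pd

def impute_mean(X: pd.DataFrame) -> pd.DataFrame:
    """Column-wise mean imputation; a stand-in for a MICE-style imputer."""
    return X.fillna(X.mean(numeric_only=True))

# Illustrative frame with missing values:
X_ = pd.DataFrame({"age": [63.0, 55.0, np.nan, 70.0],
                   "sys_bp": [140.0, np.nan, 132.0, 151.0]})
X_clean = impute_mean(X_)
assert not X_clean.isna().any().any()  # no NaNs remain
```

The imputed `X_clean` would then be passed to AutoPrognosis with its built-in imputation turned off.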

mojgan-ph commented 4 years ago

Can you please let me know how to turn the imputation option off? Is there a manual for AutoPrognosis that I can read?

ahmedmalaa commented 4 years ago

You can set `is_nan=False` in the instantiation of the `AutoPrognosis_Classifier` object.

mojgan-ph commented 4 years ago

Thank you :)

mojgan-ph commented 4 years ago

Hi Ahmed, AutoPrognosis has now been running for more than 4 hours. Can that be normal? Is there a way I can see the progress? Is there an option for verbose logging? It only printed the following shortly after the start of the run:

[Screenshot: Screen Shot 2020-05-06 at 2.26.22 pm]

I have made the classifier object as in the tutorial, adding `is_nan=False`:

```python
AP_mdl = model.AutoPrognosis_Classifier(
    metric=metric, CV=5, num_iter=3, kernel_freq=100,
    ensemble=True, ensemble_size=3, Gibbs_iter=100, burn_in=50,
    num_components=3, acquisition_type=acquisition_type, is_nan=False)
```

I also need some help understanding the parameters that the classifier constructor takes. In your paper I can see that you set AutoPrognosis to conduct 200 iterations of the Bayesian optimization procedure. Is that set by num_iter or Gibbs_iter? And what are kernel_freq, num_components, and burn_in?

Best, Mojgan

ahmedmalaa commented 4 years ago

Hi Mojgan,

Based on the size of your data set, your experiment will likely need to run for multiple days if you use a large number of iterations (num_iter). You can speed up the process by reducing the number of iterations. However, if you keep num_iter at 3, your experiment should probably finish within one day.

I am not sure which paper you are referring to, but all my medical papers used a very different, earlier version of this algorithm, so its parameter settings do not necessarily match how they are defined in this version. That said, you can take the num_iter parameter to be the number of iterations of the Bayesian optimization procedure.

Thanks.


mojgan-ph commented 4 years ago

Thank you for the clarification.

Best, Mojgan

mojgan-ph commented 4 years ago

Hi Ahmed,

To get faster runs of the tool, I have made some changes to AutoPrognosis for myself so that it only includes a few classification algorithms. I have forked your repo; it is not a stable version yet. I would appreciate any design or usage documents you may have to help me with my changes. It would also be great to have your email address so that I can ask my questions directly.

I would also appreciate any document that explains the report file. I made a run tuning a few classifiers, and the final report looks like the following. I am quite confused about what these mean.

Best, Mojgan

```
Score
classifier aucroc  0.721
classifier aucprc  0.060
ensemble   aucroc  0.721
ensemble   aucprc  0.059

Report
best score single pipeline (while fitting)  0.718
model_names_single_pipeline                 [ Gradient Boosting ]
best ensemble score (while fitting)         0.719
ensemble_pipelines                          ['[ Gradient Boosting ]', '[ XGBoost ]', '[ Gradient Boosting ]']
ensemble_pipelines_weight                   [0.2865448126747815, 0.42185017656977897, 0.2916050107554396]
...
acquisition_type                            LCB
kernel_members 0                            ['Gradient Boosting']
kernel_members 1                            ['Adaboost']
kernel_members 2                            ['Neural Network', 'XGBoost', 'Random Forest']
...
Average performance per classifier (ignoring hyperparameters):
0  Gradient Boosting  100  0.676  0.050
1  XGBoost             31  0.671  0.049
2  Random Forest       39  0.670  0.051
3  AdaBoost           100  0.646  0.044
4  NeuralNet           30  0.500  0.022
```