UrbsLab / STREAMLINE

Simple Transparent End-To-End Automated Machine Learning Pipeline for Supervised Learning in Tabular Binary Classification Data
https://urbslab.github.io/STREAMLINE/
GNU General Public License v3.0

Issue with Permutation Importance - "ValueError: assignment destination is read-only" #5

Open mycho830 opened 8 months ago

mycho830 commented 8 months ago

Hello,

Thank you for providing such a great tool. It has been incredibly helpful in my research. However, I recently encountered an issue after downloading the latest version.

When performing analysis, I encountered the following error during the phase 5 modeling, specifically: "ValueError: assignment destination is read-only."

I suspected a parallelization issue and modified the code by setting run_parallel=False, but the problem persists. Could you please provide any assistance or insights into resolving this issue?

Here's the code snippet I used:

```python
from streamline.runners.model_runner import ModelExperimentRunner

model_exp = ModelExperimentRunner(
    output_path, experiment_name, algorithms=algorithms,
    exclude=exclude, class_label=class_label,
    instance_label=instance_label, scoring_metric=primary_metric,
    metric_direction=metric_direction,
    training_subsample=training_subsample,
    use_uniform_fi=use_uniform_FI, n_trials=n_trials,
    timeout=timeout, save_plots=False,
    do_lcs_sweep=do_lcs_sweep, lcs_nu=lcs_nu, lcs_n=lcs_N,
    lcs_iterations=lcs_iterations,
    lcs_timeout=lcs_timeout, resubmit=False)

model_exp.run(run_parallel=False)
```

The error details are as follows:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [24], in <cell line: 13>()
      1 from streamline.runners.model_runner import ModelExperimentRunner
      2 model_exp = ModelExperimentRunner(
      3     output_path, experiment_name, algorithms=algorithms,
      4     exclude=exclude, class_label=class_label,
   (...)
     11     lcs_iterations=lcs_iterations,
     12     lcs_timeout=lcs_timeout, resubmit=False)
---> 13 model_exp.run(run_parallel=False)

File /N/slate/minycho/tools/python/STREAMLINE/streamline/runners/model_runner.py:238, in ModelExperimentRunner.run(self, run_parallel)
    236     job_list.append((job_obj, copy.deepcopy(model)))
    237 else:
--> 238     job_obj.run(model)
    239 if run_parallel and run_parallel != "False" and not self.run_cluster:
    240     # run_jobs(job_list)
    241     Parallel(n_jobs=num_cores)(
    242         delayed(model_runner_fn)(job_obj, model
    243         ) for job_obj, model in tqdm(job_list))

File /N/slate/minycho/tools/python/STREAMLINE/streamline/modeling/modeljob.py:83, in ModelJob.run(self, model)
     81 self.algorithm = model.small_name
     82 logging.info('Running ' + str(self.algorithm) + ' on ' + str(self.train_file_path))
---> 83 ret = self.run_model(model)
     85 # Pickle all evaluation metrics for ML model training and evaluation
     86 pickle.dump(ret, open(self.full_path
     87                       + '/model_evaluation/pickled_metrics/'
     88                       + self.algorithm + 'CV' + str(self.cv_count) + "_metrics.pickle", 'wb'))

File /N/slate/minycho/tools/python/STREAMLINE/streamline/modeling/modeljob.py:149, in ModelJob.run_model(self, model)
    144 self.export_best_params(self.full_path + '/models/' + self.algorithm +
    145                         '_usedparams' + str(self.cv_count) + '.csv',
    146                         model.params)
    148 if self.uniform_fi:
--> 149     results = permutation_importance(model.model, x_train, y_train, n_repeats=10, random_state=self.random_state,
    150                                      scoring=self.scoring_metric)
    151     self.feature_importance = results.importances_mean
    152 else:

File ~/.local/lib/python3.10/site-packages/sklearn/inspection/_permutation_importance.py:258, in permutation_importance(estimator, X, y, scoring, n_repeats, n_jobs, random_state, sample_weight, max_samples)
    254     scorer = _MultimetricScorer(scorers=scorers_dict)
    256 baseline_score = _weights_scorer(scorer, estimator, X, y, sample_weight)
--> 258 scores = Parallel(n_jobs=n_jobs)(
    259     delayed(_calculate_permutation_scores)(
    260         estimator,
    261         X,
    262         y,
    263         sample_weight,
    264         col_idx,
    265         random_seed,
    266         n_repeats,
    267         scorer,
    268         max_samples,
    269     )
    270     for col_idx in range(X.shape[1])
    271 )
    273 if isinstance(baseline_score, dict):
    274     return {
    275         name: _create_importances_bunch(
    276             baseline_score[name],
   (...)
    280         for name in baseline_score
    281     }

File ~/.local/lib/python3.10/site-packages/sklearn/utils/parallel.py:63, in Parallel.__call__(self, iterable)
     58 config = get_config()
     59 iterable_with_config = (
     60     (_with_config(delayed_func, config), args, kwargs)
     61     for delayed_func, args, kwargs in iterable
     62 )
---> 63 return super().__call__(iterable_with_config)

File ~/.local/lib/python3.10/site-packages/joblib/parallel.py:1863, in Parallel.__call__(self, iterable)
   1861     output = self._get_sequential_output(iterable)
   1862     next(output)
-> 1863     return output if self.return_generator else list(output)
   1865 # Let's create an ID that uniquely identifies the current call. If the
   1866 # call is interrupted early and that the same instance is immediately
   1867 # re-used, this id will be used to prevent workers that were
   1868 # concurrently finalizing a task from the previous call to run the
   1869 # callback.
   1870 with self._lock:

File ~/.local/lib/python3.10/site-packages/joblib/parallel.py:1792, in Parallel._get_sequential_output(self, iterable)
   1790 self.n_dispatched_batches += 1
   1791 self.n_dispatched_tasks += 1
-> 1792 res = func(*args, **kwargs)
   1793 self.n_completed_tasks += 1
   1794 self.print_progress()

File ~/.local/lib/python3.10/site-packages/sklearn/utils/parallel.py:123, in _FuncWrapper.__call__(self, *args, **kwargs)
    121     config = {}
    122 with config_context(**config):
--> 123     return self.function(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/sklearn/inspection/_permutation_importance.py:62, in _calculate_permutation_scores(estimator, X, y, sample_weight, col_idx, random_state, n_repeats, scorer, max_samples)
     60     X_permuted[X_permuted.columns[col_idx]] = col
     61 else:
---> 62     X_permuted[:, col_idx] = X_permuted[shuffling_idx, col_idx]
     63 scores.append(_weights_scorer(scorer, estimator, X_permuted, y, sample_weight))
     65 if isinstance(scores[0], dict):

ValueError: assignment destination is read-only
```

Your help on this matter would be greatly appreciated.

Thank you, Min

ryanurbs commented 8 months ago

Hi Min, Thanks for informing us of your issue. We'll attempt to track down the problem and get back to you. We may reach out to try and get more information about your run.

raptor419 commented 8 months ago

Hi Min,

Thank you so much for informing us of your issue. This seems to be a highly specific issue within scikit-learn, and we will do our best to replicate and correct it so it doesn't affect any future analyses. Any information about the versions of the packages and the size of the datasets you're using would be very helpful. For now, would it be fair to assume you have the latest versions of all packages and a fairly large dataset?

Your localization of the issue to parallelization is very helpful and seems to be a step in the right direction. It appears to be an issue with internal parallelization within sklearn. While I did not find this exact error reported against the permutation_importance function in the scikit-learn repositories, I found a very similar issue dealing with the same error: scikit-learn/scikit-learn#5956 (https://github.com/scikit-learn/scikit-learn/issues/5956).
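For context, the error itself comes from NumPy: arrays that joblib memory-maps to share between workers are flagged read-only, so the in-place column shuffle inside `_calculate_permutation_scores` fails. A minimal sketch of that behavior (the array and indices here are just illustrative, not from STREAMLINE):

```python
import numpy as np

# Simulate an array that joblib has memory-mapped for worker processes:
# such arrays are flagged read-only, so in-place assignment fails with
# exactly the error seen in the traceback above.
x = np.arange(6, dtype=float).reshape(3, 2)
x.setflags(write=False)

try:
    x[:, 0] = x[[2, 0, 1], 0]  # same kind of column shuffle as sklearn does
except ValueError as e:
    print(e)  # -> "assignment destination is read-only"

# The workaround discussed in the linked issue amounts to operating on a
# writable copy instead of the shared read-only buffer:
x_copy = np.array(x, copy=True)
x_copy[:, 0] = x_copy[[2, 0, 1], 0]  # succeeds on the copy
```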

As a step-wise solution, I would try the following:

  1. Adding n_jobs=1 to the permutation_importance call on line 149 of modeljob.py, instead of the default None (which should already result in a single worker, but we should try it anyway).
  2. Updating the joblib library through conda/pip.
  3. Trying the solution mentioned in scikit-learn/scikit-learn#5956: https://github.com/scikit-learn/scikit-learn/issues/5956#issuecomment-422487505
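Step 1 above can be sketched in isolation like so; the model, data, and scoring metric below are stand-ins for the `model.model`, `x_train`/`y_train`, and `self.scoring_metric` used at line 149 of modeljob.py:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Stand-in model and data in place of STREAMLINE's trained model and CV split.
x_train, y_train = make_classification(n_samples=100, n_features=5,
                                       random_state=42)
model = LogisticRegression().fit(x_train, y_train)

results = permutation_importance(
    model, x_train, y_train,
    n_repeats=10,
    random_state=42,
    scoring="balanced_accuracy",  # stand-in for self.scoring_metric
    n_jobs=1,  # force sequential column permutations to avoid the
               # read-only assignment inside joblib workers
)
feature_importance = results.importances_mean  # one mean score per feature
```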

Do let me know if I am making fair assumptions or if I am wrong somewhere and if any of the steps above seem to have helped out. Feel free to reach out if you have any more queries or questions.

Thanks and regards, Harsh

mycho830 commented 8 months ago

Hi Harsh,

Thank you for your response. Your assumptions are accurate: we are using the latest versions, and we have 5 input files ranging in size from 4.7k to 55k.

I followed your step-wise solution, and adding n_jobs = 1 to the permutation_importance function on line 149 of modeljob.py did the trick! The issue seems to be resolved, and the analysis is running smoothly without encountering the parallelization error.

I sincerely appreciate your assistance and quick resolution.

Thanks once again, and have a great day!

Best regards, Min