comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

joblib "Broken Pipe" using scikit-learn grid-search crossfold validation after importing Comet ML #543

Closed DFuller134 closed 3 months ago

DFuller134 commented 5 months ago

Describe the Bug

After importing comet_ml, a scikit-learn-based training script fails during GridSearchCV cross-validation with a "Broken pipe" exception raised in joblib. The same script works fine without the comet_ml import.

Expected behavior

The training script should execute to completion with comet_ml imported.

Where is the issue?

Third Party Integrations (scikit-learn). Stack trace indicates calls into comet_ml monkey-patching.

To Reproduce

Steps to reproduce the behavior:

  1. import comet_ml
  2. instantiate a Comet ML experiment
  3. call some exp.log... methods
  4. instantiate a scikit-learn GridSearchCV and call fit to start the search

Stack Trace

Fitting 5 folds for each of 36 candidates, totalling 180 fits
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 129, in __call__
    return self.function(*args, **kwargs)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 949, in _fit_and_score
    print(end_msg)
BrokenPipeError: [Errno 32] Broken pipe
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 994, in <module>
    main()
  File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 972, in main
    run_experiment(
  File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 771, in run_experiment
    pred_df, y_test, best_grid_rgr, X_train, X, y = run_xgb(df, pre_pipe, post_pipe, params)
  File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 740, in run_xgb
    grid = grid.fit(X_train, y_train)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/comet_ml/monkey_patching.py", line 316, in wrapper
    raise exception_raised
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/comet_ml/monkey_patching.py", line 287, in wrapper
    return_value = original(*args, **kwargs)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 970, in fit
    self._run_search(evaluate_candidates)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 1527, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 916, in evaluate_candidates
    out = parallel(
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 67, in __call__
    return super().__call__(iterable_with_config)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 2007, in __call__
    return output if self.return_generator else list(output)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1650, in _get_outputs
    yield from self._retrieve()
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1754, in _retrieve
    self._raise_error_fast()
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 745, in get_result
    return self._return_or_raise()
  File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 763, in _return_or_raise
    raise self._result
BrokenPipeError: [Errno 32] Broken pipe
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     name                  : outside_cheese_2516

Comet Debug Log

comet.log

Screenshots or GIFs

N/A

Additional context (code fragment - fails on grid.fit)

    # Instantiate & Fit Grid Search Object
    grid = GridSearchCV(rgr, params, cv=5, n_jobs=-1, scoring=scoring, verbose=5)
    grid = grid.fit(X_train, y_train)
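
The worker traceback above ends in print(end_msg), so the failing operation is a write to an output pipe that no longer has a reader. The following is a minimal sketch of that mechanism only, using the standard library and no Comet or scikit-learn code; the helper name and payload are hypothetical, and it does not explain why the pipe is closed in this setup. As an unverified assumption, in recent scikit-learn versions that print appears to run only when verbose is above 1, so lowering verbose may be worth trying as a temporary workaround.

```python
import errno
import os


def write_to_readerless_pipe(payload: bytes = b"[CV 1/5] END\n"):
    """Write to a pipe whose read end is closed; return the errno raised, or None."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the consumer of the output going away
    try:
        os.write(write_fd, payload)
    except BrokenPipeError as exc:
        # Python ignores SIGPIPE by default, so the failed write surfaces here.
        return exc.errno
    finally:
        os.close(write_fd)
    return None


# Compare against errno.EPIPE, the code behind "[Errno 32] Broken pipe" on Linux.
print(write_to_readerless_pipe() == errno.EPIPE)
```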

dsblank commented 5 months ago

Looking through your log (search for "Traceback") I see this issue:

comet_ml.vendor.nvidia_ml.pynvml.NVMLError_NotSupported: Not Supported

but that shouldn't cause any issues. I also see:

[[13.3],
       [33.9],
       [54.5],
       [75.1],
       [95.7]]
ValueError: can only convert an array of size 1 to a Python scalar

which could be a Comet bug.

Also:

ModuleNotFoundError: No module named 'graphviz'

Try running pip install graphviz (or installing another dot package) to see if that helps.

DFuller134 commented 5 months ago

I addressed each of these issues except the NVML error (likely related to the GPU drivers needed to log GPU metrics). The ValueError cleared up when I set COMET_DISABLE_AUTO_LOGGING=1. I also installed graphviz to clear up that issue.
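
For anyone who prefers to set the variable from the script rather than the shell, a rough sketch follows. It assumes the variable has to be in place before comet_ml is imported, which is the cautious ordering but is not confirmed in this thread; the helper name is hypothetical.

```python
import os


def disable_comet_auto_logging(env=None):
    """Set COMET_DISABLE_AUTO_LOGGING=1 in the given mapping (os.environ by default)."""
    env = os.environ if env is None else env
    env["COMET_DISABLE_AUTO_LOGGING"] = "1"
    return env


disable_comet_auto_logging()

# Assumption: import Comet only after the variable is set.
# import comet_ml
```

The shell equivalent is to export COMET_DISABLE_AUTO_LOGGING=1 before launching the training script.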

I agree that the ValueError seems like a Comet ML bug.

dsblank commented 5 months ago

@DFuller134 thanks for your update! I'll pass on the details of the NVMLError_NotSupported error to our engineering team.

dsblank commented 5 months ago

Do you know what [[13.3], [33.9], [54.5], [75.1], [95.7]] is? If you are trying to log a parameter (or step or epoch) value, it can't be a list of values. I believe that these are the only places that this error could come from.
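
If the array does turn out to be a parameter value, one possible way around the restriction is to split it into scalar entries before logging. The sketch below is only an illustration: the helper names and the "threshold" parameter name are hypothetical, and nothing in this thread confirms where the array actually comes from.

```python
def flatten(value):
    """Yield the leaf values of an arbitrarily nested list or tuple."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def as_scalar_params(name, value):
    """Return {name: x} for a single value, or {name_0: x0, name_1: x1, ...} for several."""
    leaves = list(flatten(value))
    if len(leaves) == 1:
        return {name: leaves[0]}
    return {f"{name}_{i}": leaf for i, leaf in enumerate(leaves)}


# "threshold" is a made-up name for illustration.
params = as_scalar_params("threshold", [[13.3], [33.9], [54.5], [75.1], [95.7]])
print(params)
```

The resulting dict holds one scalar per key, so it could then be passed to experiment.log_parameters(params) on a Comet experiment.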