aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Bug Report] SageMaker Pipelines Notebook - XGBoost is first installed with anaconda then upgraded with pip. #4779

Open kmanuwai opened 1 week ago

kmanuwai commented 1 week ago

Link to the notebook: https://github.com/aws/amazon-sagemaker-examples/blob/cddb473cc79c2eaae5d7fb79c456280cc5d6471d/%20%20%20ml_ops/sm-pipelines_batch_inference_step_decorator/sm-pipelines_batch_inference_step_decorator.ipynb

Describe the bug: Pipeline execution fails at the training step because two different versions of XGBoost are installed. One comes from Conda in the SageMaker Distribution image; the other comes from pip via requirements.txt.

To reproduce: Run the notebook on SageMaker Studio with SageMaker Distribution image 2.1.0.

Potential fix identified: The notebook works on SageMaker Distribution image 1.11. However, that image is no longer the default, so customers will run into this issue more often.

  1. Update XGBoost in requirements.txt to 2.1.1.
  2. Move early_stopping_rounds=5 from xgb.fit() to the XGBClassifier() constructor, as shown below:
    xgb = XGBClassifier(n_estimators=num_round, early_stopping_rounds=5, **param)
    xgb.fit(
        train_df,
        y_train,
        eval_set=[(validation_df, y_validation)],
    )
  3. The Evaluate step still needs a fix: it fails with a ModelBuilder error.
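Step 2 above can be sketched in a version-tolerant way, in case the same notebook has to run on both the older and newer images. This is only a sketch: the helper name `early_stopping_kwargs` is hypothetical, and the (1, 6) cutoff is an assumption based on my understanding that the scikit-learn wrapper constructor has accepted `early_stopping_rounds` since XGBoost 1.6. The only behavior confirmed in this issue is that it has to move to the constructor for 2.1.1.

```python
def early_stopping_kwargs(xgboost_version, rounds):
    """Return (constructor_kwargs, fit_kwargs) for a given xgboost version string.

    Assumption: from 1.6 onward the constructor accepts early_stopping_rounds;
    older releases only accept it in fit().
    """
    major, minor = (int(part) for part in xgboost_version.split(".")[:2])
    if (major, minor) >= (1, 6):
        return {"early_stopping_rounds": rounds}, {}
    return {}, {"early_stopping_rounds": rounds}


# Hypothetical usage inside the training step:
#   import xgboost
#   ctor_kw, fit_kw = early_stopping_kwargs(xgboost.__version__, 5)
#   xgb = XGBClassifier(n_estimators=num_round, **ctor_kw, **param)
#   xgb.fit(train_df, y_train, eval_set=[(validation_df, y_validation)], **fit_kw)
```

If the requirements.txt pin is fixed at 2.1.1, the plain constructor form shown in step 2 is simpler and this helper is unnecessary.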

Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/sagemaker/remote_function/invoke_function.py", line 144, in main
    _execute_remote_function(
  File "/opt/conda/lib/python3.11/site-packages/sagemaker/remote_function/invoke_function.py", line 119, in _execute_remote_function
    stored_function.load_and_invoke()
  File "/opt/conda/lib/python3.11/site-packages/sagemaker/remote_function/core/stored_function.py", line 189, in load_and_invoke
    result = func(*resolved_args, **resolved_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipykernel_1942/1143899264.py", line 26, in train
  File "/opt/conda/lib/python3.11/site-packages/xgboost/__init__.py", line 7, in <module>
    from . import collective, dask, rabit
  File "/opt/conda/lib/python3.11/site-packages/xgboost/collective.py", line 12, in <module>
    from .core import _LIB, _check_call, c_str, py_str, from_pystr_to_cstr
  File "/opt/conda/lib/python3.11/site-packages/xgboost/core.py", line 264, in <module>
    _LIB = _load_lib()
           ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/xgboost/core.py", line 258, in _load_lib
    raise ValueError(msg)
ValueError: Mismatched version between the Python package and the native shared object. Python package version: 1.7.1. Shared object version: 2.1.1. Shared object is loaded from: /opt/conda/lib/libxgboost.so.
Likely cause:
  * XGBoost is first installed with anaconda then upgraded with pip. To fix it please remove one of the installations.
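To confirm the mismatch before the training step runs, a small diagnostic along these lines might help. It is a sketch, not part of the notebook: the function name `find_xgboost_installs` is hypothetical, and it simply assumes that any file named `libxgboost.*` under the environment prefix is a candidate native library (the error above reports one at /opt/conda/lib/libxgboost.so).

```python
import sys
from importlib import metadata
from pathlib import Path


def find_xgboost_installs(prefix=sys.prefix):
    """Report the installed xgboost package version and native libraries under prefix."""
    try:
        # Version recorded in the installed package metadata (pip or conda).
        package_version = metadata.version("xgboost")
    except metadata.PackageNotFoundError:
        package_version = None

    root = Path(prefix)
    shared_objects = []
    if root.is_dir():
        # More than one hit suggests a conda copy and a pip copy coexist.
        shared_objects = sorted(str(p) for p in root.rglob("libxgboost.*"))
    return {"package_version": package_version, "shared_objects": shared_objects}
```

Calling `find_xgboost_installs()` inside the remote function's environment and printing the result should show whether a Conda-installed library and a pip-installed package are both present.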

After applying steps 1 and 2 of the fix above, the Evaluate step fails with this error:

ValueError: Unable to auto detect a DLC for framework xgboost, framework version py311 and python version 2.1.1. Please manually provide image_uri to ModelBuilder()
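The error asks for an explicit image_uri. One possible way to obtain one is `sagemaker.image_uris.retrieve`, sketched below. The helper name `resolve_xgboost_image` and the injectable `retrieve` argument are hypothetical and exist only so the function can be exercised without the SDK. Which container version is appropriate for a model trained with XGBoost 2.1.1 is an open question in this issue; the "1.7-1" value in the usage comment is only an example of a SageMaker XGBoost container version and may not be compatible.

```python
def resolve_xgboost_image(region, container_version, retrieve=None):
    """Look up a SageMaker XGBoost container image URI for the given region.

    `retrieve` defaults to sagemaker.image_uris.retrieve; it can be replaced
    with a stand-in for testing (an assumption made for this sketch).
    """
    if retrieve is None:
        # Imported lazily so the helper can be defined without the SDK present.
        from sagemaker import image_uris

        retrieve = image_uris.retrieve
    return retrieve(framework="xgboost", region=region, version=container_version)


# Hypothetical usage in the Evaluate step (other ModelBuilder arguments
# stay as they are in the notebook):
#   image_uri = resolve_xgboost_image("us-east-1", "1.7-1")
#   ModelBuilder(..., image_uri=image_uri)
```

This has not been verified against the notebook; it only shows where a manually chosen image could be plugged in.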

Note: I am an AWS employee. Please feel free to message internally.