aws-samples / sagemaker-101-workshop

Hands-on demonstrations for data scientists exploring Amazon SageMaker

[Built-in algos] Need to convert one-hot variables to numerics #36

Closed by athewsey 3 months ago

athewsey commented 7 months ago

Perhaps some pandas behaviour changed? Currently the following line in the notebook "1 Autopilot and XGBoost.ipynb":

df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

...is yielding boolean-typed columns for all the one-hot encoded variables. This is consistent with the current pandas doc, and the notebook seems to train the XGBoost model fine, but the XGBoost evaluation Batch Transform job fails with:

RuntimeError: Loading csv data failed with Exception, please ensure data is in csv format:
 <class 'ValueError'>
 could not convert string to float: 'False'
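
For reference, a minimal sketch of the symptom outside SageMaker. The toy DataFrame and its column names are made up for illustration, and dtype=bool is passed explicitly so the sketch behaves the same on older pandas versions. It only mimics the float parsing step and is not the algorithm container's actual loader:

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the workshop's df_model_data
df = pd.DataFrame({"age": [34, 51], "job": ["admin", "technician"]})

# dtype=bool mirrors the boolean indicator columns described above
encoded = pd.get_dummies(df, dtype=bool)

# Write a headerless CSV to an in-memory buffer
buffer = io.StringIO()
encoded.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)  # boolean cells are written as the text True / False

# Parsing every cell as a float fails on the boolean text
first_row = csv_text.splitlines()[0].split(",")
try:
    [float(cell) for cell in first_row]
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The boolean columns survive in memory but become the strings True and False once written to CSV, which is where a strictly numeric reader would trip.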

I believe we need to add dtype=int to the get_dummies() call so that the generated train/val/test datasets are fully numeric and compatible with the SageMaker XGBoost algorithm. I haven't quite finished testing it through yet, though.
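
A sketch of the proposed change, again on a made-up toy DataFrame rather than the notebook's real data. It has not been run against the Batch Transform job, so treat it as an illustration of the dtype argument only:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

# Hypothetical toy data standing in for the workshop's df_model_data
df = pd.DataFrame({"age": [34, 51], "job": ["admin", "technician"]})

# dtype=int makes the indicator columns 0/1 integers instead of booleans
encoded = pd.get_dummies(df, dtype=int)

# Every column should be numeric and none should be boolean
# (pandas counts bool as numeric, hence the explicit bool check)
all_numeric = all(
    is_numeric_dtype(t) and not is_bool_dtype(t) for t in encoded.dtypes
)
print(all_numeric)

# With no path given, to_csv returns the CSV as a string
csv_text = encoded.to_csv(header=False, index=False)
print(csv_text)
```

With integer indicators, every cell in the CSV parses as a float.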

athewsey commented 3 months ago

Fixed in linked PR