aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

[Bug Report] `introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.ipynb` fails during training w/ data load error #2921

Open aduriseti opened 3 years ago

aduriseti commented 3 years ago

Link to the notebook: https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.ipynb

Describe the bug: The notebook fails during the training step. Inspecting the job's failure reason gives:

AlgorithmError: framework error:
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/data_utils.py", line 414, in _get_parquet_dmatrix_pipe_mode
    for record in reader:
mlio.CorruptHeaderError: The record does not start with the Parquet magic number.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/training.py", line 94, in main
    train(framework.training_env())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/training.py", line 90, in train
    run_algorithm_mode()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_xgboost_container/training.py", line 68, in run_algorithm_mode
    checkpoint_config=checkpoint_config
  File "/miniconda3/lib/python3.7/site-packages/sag
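The `CorruptHeaderError` above complains about the Parquet magic number. An unencrypted Parquet file begins and ends with the 4 bytes `PAR1`, so one way to rule out a damaged input file is to check those bytes on a local copy of the training data. The following is a minimal sketch, not part of the notebook; `has_parquet_magic` is a hypothetical helper name, and it does not prove that the files are the cause of this failure.

```python
import os

# 4-byte marker found at both the start and the end of an unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file at `path` starts and ends with the Parquet magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        if f.tell() < 8:
            # Too short to hold a separate leading and trailing marker.
            return False
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling `has_parquet_magic` on each file downloaded from the training channel's S3 prefix would show whether any object there is not a plain Parquet file (for example, a stray CSV or manifest object stored under the same prefix).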

To reproduce: Run the notebook. After the failure, inspect the job with `client.describe_training_job(TrainingJobName=job_name)["FailureReason"]`.
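The inspection step can be sketched as a small script. This assumes `boto3` is installed and AWS credentials are configured; `get_failure_reason` and `make_sagemaker_client` are hypothetical helper names, and `job_name` is whatever training job name the notebook generated.

```python
def get_failure_reason(client, job_name):
    """Return the FailureReason of a training job, or None if the job reports none."""
    description = client.describe_training_job(TrainingJobName=job_name)
    # FailureReason is only present in the response when the job has failed.
    return description.get("FailureReason")


def make_sagemaker_client():
    """Create a SageMaker client (needs boto3 and configured AWS credentials)."""
    import boto3  # imported lazily so get_failure_reason can be used with a stub client

    return boto3.client("sagemaker")
```

For example, `print(get_failure_reason(make_sagemaker_client(), job_name))` prints the traceback with real line breaks instead of escaped `\n` sequences, which makes it easier to read than the raw string.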

Logs: xgboost_parquet_input_training.pdf

eitansela commented 2 years ago

I ran the same notebook on SageMaker Notebook. Training completed successfully, and the error was not reproduced.