Open tiru1930 opened 4 years ago
@tiru1930 Sorry that you ran into this problem.
Does the IAM role you used have the proper permissions set up to access the debug S3 bucket s3://{bucket}/{prefix}/debug?
Yes.
How long did the training job run? Did it successfully generate a model?
It ran for a few minutes, and yes, it generated a model and stored it in S3. I was able to deploy it.
Could you show me the entire training job if possible?
Do you mean the logs?
Yes. Sorry for the delay. Are you still experiencing this problem?
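One way to check the permission and output questions above is to list what actually exists under the debug location. This is a minimal sketch, not something posted in the thread: `split_s3_uri` and `list_debug_keys` are hypothetical helper names, and the client passed in is assumed to be a boto3 S3 client created with the same role as the training job.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_debug_keys(s3_client, debug_uri):
    """Return (key, size) pairs for every object under the debug prefix.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    bucket, prefix = split_s3_uri(debug_uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            found.append((obj["Key"], obj["Size"]))
    return found


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# for key, size in list_debug_keys(boto3.client("s3"), "s3://my-bucket/my-prefix/debug"):
#     print(size, key)
```

An empty result, or an access-denied error from the listing call, would point toward a permissions or configuration problem rather than a problem in the analysis code.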
Hi, I'm having this issue as well. The model will train successfully and I can deploy it to an endpoint, but the "training_job_end.ts" file is empty.
Here's the estimator object I'm using:
xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=s3_xgb_output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=s3_xgb_output_location,
        collection_configs=[
            CollectionConfig(name="metrics", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(
                name="feature_importance", parameters={"save_interval": str(save_interval)}
            ),
            CollectionConfig(name="full_shap", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="average_shap", parameters={"save_interval": str(save_interval)}),
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],
)
Then, I try to get the smdebug trial artifacts:
s3_output_path = xgb.latest_job_debugger_artifacts_path()
trial = smd.create_trial(s3_output_path)
Exception: MissingCollectionFiles: Training job has ended. All the collection files could not be loaded
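To narrow down this exception, it can help to see which kinds of files exist under the debugger artifacts prefix. The sketch below is only an illustration under stated assumptions: it assumes the usual smdebug output layout of `collections/`, `events/`, and `index/` folders plus a `training_job_end.ts` marker, and `summarize_debug_keys` and `job_ended_without_collections` are hypothetical helper names. It works on a plain list of S3 keys, however you obtain them.

```python
def summarize_debug_keys(keys, debug_prefix):
    """Count objects per top-level folder under the debugger output prefix."""
    base = debug_prefix.rstrip("/") + "/"
    summary = {"collections": 0, "events": 0, "index": 0, "end_marker": False, "other": 0}
    for key in keys:
        if not key.startswith(base):
            continue  # not part of the debugger output
        relative = key[len(base):]
        top_level = relative.split("/", 1)[0]
        if relative == "training_job_end.ts":
            summary["end_marker"] = True
        elif top_level in ("collections", "events", "index"):
            summary[top_level] += 1
        else:
            summary["other"] += 1
    return summary


def job_ended_without_collections(summary):
    """True when the end marker exists but no collection files were written."""
    return summary["end_marker"] and summary["collections"] == 0
```

If `job_ended_without_collections` returns `True`, the listing matches the symptom reported here: the job finished and wrote its end marker, but the hook saved no collection files for the trial to load.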
Describe the bug
Exception during rule evaluation: Customer Error: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded
To reproduce
Train with the XGBoost framework, with the debugger hook configured as below.
Expected behavior
I should get tensors saved in S3.
Screenshots or logs
System information
SageMaker Python SDK version: 2.6
Framework name (eg. PyTorch) or algorithm (eg. KMeans): XGBoost framework
Framework version: 0.90-2
Python version: 3.8
CPU or GPU: CPU
Custom Docker image (Y/N): N