GoogleCloudPlatform / ml-on-gcp

Machine Learning on Google Cloud Platform
Apache License 2.0

gpu_utilization_agent.service: Failed with result 'exit-code' #42

Closed. gogasca closed this issue 5 years ago

gogasca commented 5 years ago

The GPU utilization agent failed after about 1 hour on an n1-standard-16 instance with 4 V100 GPUs while training planespotting:

python -m trainer_yolo.main --hp-layers 17 --tiledata "gs://planespotting-data-public/tiles_from_USGS_photos" --hp-evaluations 4 --hp-iterations 9400 --hp-batch-size 32 && date

Jan 30 18:58:37 v100-benchmark systemd[1]: Stopped GPU Utilization Metric Agent.
Jan 30 18:58:37 v100-benchmark systemd[1]: Started GPU Utilization Metric Agent.
Jan 30 18:58:37 v100-benchmark bash[23388]: mesg: ttyname failed: Inappropriate ioctl for device
Jan 30 18:58:38 v100-benchmark bash[23388]: Traceback (most recent call last):
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/root/report_gpu_metrics.py", line 116, in <module>
Jan 30 18:58:38 v100-benchmark bash[23388]:     main()
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/root/report_gpu_metrics.py", line 108, in main
Jan 30 18:58:38 v100-benchmark bash[23388]:     instance_id, zone, project_id)
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/root/report_gpu_metrics.py", line 58, in report_metric
Jan 30 18:58:38 v100-benchmark bash[23388]:     client.create_time_series(project_name, [series])
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/usr/local/lib/python2.7/dist-packages/google/cloud/monitoring_v3/gapic/metric_service_client.py", line 897, in create_time_series
Jan 30 18:58:38 v100-benchmark bash[23388]:     request, retry=retry, timeout=timeout, metadata=metadata
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/usr/local/lib/python2.7/dist-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
Jan 30 18:58:38 v100-benchmark bash[23388]:     return wrapped_func(*args, **kwargs)
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/usr/local/lib/python2.7/dist-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
Jan 30 18:58:38 v100-benchmark bash[23388]:     on_error=on_error,
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/usr/local/lib/python2.7/dist-packages/google/api_core/retry.py", line 179, in retry_target
Jan 30 18:58:38 v100-benchmark bash[23388]:     return target()
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/usr/local/lib/python2.7/dist-packages/google/api_core/timeout.py", line 214, in func_with_timeout
Jan 30 18:58:38 v100-benchmark bash[23388]:     return func(*args, **kwargs)
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/usr/local/lib/python2.7/dist-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
Jan 30 18:58:38 v100-benchmark bash[23388]:     six.raise_from(exceptions.from_grpc_error(exc), exc)
Jan 30 18:58:38 v100-benchmark bash[23388]:   File "/usr/local/lib/python2.7/dist-packages/six.py", line 737, in raise_from
Jan 30 18:58:38 v100-benchmark bash[23388]:     raise value
Jan 30 18:58:38 v100-benchmark bash[23388]: google.api_core.exceptions.InvalidArgument: 400 One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: custom.googleapis.com/gpu_utilization, Timestamps: {Youngest Existing: '2019/01/30-10:58:33.791', New: '2019/01/30-10:58:38.189'}}: timeSeries[0]
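
To make the error concrete, the gap between the two timestamps it reports can be computed directly. This is only an illustrative calculation on the values quoted in the log; the variable names are made up for the example, and the format string is an assumption about how the API prints these timestamps.

```python
from datetime import datetime

# Timestamp layout as it appears in the error message (assumed format).
FMT = "%Y/%m/%d-%H:%M:%S.%f"

# Values copied from the InvalidArgument message above.
youngest_existing = datetime.strptime("2019/01/30-10:58:33.791", FMT)
new_point = datetime.strptime("2019/01/30-10:58:38.189", FMT)

# Seconds between the last accepted point and the rejected one.
gap_seconds = (new_point - youngest_existing).total_seconds()
print(gap_seconds)
```

The rejected point arrived roughly 4.4 seconds after the previous one. The log also shows systemd stopping and starting the agent one second before the failure, so the restart may be what caused the early write.
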
gogasca commented 5 years ago

Need to increase the interval between API requests. #44
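
One way to space out the writes is a small gate that blocks until a minimum interval has elapsed since the previous write. This is a minimal sketch, not the change made in #44 (which is not shown in this thread): `MinIntervalGate` is a hypothetical helper, and the interval of 10 seconds used in the usage lines is a placeholder, not the metric's actual sampling period.

```python
import time


class MinIntervalGate(object):
    """Blocks so that successive wait() calls are at least min_interval apart.

    Hypothetical helper for illustration; not part of report_gpu_metrics.py.
    """

    def __init__(self, min_interval, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the gate can be tested
        self._sleep = sleep
        self._last = None        # time of the previous write, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last write: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch (assumed names; report_metric is the function in the traceback):
# gate = MinIntervalGate(10)   # placeholder interval, in seconds
# while True:
#     gate.wait()
#     report_metric(...)
```

The gate only remembers the last write within one process, so it does not cover the case where systemd restarts the agent shortly after a successful write. Catching the `google.api_core.exceptions.InvalidArgument` error from the traceback and skipping that sample, rather than exiting, would be one way to handle that case.
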