GoogleCloudPlatform / llm-pipeline-examples

Apache License 2.0

Add training cluster monitoring that automatically restarts the cluster if training doesn't progress. #4

Closed. kuo-lin closed this issue 1 year ago.