GoogleCloudPlatform / llm-pipeline-examples

Apache License 2.0
107 stars 26 forks source link

Add training cluster monitoring, that automatically restart cluster if training doesn't progress. #5

Closed kuo-lin closed 1 year ago

abdallag commented 1 year ago

Can you please merge from main to allow the updated checks to pass?