Closed: kpouget closed this issue 1 week ago
From the configs linked, the error is that you are saving no checkpoints: you are not setting `save_model_dir` and you are setting `save_strategy="no"`, so no checkpoints are saved, which results in the error. This is indeed a bug and will be fixed in the upcoming release, but if you pass in a `save_model_dir` it should succeed.
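For example, here is a minimal sketch of the relevant part of the tuning config. The ConfigMap layout (the `config.json` key) and the `/mnt/output` path are illustrative placeholders, not taken from your attached files:

```yaml
# Sketch only: the config.json key and the mount path are assumptions
# about the ConfigMap layout; every other tuning parameter is omitted.
data:
  config.json: |
    {
      "save_strategy": "no",
      "save_model_dir": "/mnt/output/tuned-model"
    }
```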
Thanks @anhuong for taking a look, that makes sense now (we have had `save_strategy` enabled in the LoRA test for a long time, and we added `save_model_dir` only recently, after #311).

I tested the reproducer with this configuration flag, and it succeeded!
pod.good.txt
Just one last question:

> This indeed is a bug and will be fixed in the upcoming release

Will this fix allow saving no checkpoints at all (`save_strategy="no"` + `save_model_dir` undefined)? As part of our perf testing, we do not want to save anything at all [into permanent storage].
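Concretely, this is the shape of config we would like to be valid once the fix lands (same illustrative ConfigMap layout as in the sketch above):

```yaml
# Sketch only, same assumed ConfigMap layout as above: save_strategy "no"
# with save_model_dir left unset, so nothing is written to permanent
# storage. Today this combination triggers the error described here.
data:
  config.json: |
    {
      "save_strategy": "no"
    }
```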
Describe the bug
When I run my Granite fine-tuning jobs, I hit an error in the final steps of the fine-tuning process.
Platform
OpenShift AI
Container image: quay.io/modh/fms-hf-tuning:release-06f43ecf4d88c57018da9554c0baa6c4cf57d61a
Sample Code
pytorchjob.yaml.txt
configmap_entrypoint.yaml.txt
configmap_config.yaml.txt
Expected behavior
The fine-tuning job completes successfully.
Observed behavior
Additional context
See this log file:
logs.txt