Closed: kpouget closed this issue 1 week ago
From the configs linked, the error is that you are saving no checkpoints: you are not setting `save_model_dir` and you are setting `save_strategy="no"`, so no checkpoints are saved, which results in the error. This is indeed a bug and will be fixed in the upcoming release, but if you pass in a `save_model_dir` it should succeed.
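For example, here is a minimal sketch of the relevant part of the tuning config. The ConfigMap layout (the `config.json` key) and the `/mnt/output` path are illustrative placeholders, not taken from your attached files:

```yaml
# Sketch only: the config.json key and the mount path are assumptions
# about the ConfigMap layout; every other tuning parameter is omitted.
data:
  config.json: |
    {
      "save_strategy": "no",
      "save_model_dir": "/mnt/output/tuned-model"
    }
```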
Thanks @anhuong for taking a look, that makes sense now (we have had `save_strategy` enabled in the LoRA test for a long time, and we added `save_model_dir` only recently, after #311).

I tested the reproducer with this configuration flag, and it succeeded!
pod.good.txt
Just one last question:

> This indeed is a bug and will be fixed in the upcoming release

Will this fix allow saving no checkpoints at all (`save_strategy="no"` + `save_model_dir` undefined)? As part of our perf testing, we do not want to save anything at all [into permanent storage].
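Concretely, this is the shape of config we would like to be valid once the fix lands (same illustrative ConfigMap layout as in the sketch above):

```yaml
# Sketch only, same assumed ConfigMap layout as above: save_strategy "no"
# with save_model_dir left unset, so nothing is written to permanent
# storage. Today this combination triggers the error described here.
data:
  config.json: |
    {
      "save_strategy": "no"
    }
```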
Describe the bug
When I run my Granite fine-tuning jobs, I hit an error in the final steps of the fine-tuning process.
Platform
OpenShift AI
Container image: quay.io/modh/fms-hf-tuning:release-06f43ecf4d88c57018da9554c0baa6c4cf57d61a
Sample Code
pytorchjob.yaml.txt
configmap_entrypoint.yaml.txt
configmap_config.yaml.txt
Expected behavior
The fine-tuning job completes successfully.
Observed behavior
Additional context
See this log file:
logs.txt