foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

bug: AIM package being installed causes the trainer to expect the AIM server to be running. #131

Closed HarikrishnanBalagopal closed 3 weeks ago

HarikrishnanBalagopal commented 4 months ago

Describe the bug

If the AIM package is installed, the trainer also expects an AIM server to be running, and fails when it is not:

  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/storage/migrations/utils.py", line 28, in upgrade_database
    raise subprocess.SubprocessError(f'Database upgrade failed with exit code {exit_code}')
subprocess.SubprocessError: Database upgrade failed with exit code 1

The trainer should allow running tests and training in environments where the AIM package is installed but no AIM server is running.

Platform

Please provide details about the environment you are using, including the following:

Sample Code

Steps to reproduce:

  1. Create a conda or venv environment with the aim package and this library installed.
  2. Run https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/sft_trainer.py

Expected behavior

Training should complete without errors. The trainer should expect a running AIM server only if the appropriate environment variables, such as AIMSTACK_DB, are set.
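
To illustrate the expected behavior, here is a minimal sketch of such a check. It is not the project's actual implementation: the function names `aim_is_installed` and `should_enable_aim` are hypothetical, and it assumes that a non-empty AIMSTACK_DB (the variable named above) is the signal that AIM tracking is wanted.

```python
import importlib.util
import os
from typing import Mapping, Optional


def aim_is_installed() -> bool:
    """Return True if the `aim` package can be found in this environment."""
    return importlib.util.find_spec("aim") is not None


def should_enable_aim(
    env: Optional[Mapping[str, str]] = None,
    installed: Optional[bool] = None,
) -> bool:
    """Hypothetical helper: enable AIM tracking only on explicit opt-in.

    Assumption: a non-empty AIMSTACK_DB variable means the user wants AIM.
    Having the package installed is necessary but not sufficient.
    """
    env = os.environ if env is None else env
    installed = aim_is_installed() if installed is None else installed
    configured = bool(env.get("AIMSTACK_DB", "").strip())
    return installed and configured


if __name__ == "__main__":
    # With AIMSTACK_DB unset, this reports False even if `aim` is installed.
    print("AIM tracking enabled:", should_enable_aim())
```

With a check like this, the AIM callback would only be attached when `should_enable_aim()` returns True, so an environment that merely has the package installed would train without touching an AIM repository.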

Observed behavior

Training fails with an error raised from inside the AIM package:


  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/run.py", line 859, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/run.py", line 272, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/base_run.py", line 34, in __init__
    self.repo = get_repo(repo)
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 24, in get_repo
    repo = Repo.from_path(repo, init=True)
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/repo.py", line 210, in from_path
    repo = Repo(path, read_only=read_only, init=init)
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/repo.py", line 153, in __init__
    self.structured_db.run_upgrades()
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/storage/structured/db.py", line 98, in run_upgrades
    upgrade_database(self.db_url)
  File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/storage/migrations/utils.py", line 28, in upgrade_database
    raise subprocess.SubprocessError(f'Database upgrade failed with exit code {exit_code}')
subprocess.SubprocessError: Database upgrade failed with exit code 1

dushyantbehl commented 3 months ago

@HarikrishnanBalagopal I believe this is solved now. Can you close this?