aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

Having problems using Aim with fairseq #3063

Open henrycharlesworth opened 7 months ago

henrycharlesworth commented 7 months ago

❓Question

The fairseq library has built-in support for Aim, but I am struggling to get it working. I'm not sure whether I'm doing something wrong or whether the fairseq support is out of date. The fairseq repo is fairly inactive, so I thought I would ask here.

I am working locally. I run aim server and see: "Server is mounted on 0.0.0.0:53800".

I then add the following to the config.yaml file of my fairseq experiment:

common:
   aim_repo: aim://0.0.0.0:53800

and run the experiment. It seems to work at first: Aim detects the experiment and the log starts with:

[2023-11-15 14:31:07,453][fairseq.logging.progress_bar][INFO] - Storing logs at Aim repo: aim://0.0.0.0:53800
[2023-11-15 14:31:07,480][aim.sdk.reporter][INFO] - creating RunStatusReporter for f6f19ecf0e2147b19e24d52f
[2023-11-15 14:31:07,482][aim.sdk.reporter][INFO] - starting from: {}
[2023-11-15 14:31:07,482][aim.sdk.reporter][INFO] - starting writer thread for <aim.sdk.reporter.RunStatusReporter object at 0x7f57117363e0>
[2023-11-15 14:31:08,471][fairseq.trainer][INFO] - begin training epoch 1
[2023-11-15 14:31:08,471][fairseq_cli.train][INFO] - Start iterating over samples
[2023-11-15 14:31:10,821][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
[2023-11-15 14:31:12,261][fairseq.trainer][INFO] - NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
[2023-11-15 14:31:12,261][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2023-11-15 14:31:12,266][fairseq.logging.progress_bar][INFO] - Storing logs at Aim repo: aim://0.0.0.0:53800
[2023-11-15 14:31:12,283][fairseq.logging.progress_bar][INFO] - Appending to run: f6f19ecf0e2147b19e24d52f

but then I get an error:

...
  File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 64, in progress_bar
    bar = AimProgressBarWrapper(
  File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 365, in __init__
    self.run = get_aim_run(aim_repo, aim_run_hash)
  File "/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 333, in get_aim_run
    return Run(run_hash=run_hash, repo=repo)
  File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/lib/python3.10/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/lib/python3.10/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/lib/python3.10/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/lib/python3.10/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
  File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
  File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "lib/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
Exception in thread Thread-13 (worker):
Traceback (most recent call last):
  File "lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/lib/python3.10/site-packages/aim/ext/transport/client.py", line 301, in _run_write_instructions
    raise_exception(response.exception)
  File "/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
aim.ext.transport.message_utils.UnauthorizedRequestError: 3310c526-aa51-47ef-ba87-fbf75f80f610

Does anyone have any idea what might be causing this, or whether there is something wrong with the approach I'm taking? I've tried a variety of Aim versions (going back to the versions from when fairseq was more actively developed) and I still get errors.

SGevorg commented 7 months ago

Adding @tmynn to this thread as he has put the integration together.

alberttorosyan commented 7 months ago

@SGevorg, @henrycharlesworth, it seems this line points to the real error:

TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'
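
For context, here is a minimal sketch of how a message like this can arise. It assumes the client rebuilds the server-side exception with the expression shown in the traceback (`raise exception(*args) if args else exception()`) and that no constructor arguments were transmitted. The `Timeout` class below is a hypothetical stand-in, not Aim's or any library's actual class; the only property borrowed from the traceback is that its constructor requires `lock_file`. Under those assumptions, the TypeError hides a lock timeout rather than being the root cause itself.

```python
class Timeout(TimeoutError):
    """Hypothetical stand-in: a lock-timeout error whose constructor needs lock_file."""

    def __init__(self, lock_file):
        super().__init__(lock_file)
        self.lock_file = lock_file


def raise_exception(exception, args=None):
    # Same shape as the line shown in the traceback (message_utils.py, line 76).
    raise exception(*args) if args else exception()


def describe_failure(exception, args=None):
    """Return (type name, message) of whatever raise_exception actually raises."""
    try:
        raise_exception(exception, args)
    except Exception as err:
        return type(err).__name__, str(err)


# Without args, building the Timeout itself fails, so a TypeError surfaces instead.
print(describe_failure(Timeout))
# With the argument supplied, the intended Timeout is raised.
print(describe_failure(Timeout, ("run.lock",)))
```

If this reading is right, the question to chase is why the lock on the run timed out, not the missing argument.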

@henrycharlesworth, are you using the latest version of Aim?

henrycharlesworth commented 7 months ago

I think so, I'm using 3.17.5. I also tried a number of earlier versions, and that didn't seem to help.

sandeep-biddala commented 5 months ago

Is there any solution for this? I have been getting this error when I try to retrieve an existing run by its hash.
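
One possible reading, not confirmed anywhere in this thread: the traceback fails inside `self._lock.lock(...)`, and the log shows "Appending to run" for a hash that was already opened a few seconds earlier, so the second open may be timing out on a lock that the first one still holds. The sketch below illustrates that pattern with the standard library only. `RunLock`, `open_run`, and the timeout value are hypothetical names chosen for illustration and do not reflect Aim's implementation.

```python
import threading


class RunLock:
    """Hypothetical per-run exclusive lock (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, timeout):
        # True if the lock was obtained within `timeout` seconds, else False.
        return self._lock.acquire(timeout=timeout)

    def release(self):
        self._lock.release()


locks = {}  # run hash -> RunLock


def open_run(run_hash, timeout=0.1):
    """Open a run for writing; fail if another writer already holds its lock."""
    lock = locks.setdefault(run_hash, RunLock())
    if not lock.acquire(timeout):
        raise TimeoutError(f"run {run_hash} is already locked by another writer")
    return lock


first = open_run("f6f19ecf0e2147b19e24d52f")
try:
    # Second open of the same hash while the first is still held.
    open_run("f6f19ecf0e2147b19e24d52f")
except TimeoutError as err:
    print(err)

first.release()
second = open_run("f6f19ecf0e2147b19e24d52f")  # succeeds once the lock is free
second.release()
```

If the real cause is similar, closing the first handle before reopening the run by hash would be the thing to try, though whether the fairseq integration allows that is not established here.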