HFAiLab / hai-platform

A high-performance deep learning training platform with task-level time-sharing scheduling of GPU compute
https://hfailab.github.io/hai-platform/
GNU Lesser General Public License v3.0

Submit python script using hai-cli but failed #5

Open zzr93 opened 1 year ago

zzr93 commented 1 year ago

Following README.md, I deployed hai-platform and installed hai-cli successfully. Running "hai-cli init" with my token and URL also succeeded. However, when I ran "hai-cli python /haidata/hai-platform/workspace/haiadmin/test.py -- -n 1", an unexpected error occurred. Here is the message:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hfai/client/api/api_utils.py", line 101, in async_requests
    result = json.loads(result)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/hai-cli", line 9, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/dist-packages/asyncclick/core.py", line 1159, in __call__
    return anyio.run(self._main, main, args, kwargs, **({"backend":_anyio_backend} if _anyio_backend is not None else {}))
  File "/usr/local/lib/python3.8/dist-packages/anyio/_core/_eventloop.py", line 68, in run
    return asynclib.run(func, *args, **backend_options)
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 204, in run
    return native_run(wrapper(), debug=debug)
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 199, in wrapper
    return await func(*args)
  File "/usr/local/lib/python3.8/dist-packages/asyncclick/core.py", line 1162, in _main
    return await main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/asyncclick/core.py", line 1083, in main
    rv = await self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/asyncclick/core.py", line 1693, in invoke
    return await _process_result(await sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/dist-packages/asyncclick/core.py", line 1429, in invoke
    return await ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/asyncclick/core.py", line 783, in invoke
    rv = await rv
  File "/usr/local/lib/python3.8/dist-packages/hfai/client/commands/hfai_python.py", line 294, in python
    await func_python_cluster(experiment_py, experiment_args, name, nodes, priority, group, image, environments,
  File "/usr/local/lib/python3.8/dist-packages/hfai/client/commands/hfai_python.py", line 255, in func_python_cluster
    await hfai_experiment.run.callback(config, follow, None, None, None)
  File "/usr/local/lib/python3.8/dist-packages/hfai/client/commands/hfai_experiment.py", line 167, in run
    experiment = await create_experiment(experiment_yml)
  File "/usr/local/lib/python3.8/dist-packages/hfai/client/api/experiment_api.py", line 444, in create_experiment
    result = await async_requests(RequestMethod.POST, url=f'{mars_url()}/operating/task/create?token={token}',
  File "/usr/local/lib/python3.8/dist-packages/hfai/client/api/api_utils.py", line 116, in async_requests
    raise Exception(f'请求失败: [exception: {str(e)}] [result: {result}]')
Exception: 请求失败: [exception: Expecting value: line 1 column 1 (char 0)] [result: Not Found]

It seems that the server returns a 404 to the client for the task-create URL, "{mars_url()}/operating/task/create?token={token}". I have no idea why this would happen.

I can provide further information if needed. I am sure the token and URL are correct, since init succeeds. I am also sure test.py exists on the shared_filesystem; otherwise hai-cli would report a different error.
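
The chained traceback above is consistent with the client trying to decode a plain-text response body as JSON. The sketch below reproduces only that decoding step, assuming the body is literally `Not Found` as the `[result: ...]` field suggests. `parse_response` is a hypothetical helper written for illustration, not part of hfai.

```python
import json


def parse_response(body: str) -> object:
    """Hypothetical helper: decode a response body, surfacing non-JSON bodies."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as e:
        # Include the raw body so a plain-text error page is visible to the caller.
        raise RuntimeError(f"request failed: [exception: {e}] [result: {body}]") from e


if __name__ == "__main__":
    try:
        parse_response("Not Found")
    except RuntimeError as err:
        print(err)
```

The decode error itself says nothing about why the server answered with `Not Found`; it only shows that the response was not JSON.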

wenjun93 commented 1 year ago

A 404 means the provided token doesn't exist. Could you please connect to the DB with PGPASSWORD=${PG_PASSWORD} psql -h ${DB_IP} -p 5432 -U root mars_db?
DB_IP is the hai-platform container/service IP, or your customized DB IP if configured; PG_PASSWORD is "root" by default. Check the output of select * from "users". I suspect the token doesn't exist in the table; in that case, please run hai-cli init again with the correct token from the DB.

zzr93 commented 1 year ago

I couldn't find a table named "users", but I found "user" and "user_access_token", which may be related to this situation. I tried both tokens as shown below, and hai-cli accepts only the access_token (which I used last week). So it seems I have already run init with the correct token. Are there any other possible reasons?

mars_db=# select * from "user";
 user_id | user_name | nick_name |        token         |   role   | active |       last_activity        | shared_group 
---------+-----------+-----------+----------------------+----------+--------+----------------------------+--------------
   10020 | haiadmin  | haiadmin  | haiadmin             | internal | t      | 2023-07-13 19:06:12.531952 | hfai
   10000 | bff_admin | bff_admin | a69a81ca18b2712fc631 | internal | t      | 2023-07-13 19:06:12.531952 | hfai
(2 rows)

mars_db=# select * from "user_access_token";
 from_user_name | access_user_name |                                  access_token                                  | access_scope |      expire_at      |         created_at         |         updated_at         | created_by | deleted_by | active 
----------------+------------------+--------------------------------------------------------------------------------+--------------+---------------------+----------------------------+----------------------------+------------+------------+--------
 bff_admin      | bff_admin        | ACCESS-6255665f61646d696e236266665f61646d696e-ej5ZEZpxLNQzUiD3TBa1R26qknIwhi-F | all          | 3000-01-01 00:00:00 | 2023-07-13 19:10:46.015107 | 2023-07-13 20:48:55.507183 | bff_admin  |            | t
 haiadmin       | haiadmin         | ACCESS-68516961646d696e2368616961646d696e-E0lGXwIswnn0HpbXAW_tVRjga1wRjD0u     | all          | 3000-01-01 00:00:00 | 2023-07-13 19:10:41.729162 | 2023-07-19 09:59:28.134549 | haiadmin   |            | t
(2 rows)

mars_db=# \q
root@hai-platform-0:/# exit
root@xxx-node1:~# hai-cli init haiadmin --url http://xxx.com
发现原始 token,向 server 端申请注册 access token
向 server 端申请注册 access token 失败,保存原始 token
初始化成功, 目标配置 /root/.hfai/conf.yml, 配置如下: 
token: haiadmin
root@xxx-node1:~# hai-cli init ACCESS-68516961646d696e2368616961646d696e-E0lGXwIswnn0HpbXAW_tVRjga1wRjD0u --url http://xxx.com
初始化成功, 目标配置 /root/.hfai/conf.yml, 配置如下: 
token: ACCESS-68516961646d696e2368616961646d696e-E0lGXwIswnn0HpbXAW_tVRjga1wRjD0u

wenjun93 commented 1 year ago

The requests are sent to haproxy, with the operating server as the backend. Could you please also check the log file at {HAI_PLATFORM_PATH}/log/operating_0.log for anything abnormal, for example requests not reaching the backend or the server reporting exceptions?
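
One way to do that check is to filter the log for ERROR entries. This is a minimal sketch, assuming the log uses the `timestamp | LEVEL | ... | message` layout seen in the excerpt posted later in this thread. `error_lines` is a hypothetical helper, and reading `HAI_PLATFORM_PATH` from the environment is also an assumption.

```python
import os
from pathlib import Path
from typing import List


def error_lines(log_text: str) -> List[str]:
    """Return the log lines whose second '|'-separated field is ERROR."""
    hits = []
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[1] == "ERROR":
            hits.append(line)
    return hits


if __name__ == "__main__":
    # Assumption: HAI_PLATFORM_PATH is exported in the shell; falls back to ".".
    base = os.environ.get("HAI_PLATFORM_PATH", ".")
    log_file = Path(base) / "log" / "operating_0.log"
    if log_file.exists():
        for line in error_lines(log_file.read_text(errors="replace")):
            print(line)
    else:
        print(f"log file not found: {log_file}")
```

Traceback continuation lines have no level field and are skipped, so the output lists only the headline errors. Open the log itself for the full stack traces.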

yolunghiu commented 5 months ago

Hello, I encountered the same issue when submitting a task. Have you solved it? @wenjun93 @zzr93

yolunghiu commented 5 months ago

Log from {HAI_PLATFORM_PATH}/log/operating_0.log:

2024-04-10 15:02:36.579 | ERROR    |  | SpawnProcess-1 | [UserData] 订阅表 [user_with_all_groups] 失败
2024-04-10 15:02:36.579 | ERROR    |  | SpawnProcess-1 | [UserData] Reload table user_all_groups failed!
2024-04-10 15:02:36.579 | ERROR    |  | (psycopg2.OperationalError) connection to server at "127.0.0.1", port 15432 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?

(Background on this error at: https://sqlalche.me/e/14/e3q8)
Traceback (most recent call last):

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 3280, in _wrap_pool_connect
    return fn()
           └ <bound method Pool.connect of <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
           │                │         └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>
           │                └ <classmethod object at 0x7f615c825bb0>
           └ <class 'sqlalchemy.pool.base._ConnectionFairy'>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
            │                 │        └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>
            │                 └ <classmethod object at 0x7f615c825b50>
            └ <class 'sqlalchemy.pool.base._ConnectionRecord'>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 476, in checkout
    rec = pool._do_get()
          │    └ <function QueuePool._do_get at 0x7f615c8418b0>
          └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/impl.py", line 146, in _do_get
    self._dec_overflow()
    │    └ <function QueuePool._dec_overflow at 0x7f615c8419d0>
    └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
    │      └ <function raise_ at 0x7f615d066d30>
    └ <module 'sqlalchemy.util.compat' from '/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py'>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
           │    └ <function Pool._create_connection at 0x7f615c81dd30>
           └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 256, in _create_connection
    return _ConnectionRecord(self)
           │                 └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>
           └ <class 'sqlalchemy.pool.base._ConnectionRecord'>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 371, in __init__
    self.__connect()
    └ <sqlalchemy.pool.base._ConnectionRecord object at 0x7f6154a86e50>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 666, in __connect
    pool.logger.debug("Error on connect(): %s", e)
    │    │      └ <function Logger.debug at 0x7f615f315160>
    │    └ <Logger sqlalchemy.pool.impl.QueuePool (WARNING)>
    └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
    │      └ <function raise_ at 0x7f615d066d30>
    └ <module 'sqlalchemy.util.compat' from '/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py'>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
    │    │                               │    │               └ <sqlalchemy.pool.base._ConnectionRecord object at 0x7f6154a86e50>
    │    │                               │    └ <function create_engine.<locals>.connect at 0x7f615c2ba3a0>
    │    │                               └ <sqlalchemy.pool.impl.QueuePool object at 0x7f615aa558b0>
    │    └ None
    └ <sqlalchemy.pool.base._ConnectionRecord object at 0x7f6154a86e50>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
           │       │        │        └ {'host': '127.0.0.1', 'database': 'mars_db', 'user': 'root', 'password': 'root', 'port': 15432, 'application_name': 'multi-se...
           │       │        └ []
           │       └ <function DefaultDialect.connect at 0x7f615c5b4ee0>
           └ <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 object at 0x7f615c245a30>

  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 597, in connect
    return self.dbapi.connect(*cargs, **cparams)
           │    │     │        │        └ {'host': '127.0.0.1', 'database': 'mars_db', 'user': 'root', 'password': 'root', 'port': 15432, 'application_name': 'multi-se...
           │    │     │        └ ()
           │    │     └ <function connect at 0x7f615c1d1310>
           │    └ <module 'psycopg2' from '/usr/local/lib/python3.8/dist-packages/psycopg2/__init__.py'>
           └ <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 object at 0x7f615c245a30>

  File "/usr/local/lib/python3.8/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
           │        │                       │                     └ {}
           │        │                       └ None
           │        └ 'host=127.0.0.1 user=root password=root port=15432 application_name=multi-server-server dbname=mars_db'
           └ <built-in function _connect>

psycopg2.OperationalError: connection to server at "127.0.0.1", port 15432 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?

The above exception was the direct cause of the following exception:
...
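
The `Connection refused` error above means nothing was accepting TCP connections on 127.0.0.1:15432 when the operating server tried to reach the database. The sketch below is a quick reachability check with the host and port copied from that log. `port_open` is a hypothetical helper. It only tests that the port accepts connections, not that PostgreSQL is healthy behind it.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port taken from the psycopg2 error in the log above.
    host, port = "127.0.0.1", 15432
    state = "accepting connections" if port_open(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

Run it from the same container or host as the operating server, since 127.0.0.1 refers to whichever machine executes the check.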

yolunghiu commented 5 months ago

Problem solved, see https://github.com/HFAiLab/hai-platform/issues/12#issuecomment-2049179483