FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

upgrade doc is out-of-date and has an issue in upgrade sql(1.8.0-1.9.0.sql) #880

Open wood-j opened 1 year ago

wood-j commented 1 year ago

Is your feature request related to a problem? Please describe. No. The version upgrade doc is out of date and may not work properly.

Describe the solution you'd like

I am trying to use the doc to upgrade my cluster (on k8s) from 1.8.0 to 1.9.2 while keeping all persistent data.

However, the cluster.yaml of 1.9.2 introduces some new config lines, such as:

# Computing : Eggroll, Spark, Spark_local
computing: Eggroll
# Federation: Eggroll(computing: Eggroll), Pulsar/RabbitMQ(computing: Spark/Spark_local)
federation: Eggroll
# Storage: Eggroll(computing: Eggroll), HDFS(computing: Spark), LocalFS(computing: Spark_local)
storage: Eggroll
# Algorithm: Basic, NN
algorithm: Basic
# Device: CPU, IPCL
device: CPU

The only change the doc's basic upgrade guide asks for is:

chartVersion: v1.8.0      ->   chartVersion: v1.9.2

The upgrade task won't be posted because those new config lines are missing.
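As a rough pre-flight check, a small script could flag which of the new top-level keys listed above are absent from an existing cluster.yaml before the upgrade is submitted. This is only a sketch: the function name `missing_top_level_keys` and the sample `name: fate-9999` line are hypothetical, the key list is taken from this issue rather than from the chart itself, and the parsing is a plain regex over top-level lines rather than a full YAML parser.

```python
import re

# Top-level keys this issue reports as new in the 1.9.2 cluster.yaml
# (assumption: the list comes from the issue text, not from the chart schema).
NEW_KEYS_1_9_2 = ("computing", "federation", "storage", "algorithm", "device")


def missing_top_level_keys(yaml_text, required=NEW_KEYS_1_9_2):
    """Return the required keys that do not appear as top-level keys."""
    present = set()
    for line in yaml_text.splitlines():
        # A top-level key starts at column 0; comment lines and indented
        # (nested) lines do not match and are skipped.
        match = re.match(r"^([A-Za-z_][\w-]*)\s*:", line)
        if match:
            present.add(match.group(1))
    return [key for key in required if key not in present]


# Hypothetical 1.8.0-style file where only chartVersion was bumped.
old_style = "name: fate-9999\nchartVersion: v1.9.2\n"
print(missing_top_level_keys(old_style))
```

An empty list would mean all five keys are present; it does not validate their values or any other change between chart versions.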

Describe alternatives you've considered

Maybe the doc needs to spell out what to update in cluster.yaml for each version.

Additional context

Also, 1.9.2 changed some pods from Deployments to StatefulSets, so the PVCs created by the chart have changed, and the persistent data (path) needs to be manually migrated to the new path too.

wood-j commented 1 year ago

By the way, if fum only does the 2 things the doc describes:

image

Rows in the MySQL table eggroll_meta.server_node are incorrect after the upgrade completes and the pods are up:

image

This can lead to the following error when running the toy test:

image

[ERROR] [2023-04-25 09:09:31,735] [202304250903258499330] [1940:139857564890944] - [task_executor._run_] [line:265]: processor in session meta is not valid: <ErSessionMeta(id=202304250903258499330_secure_add_example_0_0_host_10000, name=, status=NEW, tag=, processors=[***, len=2], options=[{'python.venv': '/data/projects/python/venv', 'eggroll.session.processors.per.node': '1', 'eggroll.session.deploy.mode': 'cluster', 'python.path': '/data/projects/fate/fate/python:$PYTHONPATH:/data/projects/fate/fate/python:/data/projects/fate/eggroll/python:/data/projects/fate/fateflow/python:/data/projects/fate/fate/python/fate_client', 'eggroll.rollpair.inmemory_output': 'True'}]) at 0x7f32fb65de50>
Traceback (most recent call last):
  File "/data/projects/fate/fateflow/python/fate_flow/worker/task_executor.py", line 148, in _run_
    sess.init_computing(computing_session_id=args.session_id, options=session_options)
  File "/data/projects/fate/fate/python/fate_arch/session/_session.py", line 118, in init_computing
    self._computing_session = CSession(
  File "/data/projects/fate/fate/python/fate_arch/computing/eggroll/_csession.py", line 38, in __init__
    self._rp_session = session_init(session_id=session_id, options=options)
  File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 42, in session_init
    er_session = ErSession(session_id=session_id, options=options)
  File "/data/projects/fate/eggroll/python/eggroll/core/session.py", line 199, in __init__
    self.__session_meta = self._cluster_manager_client.get_or_create_session(session_meta)
  File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 185, in get_or_create_session
    return self.__check_processors(
  File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 243, in __check_processors
    raise ValueError(f"processor in session meta is not valid: {session_meta}")
ValueError: processor in session meta is not valid: <ErSessionMeta(id=202304250903258499330_secure_add_example_0_0_host_10000, name=, status=NEW, tag=, processors=[***, len=2], options=[{'python.venv': '/data/projects/python/venv', 'eggroll.session.processors.per.node': '1', 'eggroll.session.deploy.mode': 'cluster', 'python.path': '/data/projects/fate/fate/python:$PYTHONPATH:/data/projects/fate/fate/python:/data/projects/fate/eggroll/python:/data/projects/fate/fateflow/python:/data/projects/fate/fate/python/fate_client', 'eggroll.rollpair.inmemory_output': 'True'}]) at 0x7f32fb65de50>

Correct rows should be:

image

For anyone trying to upgrade from 1.8.0 to 1.9.2 (1.9.0), the following SQL statements should be executed against eggroll_meta.server_node:

truncate table server_node;
INSERT INTO server_node (host, port, node_type, status) values ('clustermanager', '4670', 'CLUSTER_MANAGER', 'HEALTHY');
INSERT INTO server_node (host, port, node_type, status) values ('nodemanager-0.nodemanager', '4671', 'NODE_MANAGER', 'HEALTHY');
INSERT INTO server_node (host, port, node_type, status) values ('nodemanager-1.nodemanager', '4671', 'NODE_MANAGER', 'HEALTHY');

Idea from:

Maybe we should add the lines above to the upgrade SQL file shipped with the FATE release:

owlet42 commented 1 year ago

For the 1.8.0 -> 1.9.x upgrade, the host field indeed does not seem to be updated correctly. Thanks a lot for your suggestion, but it is not easy to implement in the upgrade script: it is hard to know how many nodemanagers there are during an upgrade, and with multiple nodemanagers the corresponding SQL script would need to change as well. In this case, I recommend running the SQL manually. The upgrade documentation will be updated later.
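For anyone running the SQL manually with a different number of nodemanagers, the statements could be generated rather than hand-edited. The sketch below is not part of KubeFATE; `server_node_sql` is a hypothetical helper, and it assumes the `nodemanager-<index>.nodemanager` host pattern and the ports 4670/4671 shown earlier in this thread hold for your deployment.

```python
def server_node_sql(nodemanager_count, cm_host="clustermanager",
                    cm_port=4670, nm_port=4671):
    """Build the server_node reset statements for a given nodemanager count.

    Assumption: nodemanager hosts follow the StatefulSet-style pattern
    'nodemanager-<index>.nodemanager' quoted in this issue.
    """
    if nodemanager_count < 1:
        raise ValueError("nodemanager_count must be at least 1")

    # One cluster manager row, then one row per nodemanager replica.
    rows = [(cm_host, cm_port, "CLUSTER_MANAGER")]
    rows += [
        (f"nodemanager-{index}.nodemanager", nm_port, "NODE_MANAGER")
        for index in range(nodemanager_count)
    ]

    statements = ["truncate table server_node;"]
    for host, port, node_type in rows:
        statements.append(
            "INSERT INTO server_node (host, port, node_type, status) "
            f"values ('{host}', '{port}', '{node_type}', 'HEALTHY');"
        )
    return "\n".join(statements)


# Two nodemanagers reproduces the statements posted above.
print(server_node_sql(2))
```

The replica count still has to be supplied by the operator, which matches the point that the upgrade tooling cannot easily discover it on its own.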