Derecho-Project / derecho

The main code repository for the Derecho project.
BSD 3-Clause "New" or "Revised" License

Stuck in group construction. #231

Open Steamgjk opened 2 years ago

Steamgjk commented 2 years ago

I am trying to run simple_replicated_objects, so I created 3 VMs in Google Cloud. I am using the derecho.cfg in the demos/json_cfgs path. Here are my modifications:
(1) Changed local ip to each VM's own IP; the local_ids are 0, 1, and 2 respectively
(2) Changed leader ip to the IP of VM-0 (local_id=0 is the leader)
(3) provider = sockets
(4) domain = ens4 (this is the NIC name on all VMs)

[image]

However, after I launch the 3 VMs, they get stuck constructing the groups. Below is the leader VM's console log.
We can see that the other 2 VMs have successfully connected to the leader, so the IP-related settings should be correct. I am not sure what is wrong with the config (I suspect the json_layout, but I am not certain). I attach the three cfg files for reference, and would really appreciate it if you could provide some help. Thanks!

[image]

derecho-0(leader).cfg.txt derecho-1.cfg.txt derecho-2.cfg.txt

songweijia commented 2 years ago

Hi Steamgjk, thank you for trying out Derecho. After checking your configuration files, I found two issues. The major issue is the json_layout configuration: your current configuration needs 6 nodes to start the service, which is why the system keeps waiting after you start three. If you don't mind the Bar and Foo subgroups overlapping, you can use the following:

json_layout = '
[
    {
        "type_alias":   "Foo",
        "layout":       [
                            {
                                "min_nodes_by_shard": ["3"],
                                "max_nodes_by_shard": ["3"],
                                "reserved_node_ids_by_shard":[["0","1","2"]],
                                "delivery_modes_by_shard": ["Ordered"],
                                "profiles_by_shard": ["VCS"]
                            }
                        ]
    },
    {
        "type_alias":   "Bar",
        "layout":       [
                            {
                                "min_nodes_by_shard": ["3"],
                                "max_nodes_by_shard": ["3"],
                                "reserved_node_ids_by_shard":[["0","1","2"]],
                                "delivery_modes_by_shard": ["Ordered"],
                                "profiles_by_shard": ["DEFAULT"]
                            }
                        ]
    }
]'

The above setting will enforce the overlapping of the Foo and Bar subgroups.
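As a rough illustration of why reserving the same node ids for both subgroups lowers the node count, here is a minimal Python sketch that estimates how many distinct nodes a json_layout needs. It is not Derecho's allocation algorithm. It assumes that a node is shared between shards only when its id is reserved in both, and that every unreserved slot needs its own node. The function names `min_nodes_needed` and `make_layout` are hypothetical, and the non-overlapping layout is an invented example, not the poster's actual file.

```python
import json


def min_nodes_needed(layout_json: str) -> int:
    """Estimate the distinct nodes needed to satisfy every shard's minimum.

    Assumption (not taken from Derecho's source): shards share a node only
    when that node id is reserved in both, and each unreserved slot needs
    a separate node.
    """
    reserved = set()  # node ids pinned to at least one shard
    unpinned = 0      # slots that must be filled by nodes not named anywhere
    for subgroup in json.loads(layout_json):
        for policy in subgroup["layout"]:
            mins = [int(n) for n in policy["min_nodes_by_shard"]]
            pinned = list(policy.get("reserved_node_ids_by_shard", []))
            # Pad so that every shard has a (possibly empty) reserved list.
            pinned += [[] for _ in range(len(mins) - len(pinned))]
            for shard_min, ids in zip(mins, pinned):
                reserved.update(ids)
                unpinned += max(0, shard_min - len(ids))
    return len(reserved) + unpinned


def make_layout(reserved_ids):
    """Build a two-subgroup (Foo, Bar) layout with one 3-node shard each."""
    policy = {"min_nodes_by_shard": ["3"], "max_nodes_by_shard": ["3"]}
    if reserved_ids is not None:
        policy["reserved_node_ids_by_shard"] = [reserved_ids]
    return json.dumps(
        [{"type_alias": name, "layout": [policy]} for name in ("Foo", "Bar")]
    )


no_overlap = make_layout(None)             # hypothetical: nothing reserved
overlap = make_layout(["0", "1", "2"])     # same ids reserved for Foo and Bar

print(min_nodes_needed(no_overlap))  # 6
print(min_nodes_needed(overlap))     # 3
```

The function takes only the JSON text, that is, the part between the single quotes of the json_layout entry in derecho.cfg.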

The minor issue is the provider setting. As libfabric is deprecating the sockets provider, we suggest using the tcp provider.

Steamgjk commented 2 years ago

Hi, @songweijia

It seems it still does not work on my 3-VM cluster. I updated the 3 cfg files to address the 2 issues (see the attached zip file), but it is still stuck there.

[image] [image] [image]

cfg.zip

songweijia commented 2 years ago

I just realized that you are using simple_replicated_objects instead of simple_replicated_objects_json. The former specifies its layout with the programmable DefaultSubgroupAllocation API, and that layout needs 6 nodes. It will NOT use the json layout configuration in that case. To use the json layout, you have to run the latter.

So you can either try with 6 nodes, or use the json layout I suggested with simple_replicated_objects_json.

Steamgjk commented 2 years ago

simple_replicated_objects_json

@songweijia With simple_replicated_objects_json, the group construction no longer blocks. But when I start node-0 (the leader) and node-1 and then start node-2, node-1 crashes.

I then narrowed it down by commenting out code step by step, and found that the foo part is okay. After I comment out the bar part https://github.com/Derecho-Project/derecho/blob/724a1db550d25052719eac55d8a685fc1c07d603/src/applications/demos/simple_replicated_objects_json.cpp#L95-L129

then the cluster can run and I can see the printed logs "Node says...".

Then I comment out the foo part instead https://github.com/Derecho-Project/derecho/blob/724a1db550d25052719eac55d8a685fc1c07d603/src/applications/demos/simple_replicated_objects_json.cpp#L59-L93

and keep only the bar part.

This time the problem comes back: after I launch node-0 and node-1 and then launch node-2, node-1 crashes.

Next, I also comment out https://github.com/Derecho-Project/derecho/blob/724a1db550d25052719eac55d8a685fc1c07d603/src/applications/demos/simple_replicated_objects_json.cpp#L109-L128

so this time only node-0 does void_future, and node-1 and node-2 do not read. With that change the 3 nodes are fine.

But if I only comment out https://github.com/Derecho-Project/derecho/blob/724a1db550d25052719eac55d8a685fc1c07d603/src/applications/demos/simple_replicated_objects_json.cpp#L119-L128

so that node-0 does void_future and node-1 reads, then node-1 still crashes after the three nodes finish constructing the group.

node-1-log.txt