PlatformLab / grpc_homa

Allows Homa to be used as a transport with gRPC.

couldn't connect remote grpc_homa server #13

Closed. mikehb closed this issue 10 months ago.

mikehb commented 10 months ago

homa.tar.gz This client could connect to a local Homa server, but couldn't connect to a remote Homa server. Meanwhile, it could connect to a remote TCP server.

Ubuntu 22.04.2 LTS with linux kernel 6.1.40-060140

run grpc_homa server on 10.212.155.216

root@hb:~/homa# ./bazel-bin/server -homa 10.212.155.216:4000
0
1
2
3
4
5
6

run grpc_homa client on 10.212.155.195

root@hb1:~/homa# ./client  -homa 10.212.155.216:4000
D1016 11:29:28.354833606    6747 config.cc:161]                        gRPC EXPERIMENT tcp_frame_size_tuning               OFF (default:OFF)
D1016 11:29:28.354932976    6747 config.cc:161]                        gRPC EXPERIMENT tcp_rcv_lowat                       OFF (default:OFF)
D1016 11:29:28.354951680    6747 config.cc:161]                        gRPC EXPERIMENT peer_state_based_framing            OFF (default:OFF)
D1016 11:29:28.354958266    6747 config.cc:161]                        gRPC EXPERIMENT memory_pressure_controller          OFF (default:OFF)
D1016 11:29:28.354964563    6747 config.cc:161]                        gRPC EXPERIMENT unconstrained_max_quota_buffer_size OFF (default:OFF)
D1016 11:29:28.354975116    6747 config.cc:161]                        gRPC EXPERIMENT event_engine_client                 OFF (default:OFF)
D1016 11:29:28.354981616    6747 config.cc:161]                        gRPC EXPERIMENT monitoring_experiment               ON  (default:ON)
D1016 11:29:28.354988091    6747 config.cc:161]                        gRPC EXPERIMENT promise_based_client_call           OFF (default:OFF)
D1016 11:29:28.354994569    6747 config.cc:161]                        gRPC EXPERIMENT free_large_allocator                OFF (default:OFF)
D1016 11:29:28.355004387    6747 config.cc:161]                        gRPC EXPERIMENT promise_based_server_call           OFF (default:OFF)
D1016 11:29:28.355014333    6747 config.cc:161]                        gRPC EXPERIMENT transport_supplies_client_latency   OFF (default:OFF)
D1016 11:29:28.355024255    6747 config.cc:161]                        gRPC EXPERIMENT event_engine_listener               OFF (default:OFF)
D1016 11:29:28.355030982    6747 config.cc:161]                        gRPC EXPERIMENT schedule_cancellation_over_write    OFF (default:OFF)
D1016 11:29:28.355040797    6747 config.cc:161]                        gRPC EXPERIMENT trace_record_callops                OFF (default:OFF)
D1016 11:29:28.355050833    6747 config.cc:161]                        gRPC EXPERIMENT event_engine_dns                    OFF (default:OFF)
D1016 11:29:28.355057329    6747 config.cc:161]                        gRPC EXPERIMENT work_stealing                       OFF (default:OFF)
D1016 11:29:28.355068054    6747 config.cc:161]                        gRPC EXPERIMENT client_privacy                      OFF (default:OFF)
D1016 11:29:28.355077794    6747 config.cc:161]                        gRPC EXPERIMENT canary_client_privacy               OFF (default:OFF)
D1016 11:29:28.355087643    6747 config.cc:161]                        gRPC EXPERIMENT server_privacy                      OFF (default:OFF)
D1016 11:29:28.355093809    6747 config.cc:161]                        gRPC EXPERIMENT unique_metadata_strings             OFF (default:OFF)
D1016 11:29:28.355103835    6747 config.cc:161]                        gRPC EXPERIMENT keepalive_fix                       OFF (default:OFF)
I1016 11:29:28.355461345    6747 ev_epoll1_linux.cc:123]               grpc epoll fd: 3
D1016 11:29:28.355489210    6747 ev_posix.cc:113]                      Using polling engine: epoll1
D1016 11:29:28.356058154    6747 lb_policy_registry.cc:47]             registering LB policy factory for "priority_experimental"
D1016 11:29:28.356082309    6747 lb_policy_registry.cc:47]             registering LB policy factory for "outlier_detection_experimental"
D1016 11:29:28.356092726    6747 lb_policy_registry.cc:47]             registering LB policy factory for "weighted_target_experimental"
D1016 11:29:28.356107867    6747 lb_policy_registry.cc:47]             registering LB policy factory for "pick_first"
D1016 11:29:28.356117214    6747 lb_policy_registry.cc:47]             registering LB policy factory for "round_robin"
D1016 11:29:28.356126610    6747 lb_policy_registry.cc:47]             registering LB policy factory for "weighted_round_robin"
D1016 11:29:28.356156192    6747 lb_policy_registry.cc:47]             registering LB policy factory for "grpclb"
D1016 11:29:28.356188515    6747 dns_resolver_plugin.cc:44]            Using ares dns resolver
D1016 11:29:28.356220387    6747 lb_policy_registry.cc:47]             registering LB policy factory for "rls_experimental"
D1016 11:29:28.356267339    6747 lb_policy_registry.cc:47]             registering LB policy factory for "xds_cluster_manager_experimental"
D1016 11:29:28.356285507    6747 lb_policy_registry.cc:47]             registering LB policy factory for "xds_cluster_impl_experimental"
D1016 11:29:28.356294778    6747 lb_policy_registry.cc:47]             registering LB policy factory for "cds_experimental"
D1016 11:29:28.356307453    6747 lb_policy_registry.cc:47]             registering LB policy factory for "xds_cluster_resolver_experimental"
D1016 11:29:28.356316783    6747 lb_policy_registry.cc:47]             registering LB policy factory for "xds_override_host_experimental"
D1016 11:29:28.356329659    6747 lb_policy_registry.cc:47]             registering LB policy factory for "xds_wrr_locality_experimental"
D1016 11:29:28.356349026    6747 lb_policy_registry.cc:47]             registering LB policy factory for "ring_hash_experimental"
D1016 11:29:28.356361001    6747 certificate_provider_registry.cc:33]  registering certificate provider factory for "file_watcher"
I1016 11:29:28.357525374    6747 ev_epoll1_linux.cc:360]               grpc epoll fd: 5
I1016 11:29:28.359125165    6747 homa_client.cc:299]                   HomaClient::perform_op invoked with start_connectivity_watch
I1016 11:29:28.359393580    6752 homa_stream.cc:304]                   Outgoing metadata: key :path, value /rtk.Rtk/WatchTextMessage
I1016 11:29:28.359488551    6752 homa_stream.cc:304]                   Outgoing metadata: key :authority, value 10.212.155.216:4000
I1016 11:29:28.359706082    6752 homa_stream.cc:177]                   Sent Homa request to 10.212.155.216:4000, stream id 1, sequence 1 with homaId 306, 75 initial metadata bytes, 0 payload bytes, 0 trailing metadata bytes
E1016 11:29:28.418684716    6756 homa_incoming.cc:185]                 Error in recvmsg (homaId 306): Connection timed out
I1016 11:29:28.419125402    6756 homa_stream.cc:668]                   Recording error for stream id 1: UNKNOWN: Connection timed out [type.googleapis.com/grpc.status.str.file='external/grpc_homa/homa_incoming.cc'] [type.googleapis.com/grpc.status.int.file_line='187'] [type.googleapis.com/grpc.status.time.created_time='2023-10-16T03:29:28.418917266+00:00'] [type.googleapis.com/grpc.status.int.errno='110'] [type.googleapis.com/grpc.status.str.os_error='Connection timed out'] [type.googleapis.com/grpc.status.str.syscall='recvmsg']
I1016 11:29:28.419500842    6756 homa_stream.cc:696]                   Sending peer cancellation for RPC id 1
I1016 11:29:28.419630327    6756 homa_stream.cc:177]                   Sent Homa request to 10.212.155.216:4000, stream id 1, sequence 2 with homaId 308, 0 initial metadata bytes, 0 payload bytes, 0 trailing metadata bytes
I1016 11:29:28.419758252    6756 homa_stream.cc:668]                   Recording error for stream id 1: UNKNOWN: Connection timed out [type.googleapis.com/grpc.status.str.file='external/grpc_homa/homa_incoming.cc'] [type.googleapis.com/grpc.status.int.file_line='187'] [type.googleapis.com/grpc.status.time.created_time='2023-10-16T03:29:28.418917266+00:00'] [type.googleapis.com/grpc.status.int.errno='110'] [type.googleapis.com/grpc.status.str.os_error='Connection timed out'] [type.googleapis.com/grpc.status.str.syscall='recvmsg']
I1016 11:29:28.419879154    6756 homa_stream.cc:668]                   Recording error for stream id 1: UNKNOWN: Connection timed out [type.googleapis.com/grpc.status.str.file='external/grpc_homa/homa_incoming.cc'] [type.googleapis.com/grpc.status.int.file_line='187'] [type.googleapis.com/grpc.status.time.created_time='2023-10-16T03:29:28.418917266+00:00'] [type.googleapis.com/grpc.status.int.errno='110'] [type.googleapis.com/grpc.status.str.os_error='Connection timed out'] [type.googleapis.com/grpc.status.str.syscall='recvmsg']
I1016 11:29:28.420490416    6756 homa_stream.cc:668]                   Recording error for stream id 1: UNKNOWN: Connection timed out [type.googleapis.com/grpc.status.str.file='external/grpc_homa/homa_incoming.cc'] [type.googleapis.com/grpc.status.int.file_line='187'] [type.googleapis.com/grpc.status.time.created_time='2023-10-16T03:29:28.418917266+00:00'] [type.googleapis.com/grpc.status.int.errno='110'] [type.googleapis.com/grpc.status.str.os_error='Connection timed out'] [type.googleapis.com/grpc.status.str.syscall='recvmsg']
E1016 11:29:28.479299971    6756 homa_incoming.cc:185]                 Error in recvmsg (homaId 308): Connection timed out
mikehb commented 10 months ago

HomaModule's homa_test couldn't connect to the remote server either.

root@hb1:~/HomaModule/util# ./homa_test 10.212.155.216:4000 stream
Count too large; reducing from 1000 to 100
Error in recvmsg: Connection timed out
root@hb1:~/HomaModule/util#

root@hb:~/HomaModule/util# ./server

root@hb:~/HomaModule/util# netstat -nap | grep server
Active Internet connections (servers and established)
tcp        0      0 0.0.0.0:4000            0.0.0.0:*               LISTEN      110254/./server
Active UNIX domain sockets (servers and established)
root@hb:~/HomaModule/util#
johnousterhout commented 10 months ago

Strange: all of these tests work fine for me, both locally and remotely. I wonder if there might be a problem delivering Homa packets over your network (e.g., perhaps they are getting dropped by the switch, or perhaps Homa is trying to use TSO but your NICs refuse to do TSO for Homa packets and just drop them). How about trying a simple Homa test?
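
One way to probe the packet-drop and TSO hypotheses from the shell (a sketch using only generic Linux tools; the interface name ens3 is taken from later in this thread and may differ on your machines):

# Which segmentation offloads does the NIC advertise?
ethtool -k ens3 | grep -E 'segmentation|scatter'

# Any drops reported by the driver or the kernel?
# (some virtual NICs report few or no driver statistics)
ethtool -S ens3 | grep -i drop
ip -s link show ens3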

-John-

mikehb commented 10 months ago

I executed cp_node server/client on the same machine, but it prints the following error. Is cp_node using a fixed node name? Could cp_node add an option to take a server IP address?

root@hb:~/HomaModule/util# hostname
hb
root@hb:~/HomaModule/util# ./cp_node client
1697682117.885167844 FATAL: couldn't look up address for node1: Temporary failure in name resolution
root@hb:~/HomaModule/util#
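
For reference, cp_node looks peers up by names of the form node%d (see the commented-out snprintf in the next comment), so instead of patching the source the names can be mapped in /etc/hosts on both machines. The node numbers below are assumptions; they depend on how cp_node assigns ids, and the local hostnames may also need to follow the nodeN convention:

# as root on both hb and hb1 (adjust the numbering to match cp_node's ids):
echo "10.212.155.195 node0" >> /etc/hosts
echo "10.212.155.216 node1" >> /etc/hosts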
mikehb commented 10 months ago

I modified the code to use the server's IP, and cp_node still prints "Connection timed out".

/**
 * init_server_addrs() - Set up the server_addrs table (addresses of the
 * server/port combinations that clients will communicate with), based on
 * current configuration parameters. Any previous contents of the table
 * are discarded. This also initializes related arrays @server_ids and
 * @freeze.
 */
void init_server_addrs(void)
{
        server_addrs.clear();
        server_conns.clear();
        freeze.clear();
        first_id.clear();
        for (int node: server_ids) {
                char host[100];
                struct addrinfo hints;
                struct addrinfo *matching_addresses;
                sockaddr_in_union *dest;

                if (node == id)
                        continue;
                //snprintf(host, sizeof(host), "node%d", node);
                snprintf(host, sizeof(host), "%s", "10.212.155.216");
                memset(&hints, 0, sizeof(struct addrinfo));
                hints.ai_family = inet_family;
                hints.ai_socktype = SOCK_DGRAM;
                int status = getaddrinfo(host, NULL, &hints,
                                &matching_addresses);
                if (status != 0) {
                        log(NORMAL, "FATAL: couldn't look up address "
                                        "for %s: %s\n",
                                        host, gai_strerror(status));
                        exit(1);
                }
                dest = reinterpret_cast<sockaddr_in_union *>
                                (matching_addresses->ai_addr);
                while (((int) first_id.size()) < node)
                        first_id.push_back(-1);
                first_id.push_back((int) server_addrs.size());
                for (int thread = 0; thread < server_ports; thread++) {
                        dest->in4.sin_port = htons(first_port + thread);
                        server_addrs.push_back(*dest);
                        server_conns.emplace_back(node, thread, id, 0);
                }
                while (((int) freeze.size()) <= node)
                        freeze.push_back(0);
                freeaddrinfo(matching_addresses);
        }
}
root@hb1:~/HomaModule/util# ./cp_node client
1697683451.762402699 Average message length 0.1 KB, rate 0.00 K/sec, expected BW 0.0 Gbps
1697683451.825312410 FATAL: error in recvmsg: Connection timed out (id 0, server Unknown family 0)
root@hb1:~/HomaModule/util#
mikehb commented 10 months ago

"ideally while the test is still running and failing", the cp_node client exist when connection failed node_client.txt node_server.txt

johnousterhout commented 10 months ago

I took a look at the traces but unfortunately the client trace is completely empty. That doesn't make sense to me: at the least there should be invocations of Homa to send a message, timer firings, and so on. Can you try to collect it again? The server trace is nearly empty, but plausible: the application invokes recvmsg but no packets seem to arrive.

-John-

mikehb commented 10 months ago
root@hb1:~/HomaModule# lsmod | grep homa
homa                  294912  0
root@hb1:~/HomaModule# cd util/
root@hb1:~/HomaModule/util# ./cp_node client
1697697036.039309767 Average message length 0.1 KB, rate 0.00 K/sec, expected BW 0.0 Gbps
1697697036.105974125 FATAL: error in recvmsg: Connection timed out (id 0, server Unknown family 0)
root@hb1:~/HomaModule/util# ll /proc/time
timer_list  timetrace
root@hb1:~/HomaModule/util# cat /proc/timetrace
cat: /proc/timetrace: Bad address
root@hb1:~/HomaModule/util# ./ttprint.py
root@hb1:~/HomaModule/util#
mikehb commented 10 months ago

node_client.txt node_server.txt

Execute the following loop in one terminal:

for (( ;; )); do ./cp_node client; sleep 1; done

and capture the trace in another terminal:

./ttprint.py > node_client.txt

The ./ttprint.py output is empty since there is no client connection. I got the above node_server.txt by executing './cp_node client' on the same host.

mikehb commented 10 months ago
root@hb:~# ethtool -k ens3 | grep -i tcp-segmentation-offload
tcp-segmentation-offload: on
root@hb:~#
johnousterhout commented 10 months ago

I'm not sure I understand what's happening with your attempts to get client-side timetraces.

First, reading the timetrace will clear the buffer, so if you invoke "cat /proc/timetrace" and then invoke ttprint.py, ttprint.py will return nothing unless there has been additional Homa activity since the "cat /proc/timetrace".
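
In other words, the capture sequence matters: generate the Homa traffic first, then read the trace exactly once. A minimal sketch using only the commands already shown in this thread:

# terminal 1 on the client machine: generate the (failing) Homa traffic
./cp_node client

# terminal 2, right afterwards: read the timetrace once
# (skip 'cat /proc/timetrace' beforehand, since reading empties the buffer)
./ttprint.py > node_client.txt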

I don't understand your comment "./ttprint.py is empty since there is not client connection": if you ran "cp_node client" a few times, which appears to be the case, there should be buffered timetrace records (the buffer is only cleared when the timetrace is read).

From the latest node_client.txt file and the node_server.txt file you sent yesterday, it seems that Homa packets are being transmitted successfully but they are not being received. This suggests that they are getting dropped somewhere in the network. What is the environment in which you are running your experiments? Do you control the switch? I haven't tried running Homa in cloud providers like AWS; if you are running there, perhaps the cloud provider refuses to transmit packets that use unknown protocols?
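
A direct way to see where the packets disappear is to watch for Homa's IP protocol number (0xFD, i.e. 253, as mentioned below) on both ends while the client runs; the interface name here is an assumption:

# on the client machine:
tcpdump -ni ens3 'ip proto 253'

# on the server machine, at the same time:
tcpdump -ni ens3 'ip proto 253'

If the packets appear on the client but never on the server, something in between (switch, hypervisor, or cloud network filter) is dropping them.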

Since this test only sends small packets, there shouldn't be any issues with TCP segmentation offload.

-John-

mikehb commented 10 months ago

I am using Ubuntu running on OpenStack.

johnousterhout commented 10 months ago

I use Ubuntu, so I'm pretty sure that's not the problem. I've never used OpenStack; is there some way you can find out how it deals with exotic network protocols? (Homa uses IP protocol 0xFD, which is defined as "Use for experimentation and testing".) You could also try changing the definition of IPPROTO_HOMA in homa.h to see if different values work (I don't have any suggestions for other values to try, though).
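
For concreteness, a sketch of what that change involves; the build and reload commands are assumptions based on a standard out-of-tree module workflow, and must be repeated on both machines so client and server agree on the value:

# homa.h contains a definition along the lines of:
#   #define IPPROTO_HOMA 0xFD
# After editing it, rebuild and reload the module
# (assuming the build produces homa.ko, matching the "homa" module in lsmod):
make
rmmod homa
insmod homa.ko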

-John-

mikehb commented 10 months ago

I switched to VMware Workstation 17 and it works for me: the cp_node client can connect to a remote cp_node server. Maybe something is wrong with OpenStack; let me investigate further.

mikehb commented 10 months ago

The network administrator disabled the network filter, and I can connect to the remote grpc_homa server now.
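
For anyone hitting the same thing on OpenStack: one plausible culprit is Neutron port security / security groups, whose ingress rules are keyed by IP protocol, so an unfamiliar protocol like Homa's 253 can be silently dropped. The diagnosis is a guess, and the group name and port id below are placeholders; check with your administrator before loosening anything:

# Option 1: allow IP protocol 253 (Homa) through the security group
openstack security group rule create --ingress --protocol 253 my-secgroup

# Option 2: disable port security on the instance's port entirely
openstack port set --no-security-group --disable-port-security <port-id>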

johnousterhout commented 10 months ago

Excellent; glad to hear it!