esl / MongooseIM

MongooseIM is Erlang Solutions' robust, scalable and efficient XMPP server, aimed at large installations. Specifically designed for enterprise purposes, it is fault-tolerant and can utilise the resources of multiple clustered machines.

-proto_dist inet6_tcp support? #4132

Open andywhite37 opened 7 months ago

andywhite37 commented 7 months ago

MongooseIM version: 6.1.0
Installed from: Docker
Erlang/OTP version: version packaged with MongooseIM 6.1.0

I posted a previous issue #4127 about ipv6 support with mongooseimctl, but I'm feeling like the problem runs deeper. I have the servers starting up and connecting to an RDBMS correctly, and I have been able to exchange messages with the server using an XMPP client (Adium). I've tried exercising the XMPP port (5222), WebSockets, GraphQL, and all of that seems to be working fine.

I have been struggling mightily to get MongooseIM clustering to work in an IPv6-based network in Kubernetes, both with Mnesia and with the new CETS support. I'm unfortunately not an Erlang developer, so I've been doing a lot of reading. My research has led me to add -proto_dist inet6_tcp to vm.args/vm.dist.args, but I haven't had much luck with this.

This is what I currently have in vm.dist.args (I actually have these lines duplicated in vm.args too, just in case there are contexts that only use one or other of the files):

-proto_dist inet6_tcp
-kernel inet_dist_listen_min 9100
-kernel inet_dist_listen_max 9110

When I inspect the tcp listeners on the containers, I see epmd listening on port 4369 on both the ipv4 and ipv6 interfaces. However, when the listener is started on port 9100, it's only on the ipv4 interface, and not ipv6.

root@mongooseim-1:/# ss -tnlp | sort
LISTEN 0      1024                [::1]:5551         [::]:*    users:(("beam.smp",pid=29,fd=36))
LISTEN 0      1024                [::1]:8088         [::]:*    users:(("beam.smp",pid=29,fd=34))
LISTEN 0      1024                    *:5222            *:*    users:(("beam.smp",pid=29,fd=31))
LISTEN 0      1024                    *:5269            *:*    users:(("beam.smp",pid=29,fd=39))
LISTEN 0      1024                    *:5280            *:*    users:(("beam.smp",pid=29,fd=32))
LISTEN 0      1024                    *:5285            *:*    users:(("beam.smp",pid=29,fd=33))
LISTEN 0      1024                    *:5541            *:*    users:(("beam.smp",pid=29,fd=37))
LISTEN 0      1024                    *:5561            *:*    users:(("beam.smp",pid=29,fd=38))
LISTEN 0      1024                    *:8089            *:*    users:(("beam.smp",pid=29,fd=35))
LISTEN 0      1024   [::ffff:127.0.0.1]:8888            *:*    users:(("beam.smp",pid=29,fd=40))
LISTEN 0      128               0.0.0.0:9100      0.0.0.0:*    users:(("beam.smp",pid=29,fd=17))
LISTEN 0      4096              0.0.0.0:4369      0.0.0.0:*    users:(("epmd",pid=58,fd=3))     
LISTEN 0      4096                 [::]:4369         [::]:*    users:(("epmd",pid=58,fd=4))
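
The comparison made by eye on the listing above can also be done mechanically. The following is a minimal sketch, not part of MongooseIM: the function name ipv4_only_ports is hypothetical, and it assumes that a bracketed address or a bare * in the local-address column marks an IPv6-capable socket, while an unbracketed dotted quad marks an IPv4-only one.

```shell
#!/bin/sh
# Reads `ss -tnl`-style output on stdin and prints each port in [min, max]
# that has an IPv4 listener but no IPv6-capable listener.
# Assumption: "[...]" or "*" as the local address means IPv6-capable,
# an unbracketed dotted quad means IPv4-only.
# Intended use: ss -tnl | ipv4_only_ports 9100 9110
ipv4_only_ports() {
  awk -v min="$1" -v max="$2" '
    $1 == "LISTEN" {
      addr = $4
      port = addr
      sub(/.*:/, "", port)                 # keep the text after the last colon
      if (port + 0 < min + 0 || port + 0 > max + 0) next
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if (host ~ /^\[/ || host == "*") v6[port] = 1
      else v4[port] = 1
    }
    END {
      for (p in v4) if (!(p in v6)) print p
    }
  ' | sort -n
}
```

Fed the listing above with the range 9100 to 9110, this reports port 9100, which matches the observation that the distribution listener is bound to IPv4 only.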

When it's running this way, mongooseimctl fails with the nodedown error. I believe this is because my hostnames resolve to IPv6 addresses, so connections are attempted to ports 9100-9110 on the IPv6 address rather than the IPv4 one.

As an experiment, we tried running socat -dd TCP-LISTEN:9100,ipv6only,fork TCP4:127.0.0.1:9100 to set up an IPv6 listener that forwards to the IPv4 address on the same port (we did this for all the ports 9100-9110). That actually allows mongooseimctl to work, and I can run commands, but the workaround doesn't seem to help Mnesia or CETS with clustering.
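
For anyone repeating the experiment, the per-port forwarders can be generated in a loop. This is a sketch only: print_forwarders is a hypothetical helper, it reuses the socat invocation quoted above unchanged, and it prints the commands instead of running them so they can be reviewed first.

```shell
#!/bin/sh
# Prints (does not run) one socat forwarder command per port in [min, max],
# using the same socat invocation quoted in the post.
print_forwarders() {
  port=$1
  while [ "$port" -le "$2" ]; do
    printf 'socat -dd TCP-LISTEN:%s,ipv6only,fork TCP4:127.0.0.1:%s &\n' "$port" "$port"
    port=$((port + 1))
  done
}
```

Running print_forwarders 9100 9110 emits eleven lines, one per port. As noted above, this only made mongooseimctl work; it did not make clustering work.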

My suspicion is that -proto_dist inet6_tcp is not being respected somewhere (because whatever is starting the listener on 9100 is still just using ipv4), or some networking code is not using ipv6-compatible TCP options somewhere. I've looked through a lot of code in MongooseIM and cets for clues, but I don't have the background in erlang distribution/networking to know exactly where to look or what to look for.

root@mongooseim-1:/# hostname -f
mongooseim-1.mongooseim.qwick-chat.svc.cluster.local
root@mongooseim-1:/# mongooseim ping mongooseim@mongooseim-0.mongooseim.qwick-chat.svc.cluster.local
pong
root@mongooseim-1:/# mongooseim ping mongooseim@mongooseim-1.mongooseim.qwick-chat.svc.cluster.local
pong

mongooseimctl cets systemInfo output:

root@mongooseim-1:/# mongooseimctl cets systemInfo
{
  "data" : {
    "cets" : {
      "systemInfo" : {
        "unavailableNodes" : [
          "mongooseim@mongooseim-0.mongooseim.qwick-chat.svc.cluster.local"
        ],
        "remoteUnknownTables" : [

        ],
        "remoteNodesWithoutDisco" : [

        ],
        "remoteNodesWithUnknownTables" : [

        ],
        "remoteNodesWithMissingTables" : [

        ],
        "remoteMissingTables" : [

        ],
        "joinedNodes" : [
          "mongooseim@mongooseim-1.mongooseim.qwick-chat.svc.cluster.local"
        ],
        "discoveryWorks" : true,
        "discoveredNodes" : [
          "mongooseim@mongooseim-0.mongooseim.qwick-chat.svc.cluster.local",
          "mongooseim@mongooseim-1.mongooseim.qwick-chat.svc.cluster.local"
        ],
        "conflictTables" : [

        ],
        "conflictNodes" : [

        ],
        "availableNodes" : [
          "mongooseim@mongooseim-1.mongooseim.qwick-chat.svc.cluster.local"
        ]
      }
    }
  }
}
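
When checking several pods, it can help to pull a single list out of this output. The sketch below is an assumption-laden helper, not a MongooseIM feature: cets_list is a hypothetical name, and it relies on the pretty-printed layout shown above (one string per line) instead of parsing JSON properly.

```shell
#!/bin/sh
# Prints the entries of one named list from the pretty-printed JSON shown
# above, one entry per line. This is line-oriented text matching, not a
# JSON parser, so it depends on the exact layout of that output.
# Intended use: mongooseimctl cets systemInfo | cets_list unavailableNodes
cets_list() {
  awk -v key="\"$1\"" '
    index($0, key " : [") { inside = 1; next }   # opening line of the list
    inside && /\]/        { exit }               # closing bracket ends it
    inside && /"/ {
      line = $0
      gsub(/[" ,\t]/, "", line)                  # strip quotes, blanks, commas
      print line
    }
  '
}
```

Applied to the output above, cets_list unavailableNodes prints the single mongooseim-0 node name, and cets_list conflictNodes prints nothing.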

Questions

arcusfelis commented 7 months ago

Hi, a node being listed as unavailable (i.e. in unavailableNodes) means it failed net_adm:ping.

i.e.

net_adm:ping('mongooseim@mongooseim-0.mongooseim.qwick-chat.svc.cluster.local').
pang

What should you check?

Oh, and there is resolver logic in erlang too:

inet:gethostbyname('google.com', inet6).
{ok,{hostent,"google.com",[],inet6,16,
             [{10752,5200,16411,2062,0,0,0,8206}]}}
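
The address in that result is an Erlang tuple of eight 16-bit integers, which is hard to compare with what other tools print. A small sketch for converting it to the usual hexadecimal form (erl_ip6_to_text is a hypothetical helper name, and it does not apply "::" zero compression):

```shell
#!/bin/sh
# Formats the eight 16-bit integers of an Erlang IPv6 address tuple as
# colon-separated hexadecimal groups (no "::" zero compression).
erl_ip6_to_text() {
  printf '%x:%x:%x:%x:%x:%x:%x:%x\n' "$@"
}
```

For the tuple above, erl_ip6_to_text 10752 5200 16411 2062 0 0 0 8206 prints 2a00:1450:401b:80e:0:0:0:200e, which can then be compared with the AAAA record that the pod's own resolver returns.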

To debug deeper, we would need to figure out how to configure Docker Desktop for k8s with IPv6 only. Or the same but on CircleCI ;)

chrzaszcz commented 7 months ago

Hi @andywhite37. I can confirm that the inet6_tcp option is supported. You can check it with the following:

The difference from your setup seems to lie in the DNS resolution, as @arcusfelis suggested.

I think I'd ask you to do some debugging on your side. Run mongooseimctl debug on one of your nodes. Then, in the Erlang shell, try to do the following:

inet:gethostname().
net_adm:names().

Please provide the results. Could you also tell me what hostname returns (without -f) and what it resolves to? My first guess would be that it's not possible to reach epmd.