Ylianst / MeshCentral

A complete web-based remote monitoring and management site. Once set up, you can install agents and run remote desktop sessions on devices on the local network or over the Internet.
https://meshcentral.com
Apache License 2.0

Server peering failure #427

Closed mfw78 closed 4 years ago

mfw78 commented 5 years ago

Attempting to configure server peering fails with the following configuration:

  "peers": {
    "serverId": "asrss01",
    "servers": {
      "asrss01": { "url": "ws://1.2.3.4:4430/" },
      "asrss02": { "url": "ws://5.6.7.8:4430/" },
      "ocrss01": { "url": "ws://9.10.11.12:4430/" }
    }
  },

MeshCentral logs an error as such:

node[2612]: Error: Unable to peer with other servers, "http-1" not present in peer servers list.

http-1 is actually the hostname of the server in question. I recall from the documentation that MeshCentral automatically falls back to the hostname if the serverId option is not configured, so I went digging through multiserver.js.

When debugging, it appears that the case of the serverId key is not preserved when config.json is parsed, so the serverId check in multiserver.js fails (please excuse my poor JavaScript skills; I'm unsure whether JSON keys are case sensitive, but a quick Google search seemed to indicate that they are).
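To illustrate the case-sensitivity point, here is a minimal sketch in plain Node.js. The `lowercaseKeys` helper is hypothetical and is not taken from MeshCentral; it only shows what a loader that lowercases keys at parse time would do to a `serverId` entry.

```javascript
// JSON object keys are case-sensitive: "serverId" and "serverid" are different keys.
const config = JSON.parse('{ "peers": { "serverId": "asrss01" } }');

console.log(config.peers.serverId); // lookup using the key exactly as written
console.log(config.peers.serverid); // lowercase lookup on the same object

// Hypothetical helper (an assumption, not MeshCentral code): recursively
// lowercase every object key so lookups can rely on one canonical spelling.
function lowercaseKeys(value) {
  if (Array.isArray(value)) return value.map(lowercaseKeys);
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key.toLowerCase()] = lowercaseKeys(inner);
    }
    return out;
  }
  return value; // strings, numbers, booleans and null are returned unchanged
}

const normalized = lowercaseKeys(config);
console.log(normalized.peers.serverid); // the value is now reachable in lowercase
```

If keys are normalised this way, any code that still reads the mixed-case `serverId` spelling gets `undefined`, which would be consistent with the hostname fallback seen in the error above.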

I tweaked the logic in multiserver.js for a lowercase serverid as follows:

    // If we have no peering configuration, don't setup this object
    if (obj.peerConfig == null) { return null; }
    obj.serverid = obj.parent.config.peers.serverid;

But this didn't help much, as I was subsequently greeted with a whole array of failures:

Aug 12 02:55:57 http-1 node[3073]: ERROR: MeshCentral failed with critical error, check MeshErrors.txt. Restarting in 5 seconds...
Aug 12 02:55:57 http-1 node[3073]:    '/usr/bin/node /home/app-meshcentral/node_modules/meshcentral --launch 3073' }
Aug 12 02:55:57 http-1 node[3073]:   cmd:
Aug 12 02:55:57 http-1 node[3073]:   signal: null,
Aug 12 02:55:57 http-1 node[3073]:   code: 1,
Aug 12 02:55:57 http-1 node[3073]:   killed: false,
Aug 12 02:55:57 http-1 node[3073]:     at Process.ChildProcess._handle.onexit (internal/child_process.js:266:5)
Aug 12 02:55:57 http-1 node[3073]:     at maybeClose (internal/child_process.js:999:16)
Aug 12 02:55:57 http-1 node[3073]:     at ChildProcess.emit (events.js:198:15)
Aug 12 02:55:57 http-1 node[3073]:     at ChildProcess.exithandler (child_process.js:299:12)
Aug 12 02:55:57 http-1 node[3073]:     at Socket.Readable.push (_stream_readable.js:231:10)
Aug 12 02:55:57 http-1 node[3073]:     at readableAddChunk (_stream_readable.js:276:11)
Aug 12 02:55:57 http-1 node[3073]:     at addChunk (_stream_readable.js:295:12)
Aug 12 02:55:57 http-1 node[3073]:     at Socket.emit (events.js:193:13)
Aug 12 02:55:57 http-1 node[3073]:     at Socket.socketOnData (_http_client.js:480:11)
Aug 12 02:55:57 http-1 node[3073]:     at ClientRequest.emit (events.js:193:13)
Aug 12 02:55:57 http-1 node[3073]:     at ClientRequest.req.on (/home/app-meshcentral/node_modules/ws/lib/websocket.js:665:15)
Aug 12 02:55:57 http-1 node[3073]:     at WebSocket.setSocket (/home/app-meshcentral/node_modules/ws/lib/websocket.js:170:10)
Aug 12 02:55:57 http-1 node[3073]:     at WebSocket.emit (events.js:193:13)
Aug 12 02:55:57 http-1 node[3073]:     at WebSocket.<anonymous> (/home/app-meshcentral/node_modules/meshcentral/multiserver.js:77:10>
Aug 12 02:55:57 http-1 node[3073]: TypeError: obj.ws._socket.getPeerCertificate is not a function
Aug 12 02:55:57 http-1 node[3073]:                                                                                                  >
Aug 12 02:55:57 http-1 node[3073]:                 var serverCert = obj.forge.pki.certificateFromAsn1(obj.forge.asn1.fromDer(obj.ws.>
Aug 12 02:55:57 http-1 node[3073]: /home/app-meshcentral/node_modules/meshcentral/multiserver.js:77
Aug 12 02:55:57 http-1 node[3073]: the options [useNewUrlParser] is not supported

Note: The raw log file above is organised from most recent log line to least recent log line. Sorry for the confusion there.

Ylianst commented 5 years ago

Oh my. I did an early test of server peering a long time ago to make sure it could be supported at some point, but I have not tested it recently, and I know there are a bunch of things I need to fix for peering to work correctly. For one, I know not all server operations have the peering code written. So it's effectively not a supported feature right now; however, it's certainly something I want to add. I will do some work to fix the big issues like this for sure.

mfw78 commented 5 years ago

Sure, no worries. I'm available for assistance with debugging / running tests.

To give you an overview of what I want to achieve: I'm setting up a multi-datacenter MeshCentral cluster (most probably overkill for my requirements, but I'd rather the system be able to handle multiple failures).

As such the topology is as follows:

  1. mesh.example.com DNS referring to anycast IP address 1.2.3.4
  2. Front-facing instances (3 in total) hosted on 1.2.3.4 are NGINX with TLS offload for WWW and MPS. Each NGINX host has its own MeshCentral instance.
  3. Meshcentral configured via MongoDb replicaSet for high availability.
  4. Server peering, with configuration referencing the relative unicast IP addresses of the cluster's members.

Expectations:

  1. All agents/Intel AMT devices connect to mesh.example.com. Given the anycast IP address, each should connect to its own datacenter's MeshCentral instance.
  2. Able to log on to mesh.example.com from any datacenter and access all Intel AMT devices / agents across all datacenters.
  3. Able to fail over n - 1 MeshCentral instances and still have MeshCentral operative, as the anycast IP address route advertisement will be removed as instances fail over.

I've got all the anycast routing etc happening. Just the peering to go!

NB: The anycast IP address 1.2.3.4 referenced here is just for the purposes of demonstrating an IP address. It's not related at all to the aforementioned server peer URLs in MeshCentral's config.json.

Ylianst commented 5 years ago

You got the perfect setup for server peering. By the way, peering will require the MongoDB ChangeStream which requires a replicaset. In upcoming versions I will require that the MongoDB ChangeStream option be true when doing peering because this is how all the servers will be synchronized to the database.

With a little bit of work and testing, I should be able to get basic peering working. Would be great if you did testing on it and filed issues. Would certainly motivate me to move that feature along.

Ylianst commented 5 years ago

Published MeshCentral v0.3.9-r with server peering working, at least on my dev machine. Update all your servers and try again.

If you don't specify the "serverid", the lowercase of the hostname is used instead. That part should be fixed. You will also need this in the "settings" section of config.json:

"MongoDbChangeStream": true

This will make MeshCentral use MongoDB change streams, which are required to make peering work. Let me know how it goes. Certainly there are going to be some more things to fix, but the basics should work.
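The fallback described above can be sketched roughly as follows. This is an illustration only, not MeshCentral's actual code: `resolveServerId` is a hypothetical name, and the sketch assumes config keys have already been lowercased, so the option is read as `serverid`.

```javascript
const os = require('os');

// Hypothetical helper (assumption): choose the id this server peers under.
// An explicit "serverid" wins; otherwise the lowercased hostname is used.
function resolveServerId(peers, hostname = os.hostname()) {
  const id = peers.serverid || hostname.toLowerCase();
  // The chosen id has to appear in the "servers" list, otherwise peering
  // cannot start. The message mirrors the error quoted earlier in this thread.
  if (!Object.prototype.hasOwnProperty.call(peers.servers, id)) {
    throw new Error(`Unable to peer with other servers, "${id}" not present in peer servers list.`);
  }
  return id;
}

// Example peer list; the addresses are the placeholders used in this thread.
const servers = {
  asrss01: { url: 'wss://1.2.3.4:4430/' },
  asrss02: { url: 'wss://5.6.7.8:4430/' }
};

console.log(resolveServerId({ serverid: 'asrss01', servers }));
```

With no `serverid` set and a hostname such as `http-1` that is not in the list, the sketch throws the same kind of error reported at the top of this issue.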

mfw78 commented 5 years ago

Updated to MeshCentral v0.3.9-r and added the change to the settings. I've still left a "serverid" specified: falling back to the hostname doesn't work well in my case, because all the servers share the same hostname and only the DNS suffix differs by datacenter.

The following stack trace occurs repetitively on startup:

ERROR: MeshCentral failed with critical error, check MeshErrors.txt. Restarting in 5 seconds...
{ Error: Command failed: /usr/bin/node /home/app-meshcentral/node_modules/meshcentral --launch 25473
the options [useNewUrlParser] is not supported
/home/app-meshcentral/node_modules/mongodb/lib/mongo_client.js:433
          throw err
          ^

TypeError: obj.file.watch is not a function
    at /home/app-meshcentral/node_modules/meshcentral/db.js:237:49
    at connectCallback (/home/app-meshcentral/node_modules/mongodb/lib/mongo_client.js:527:5)
    at /home/app-meshcentral/node_modules/mongodb/lib/mongo_client.js:430:11
    at processTicksAndRejections (internal/process/task_queues.js:79:9)

    at ChildProcess.exithandler (child_process.js:299:12)
    at ChildProcess.emit (events.js:198:15)
    at maybeClose (internal/child_process.js:999:16)
    at Process.ChildProcess._handle.onexit (internal/child_process.js:266:5)
  killed: false,
  code: 1,
  signal: null,
  cmd:
   '/usr/bin/node /home/app-meshcentral/node_modules/meshcentral --launch 25473' }

Ylianst commented 5 years ago

Thanks for the quick report. I will take a look at it tomorrow.

MailYouLater commented 5 years ago

Just as an FYI, I had an error similar to the above without any server peering. What happened is that I updated to 0.3.9-r (at this point I wasn't having any issues yet), then I installed the agent on another computer. Any time the agent was running on that computer, the meshcentral server would crash, but if I stopped the agent on that computer it would work fine for all the other computers. I've since updated to 0.3.9-s, and now it all seems to be working fine again.

Ylianst commented 5 years ago

Thanks for the details. I am finishing up some other thing and I am going to dive into server peering.

As a side note: MeshCentral 0.3.9-r had a really bad problem with agent connections. Bryan spotted it a few hours ago and I rushed a fix.

(fixed MeshCentral 0.3.9-s to MeshCentral 0.3.9-r above)

Ylianst commented 5 years ago

Yes, you are correct.

Ylianst commented 5 years ago

Ok, I will need more information about your server. I really need your node version and mongodb package version. That would be these two commands made just above the "node_modules" folder:

node -v
npm view mongodb version

Thanks, Ylian

mfw78 commented 5 years ago

Sorry for such an obvious omission. I've also included OS version / kernel version as well.

Details as follows:

$ node -v
v11.15.0
$ npm view mongodb version
3.3.0
$ uname -a
Linux http-1 5.2.5-arch1-1-ARCH #1 SMP PREEMPT Wed Jul 31 08:30:34 UTC 2019 x86_64 GNU/Linux

Ylianst commented 5 years ago

I saw your reported issue and published MeshCentral v0.3.9-y with server peering fixes, should work now.

Before you update or run "npm install meshcentral" again, you should go in "node_modules" and remove "mongodb", "mongo-core" and "mongojs" folders to force these 3 to update or better yet, just rename or delete the entire "node_modules" folder and install MeshCentral v0.3.9-y or better.

It should work now. Let me know what you see.

Details: The "obj.file.watch is not a function" problem is because the "mongodb" module in the "node_modules" is a really old version that is pulled with "mongojs" that I don't use anymore. In v0.3.9-y I removed the dependency on "mongojs" and so, the old "mongodb" should no longer be installed. I also made a bunch more fixes as a result of setting up two different computers on the MongoDB and testing this setup again.
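A defensive check for this failure mode might look like the sketch below. It is hypothetical and not part of MeshCentral: `assertChangeStreamSupport` is an invented name, and the stub objects stand in for collection objects returned by a new and an old "mongodb" driver.

```javascript
// Hypothetical guard (assumption): fail early with a readable message when the
// collection object has no watch() method, instead of crashing later with
// "obj.file.watch is not a function".
function assertChangeStreamSupport(collection) {
  if (typeof collection.watch !== 'function') {
    throw new Error(
      'The installed "mongodb" module has no Collection.watch(); ' +
      'remove node_modules and reinstall to pick up a newer driver.'
    );
  }
  return collection;
}

// Stand-ins for real collection objects, for illustration only.
const modernCollection = { watch() { return { on() {} }; } };
const legacyCollection = {};

assertChangeStreamSupport(modernCollection); // passes silently
try {
  assertChangeStreamSupport(legacyCollection);
} catch (err) {
  console.log(err.message);
}
```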

mfw78 commented 5 years ago

Per your advice, I removed node_modules completely and reinstalled MeshCentral in order to update the dependencies. Initially, the following occurred when I tried to start MeshCentral:

Aug 15 05:07:52 http-1 node[14859]: (node:15180) DeprecationWarning: current Server Discovery and Monitoring engine is deprecated, and will be removed in a future version. To use the new Server Discover and Monitoring engine, pass option>
Aug 15 05:07:52 http-1 node[14859]: /home/app-meshcentral/node_modules/meshcentral/multiserver.js:78
Aug 15 05:07:52 http-1 node[14859]:                 var serverCert = obj.forge.pki.certificateFromAsn1(obj.forge.asn1.fromDer(obj.ws._socket.getPeerCertificate().raw.toString('binary')));
Aug 15 05:07:52 http-1 node[14859]:                                                                                                          ^
Aug 15 05:07:52 http-1 node[14859]: TypeError: obj.ws._socket.getPeerCertificate is not a function
Aug 15 05:07:52 http-1 node[14859]:     at WebSocket.<anonymous> (/home/app-meshcentral/node_modules/meshcentral/multiserver.js:78:106)
Aug 15 05:07:52 http-1 node[14859]:     at WebSocket.emit (events.js:193:13)
Aug 15 05:07:52 http-1 node[14859]:     at WebSocket.setSocket (/home/app-meshcentral/node_modules/ws/lib/websocket.js:170:10)
Aug 15 05:07:52 http-1 node[14859]:     at ClientRequest.req.on (/home/app-meshcentral/node_modules/ws/lib/websocket.js:665:15)
Aug 15 05:07:52 http-1 node[14859]:     at ClientRequest.emit (events.js:193:13)
Aug 15 05:07:52 http-1 node[14859]:     at Socket.socketOnData (_http_client.js:480:11)
Aug 15 05:07:52 http-1 node[14859]:     at Socket.emit (events.js:193:13)
Aug 15 05:07:52 http-1 node[14859]:     at addChunk (_stream_readable.js:295:12)
Aug 15 05:07:52 http-1 node[14859]:     at readableAddChunk (_stream_readable.js:276:11)
Aug 15 05:07:52 http-1 node[14859]:     at Socket.Readable.push (_stream_readable.js:231:10)
Aug 15 05:07:52 http-1 node[14859]:     at ChildProcess.exithandler (child_process.js:299:12)
Aug 15 05:07:52 http-1 node[14859]:     at ChildProcess.emit (events.js:198:15)
Aug 15 05:07:52 http-1 node[14859]:     at maybeClose (internal/child_process.js:999:16)
Aug 15 05:07:52 http-1 node[14859]:     at Process.ChildProcess._handle.onexit (internal/child_process.js:266:5)
Aug 15 05:07:52 http-1 node[14859]:   killed: false,
Aug 15 05:07:52 http-1 node[14859]:   code: 1,
Aug 15 05:07:52 http-1 node[14859]:   signal: null,
Aug 15 05:07:52 http-1 node[14859]:   cmd:
Aug 15 05:07:52 http-1 node[14859]:    '/usr/bin/node /home/app-meshcentral/node_modules/meshcentral --launch 14859' }
Aug 15 05:07:52 http-1 node[14859]: ERROR: MeshCentral failed with critical error, check MeshErrors.txt. Restarting in 5 seconds...

Upon further investigation, I took a punt and bet on there being some problem with connecting via TLS to the peered servers. My server peering configuration has not changed, and still stands at:

  "peers": {
    "serverId": "asrss01",
    "servers": {
      "asrss01": { "url": "ws://1.2.3.4:4430/" },
      "asrss02": { "url": "ws://5.6.7.8:4430/" },
      "ocrss01": { "url": "ws://9.10.11.12:4430/" }
    }
  },

Given that the websocket URI is configured for non-TLS, I tweaked the URI to wss:// and MeshCentral started, no problems, with server peering, so all seems to work well with the wss:// URI.
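The reason `ws://` triggers the TypeError can be seen with Node's built-in modules: a plain TCP socket has no `getPeerCertificate` method, while a TLS socket does. The sketch below also includes `findNonTlsPeers`, a hypothetical pre-flight check (not part of MeshCentral) that lists peers whose URL is not `wss://`.

```javascript
const net = require('net');
const tls = require('tls');

// A ws:// connection runs over a plain net.Socket, which lacks the method.
console.log(typeof new net.Socket().getPeerCertificate);
// A wss:// connection runs over a tls.TLSSocket, which provides it.
console.log(typeof tls.TLSSocket.prototype.getPeerCertificate);

// Hypothetical helper (assumption): return the ids of peers that are not
// configured with a TLS websocket URL.
function findNonTlsPeers(servers) {
  return Object.entries(servers)
    .filter(([, peer]) => new URL(peer.url).protocol !== 'wss:')
    .map(([id]) => id);
}

// The peer list from the configuration above.
const peerServers = {
  asrss01: { url: 'ws://1.2.3.4:4430/' },
  asrss02: { url: 'ws://5.6.7.8:4430/' },
  ocrss01: { url: 'ws://9.10.11.12:4430/' }
};

console.log(findNonTlsPeers(peerServers)); // all three ids, since every URL uses ws://
```

A check like this at startup would turn the crash into a configuration warning, though whether plain `ws://` peering should be supported at all is a design decision for the project.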

As a side note, the entire cluster that I'm running, all servers are managed via Intel AMT, therefore I do have issues whereby some Intel AMT agents don't connect to Meshcentral, given that they are essentially trying to connect to themselves. Therefore, I was thinking of another method to solve this...

Would it be possible to have a setup whereby MeshCentral runs as one instance but is accessible via two different domain names (e.g. mesh1.example.com and mesh2.example.com), with the same user id / password authentication that is currently used to provision AMT hosts? My approach would then be to:

  1. Change the script used to provision Intel AMT hosts to connect to both mesh1.example.com and mesh2.example.com, using the same credentials.
  2. mesh1.example.com DNS A record for anycast 1.2.3.4, mesh2.example.com DNS A record for anycast 5.6.7.8.
  3. Have NGINX bind to both 1.2.3.4 and 5.6.7.8. NB: there is one NGINX proxy in each data center. Given that the routing tables always favour the local NGINX proxy:
  4. Tweak the routing tables to increase the metric for 5.6.7.8 in the local data center to abnormally higher than 5.6.7.8 anycast'ed from other data centers.
  5. Therefore, the server running MeshCentral in the data center's Intel AMT agent would try to connect to mesh1.example.com/1.2.3.4, but given that's the same device, this would fail. Intel AMT would then try to connect to mesh2.example.com/5.6.7.8. Given that 5.6.7.8 is both on the same machine, but has an abnormally high metric, the routing tables would bias this traffic to be sent to another MeshCentral instance in another data center that's on 5.6.7.8 (anycasted). Thus, the local Intel AMT agent connects to Meshcentral in another data center, and by act of server peering, would therefore be visible to any users connected to any Meshcentral instance in any data center.

Sorry for the roundabout way of describing what I'm wanting to achieve. I hope it makes sense!

mfw78 commented 5 years ago

Additional testing of server peering has yielded an odd problem (though no error log from what I can see)...

When using server peering as described above with wss:// URIs, I notice that only non-Intel AMT agents are showing as connected to MeshCentral.

Current settings in config.json:

    "TlsOffload": true,
    "MpsPort": 44330,
    "MpsAliasPort": 4433,
    "MpsTlsOffload": true,

Current CertUrl in default "" domains in config.json:

      "CertUrl": "https://mesh.example.com:443/",

All NGINX proxies are configured with stream details as follows:

stream {
    # Internal MPS servers; in this case we use one MeshCentral MPS server on our own computer.
    upstream mpsservers {
        server 127.0.0.1:44330;
    }

    # We can use the MeshCentral generated MPS certificate & key
    ssl_certificate /home/app-meshcentral/meshcentral-data/mpsserver-cert-public.crt;
    ssl_certificate_key /home/app-meshcentral/meshcentral-data/mpsserver-cert-private.key;
    ssl_session_cache shared:MPSSSL:10m;
    #ssl_session_timeout 999999s;

    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    #include http-conf.d/20_https-options.conf;

    # MPS server
    server {
        listen 4433 ssl;
        proxy_pass mpsservers;
        proxy_next_upstream on;
    }
}

With the above configuration, no Intel AMT agents were displayed as connected in any MeshCentral instance. Debugging steps taken:

  1. Confirmed that anycast mesh.example.com is routable and pingable.
  2. Via tcpdump, confirmed that Intel AMT agents are able to communicate with the NGINX stream proxy.
  3. Via tcpdump, confirmed that NGINX stream proxy is communicating with MeshCentral via loopback adaptor.

As an aside, I can confirm that failover is working well otherwise. As can be expected when another host takes over the anycast IP, the websocket TCP connections drop, but a quick refresh has you back where you were immediately. Very nice!

Ylianst commented 5 years ago

Just to make sure I understand. Are you using Intel AMT with CIRA? That is, are all your AMT machines attempting to connect to the MPS server on NGINX port 4433? Or are Intel AMT machines being found on the local network without CIRA?

(Let's ignore the case where the servers are Intel AMT themselves for now)

Ylianst commented 5 years ago

Just published MeshCentral v0.4.0-a with a fix to allow Intel AMT KVM/Terminal connections to work across server peers. I got a loop-back AMT KVM session to happen for the first time (Server2 AMT connected using CIRA to Server1, then browse to Server2 and start the KVM session).

FYI: I am going on vacation for a week, but will be on email and will have access to the code if there is a big problem and Bryan will keep working this week on the agent.

mfw78 commented 5 years ago

> Just to make sure I understand. Are you using Intel AMT with CIRA? That is, are all your AMT machines attempting to connect to the MPS server on NGINX port 4433? Or are Intel AMT machines being found on the local network without CIRA?
>
> (Let's ignore the case where the servers are Intel AMT themselves for now)

That's correct, the Intel AMT machines are all connecting via CIRA to the MPS server on NGINX port 4433. I've just updated to v0.4.0-b by removing node_modules and reinstalling MeshCentral. So far, it seems to be the same as before (no Intel AMT CIRA agents visible with server peering).

If there's any command I can run to increase the amount of information I can give you for debugging purposes, please let me know.

Enjoy the vacation!

Ylianst commented 5 years ago

Just to get more details on "No Intel AMT CIRA agents visible with server peering". When you run a single server, do you see the "CIRA" label show up for some computers? But when you do peering, you don't see this regardless of which server the browser is connected to? Can you confirm that one server works, but when adding a second, CIRA fails on both? Any details appreciated.

I did not specify this previously, but both servers need to have the same set of certificates. That is, you should copy "meshcentral-data" from server1 and place it in server2 and only change the "serverid" between both servers. This said, in your case NGINX will take care of the certificate on port 4433, so that is probably not the issue.
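One way to keep peers identical apart from the "serverid" is to derive each server's configuration from a shared base, as in the sketch below. This is an illustration, not a MeshCentral feature: `configForServer` is a hypothetical helper, and the key names assume the lowercase spelling used earlier in this thread.

```javascript
// Hypothetical helper (assumption): copy a shared base config and set only
// peers.serverid, so every peer shares the same settings and peer list.
function configForServer(baseConfig, serverId) {
  if (!Object.prototype.hasOwnProperty.call(baseConfig.peers.servers, serverId)) {
    throw new Error(`"${serverId}" is not in the peer servers list`);
  }
  // Shallow copies are enough here: only peers.serverid differs per server,
  // and the base object itself is left untouched.
  return { ...baseConfig, peers: { ...baseConfig.peers, serverid: serverId } };
}

// Shared base; the addresses are the placeholders used in this thread.
const baseConfig = {
  settings: { mongodbchangestream: true },
  peers: {
    servers: {
      asrss01: { url: 'wss://1.2.3.4:4430/' },
      asrss02: { url: 'wss://5.6.7.8:4430/' }
    }
  }
};

const server1 = configForServer(baseConfig, 'asrss01');
const server2 = configForServer(baseConfig, 'asrss02');
console.log(server1.peers.serverid, server2.peers.serverid);
```

The certificates in "meshcentral-data" still have to be copied between servers separately; this sketch only covers config.json.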

mfw78 commented 5 years ago

Hi, sorry for the delay, I've been travelling. I'll run some more diagnostics, there is a good chance at the moment it may be a routing problem at my end, so I'll get back to you.

Ylianst commented 5 years ago

No worries. Server peering in its very basic form is working for me, but more work will have to be done for peering to be ready for prime time.