Ylianst / MeshCentral

A complete web-based remote monitoring and management web site. Once set up, you can install agents and run remote desktop sessions to devices on the local network or over the Internet.
https://meshcentral.com
Apache License 2.0
4.14k stars 554 forks

Most all Agents disconnected in 0.7.85? #2380

Closed southeasterntech closed 3 years ago

southeasterntech commented 3 years ago

Suddenly without explanation, a vast majority of clients are now disconnected on 0.7.85. Running the installer again produces the update button and it reconnects. But I can't do that across 1000 EP's... any idea on how to fix? Restarting the server didn't help.

image

Ylianst commented 3 years ago

Do you know what version you were running before? There has not been an agent change in a while.

In "My Server" / "Console" type "agentstats" and let me know what you see.

image

Also, any errors in the general tab:

image

Also, add the following line to the config.json settings section:

"ignoreagenthashcheck": true

Then restart the server and let me know what you see.
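
For anyone who prefers to script that change, here is a minimal Python sketch. It assumes config.json is plain JSON and that the section is named "settings" in lowercase; the function name and the path you pass in are illustrative, not part of MeshCentral.

```python
import json
from pathlib import Path


def enable_ignore_agent_hash_check(config_path):
    """Set settings.ignoreagenthashcheck to true in a config.json file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the section key is exactly "settings"; create it if absent.
    settings = config.setdefault("settings", {})
    settings["ignoreagenthashcheck"] = True
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Keep a copy of the original file before running something like this, and restart the server afterwards as described above.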

Ylianst commented 3 years ago

One more thing I would check. Is it possible you're running two instances of MeshCentral at the same time? If instance 1 gets all the agent connections and you're looking at instance 2... then re-installing the agent makes the agent connect to instance 2.

One way to notice this is to compare the meshagent.msh of previous devices with the new meshagent.msh you're installing. If there is a difference in the ServerID or MeshServer lines, there is a problem. The .msh should look like this:

MeshName=Lab Computers
MeshType=2
MeshID=0xEDBE1BE37...B7DB6B4E7971EF34D36EBB6B875CF3D7DED1EE7CD5C
ServerID=D99362D5ED8...D707403E396CF0EF6DC2B3A42F735135FD
MeshServer=wss://central.mesh.meshcentral.com:443/agent.ashx
webSocketMaskOverride=1

Another issue that could have happened is that you removed the certificates in "meshcentral-data", causing your server to no longer be trusted by agents. Comparing the old and new .msh would indicate what is going on.
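
The .msh comparison can be done by eye, but with many devices a small script may help. This is a rough Python sketch, assuming the file is simple KEY=VALUE lines as in the example above; the function names are made up for illustration.

```python
def parse_msh(text):
    """Parse the KEY=VALUE lines of a .msh file into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def msh_differences(old_text, new_text, keys=("ServerID", "MeshServer")):
    """Return {key: (old_value, new_value)} for each key that differs."""
    old, new = parse_msh(old_text), parse_msh(new_text)
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

Read the two files into strings and pass them in. An empty result means both agents point at the same server identity and address; any entry in the result is the kind of difference described above.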

Ylianst commented 3 years ago

Also, if you are running plugins or a reverse proxy, let me know.

southeasterntech commented 3 years ago

Yes on the reverse proxy (Caddy). Here are the screens.

Thanks,

Shane D. Lewis

On Mar 11, 2021, at 5:53 PM, Ylian Saint-Hilaire @.***> wrote:



Also, if you are running plugins or a reverse proxy, let me know.


southeasterntech commented 3 years ago

oops, looks like the screens didn't attach... checking for duplicate instances now. IMG_0291 IMG_0292

southeasterntech commented 3 years ago

Also, the agent hash check has been disabled for a couple of years now... I don't see any sign of a second server running: only the Mesh Service is running, with no separate node instance that I can tell... restarting the Caddy reverse proxy too... let's see what happens

Ylianst commented 3 years ago

So far, nothing obviously incorrect. The "coreIsStableCount" is low. Looks like agents are trying to connect, authenticating correctly and then disconnecting. I would look at the reverse proxy first for sure. By the way, I am running v0.7.85 on MeshCentral.com with 10k devices. I don't run a reverse proxy, but I don't have any obvious concerns about that version.

If there is a connectivity issue and it's fixed, it will take 10 to 20 minutes for agents to slowly reconnect.

southeasterntech commented 3 years ago

Caddy reboot didn't fix it.

But.... I go to three existing, not connected computers and mesh server ID is ending in : ....B76A

However, when I open Mesh...copy the invite link and then attempt to run the installer, the installer Server ID ends in :...44E3

So it does indeed seem I have a second server running.... or a different server running.... So, I have no visible NodeJS instances that I can see and I only have the Mesh Central Service running.... any ideas on how to spot the rogue server?

image

Ylianst commented 3 years ago

Is the entire hash different? Or just the ending? If it's a different server, the hash would be completely different. What counts is what is in the .msh file, maybe the UI is not showing the full hash.

Assuming the hash is completely different, the main problem is that you have a different "meshcentral-data" folder in your server. The root cause is that these two files in "meshcentral-data" are different from the original:

agentserver-cert-public.crt
agentserver-cert-private.key

So, if you find these two files in a previous backup, put them back in place and restart the server, all the agents will see your server as correct and agree to connect.

Make sure to back up your current "meshcentral-data" before making any changes.
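
The backup-then-restore step could be scripted along these lines. This is only a sketch: the directory arguments are placeholders for wherever your folders live, the function name is made up, and it assumes the server is stopped while the files are swapped.

```python
import shutil
from pathlib import Path

# The two files named above.
CERT_FILES = ("agentserver-cert-public.crt", "agentserver-cert-private.key")


def backup_and_restore_certs(data_dir, old_backup_dir, safety_copy_dir):
    """Copy the current data folder aside, then restore the old agent certs."""
    data_dir, old_backup_dir = Path(data_dir), Path(old_backup_dir)
    # Fail before touching anything if the old backup is incomplete.
    missing = [n for n in CERT_FILES if not (old_backup_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in backup: {missing}")
    # Full copy of the current folder; copytree raises if the target exists,
    # so an earlier safety copy is never overwritten.
    shutil.copytree(data_dir, safety_copy_dir)
    for name in CERT_FILES:
        shutil.copy2(old_backup_dir / name, data_dir / name)
    return [data_dir / name for name in CERT_FILES]
```

Copying the whole folder first means the swap can be undone by copying the safety folder back.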

southeasterntech commented 3 years ago

sorry for the kindergarten scrawl.. Before Update image After Update image

southeasterntech commented 3 years ago

What could have caused a different set of certs\data folder to suddenly appear? We didn't restore an old backup, snapshot or anything that I know of... Granted we have techs doing stuff in the office but I'd have known if they'd have messed with this server. Thanks..

I mean I can go to the server backup from 2 weeks ago, copy these certs and bring them over...

Now that I think about it.... back in January, I copied the data folder from an old server running NeDB to this new server running MongoDB and it's been much more stable.... I wonder if something finally caught up to me, cert expiration or something?...no clue.

southeasterntech commented 3 years ago

New Server Cert image Old Server image

Ylianst commented 3 years ago

This is certainly the issue. The "ServerID" is used by the agent to authenticate the server. When connecting, the agent asks the server to prove it is the correct one, and the server must sign a random string given by the agent with the "agentserver" certificate. This can't be bypassed. The agents will refuse to continue the connection unless the server is correct.
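
To illustrate why agents holding the old ServerID reject a server with a regenerated certificate, here is a toy challenge-response sketch in Python. It is not MeshCentral's code. Python's standard library has no certificate signatures, so a keyed HMAC stands in for signing with the "agentserver" certificate, and the class and method names are invented for the example.

```python
import hashlib
import hmac
import os


class Server:
    def __init__(self, key):
        self._key = key  # stand-in for the "agentserver" private key

    def sign(self, challenge):
        # Stand-in for a certificate signature over the challenge.
        return hmac.new(self._key, challenge, hashlib.sha384).digest()


class Agent:
    def __init__(self, trusted_key):
        self._trusted_key = trusted_key  # stand-in for what the ServerID pins

    def accepts(self, server):
        challenge = os.urandom(32)  # fresh random string for each connection
        proof = server.sign(challenge)
        expected = hmac.new(self._trusted_key, challenge, hashlib.sha384).digest()
        return hmac.compare_digest(proof, expected)
```

An agent created with the old key accepts a server holding that key and refuses one holding a different key, which matches the behaviour seen here: old agents stayed disconnected while freshly installed ones connected.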

Your server can't have two "agentserver" certificates... so, it's going to have to be the old one or the new one.

Lastly, you can change the meshagent.msh file to what you like and restart the agent. No need to re-install the agent.

I can't help you with what happened, but you sure root caused the issue exactly right.

Ylianst commented 3 years ago

Also, can you check whether there is a "mesherrors.txt" file in "meshcentral-data"? If there is anything in there, it would be interesting to see. Maybe there is a hint.

southeasterntech commented 3 years ago

So...if I swap in the certs from the old server.... then this would work...to the exclusion of anything that was working on the new server.... right? Here are the stats from this week.... you can see a massive drop in registrations... so maybe I should go back to a snapshot from a couple of days back and copy in the data folder...

image

Thanks so much Ylian... looks like we're close here.

Ylianst commented 3 years ago

You are exactly correct. If you put the old cert back, the new agents will not connect anymore.

southeasterntech commented 3 years ago

I'm going to go back to a snapshot from Mar 2 and grab the data folder and compare\drop that in and see what goes down... be interesting to see what changed.

southeasterntech commented 3 years ago

So.... in the last 9 days...the certs are different.... pulling the data folder from Mar 2, the keys aren't the same. Any idea you can think of that could have caused the certs to regenerate? Swapped in the old data folder and many more devices are beginning to populate... thanks so much Ylian... We'll just have to go back and re-add the 40 or so newest agents. Thanks a million, as always you're incredible..

image

Ylianst commented 3 years ago

Glad I could help. I don't have any idea what could have happened, I have not gotten a report like this before. Can you take a look at "mesherrors.txt" in "meshcentral-data" and report back if there is anything in that file? Thanks.

southeasterntech commented 3 years ago

Nothing in there at all other than a missing PNG file with our logo....which I just now restored from backup...... shrug. OK filing this one under the Bigfoot and Nessie folder..... Moving on to more important things. Thanks as always for the help. Shane

Ylianst commented 3 years ago

OK. Thanks. Still would have been nice to get a report on mesherrors.txt.

southeasterntech commented 3 years ago

mesherrors.txt

No Problem, here's mesherrors.txt, it was working fine on 3\2\2021 so I didn't go past that....

southeasterntech commented 3 years ago

@Ylianst see above, thanks again