shentino opened this issue 2 years ago (status: Open)
@ChristopherA @noahgibbs @saraht45
Thank you for checking this! I'll try to have a look this evening, and update this thread when I do.
We're lucky the DGD instance didn't crash; running out of disk space while writing snapshot information would have been a fatal error.
Good catch @shentino
We're out of space again, I'm deleting more backups
All snapshot backups from 2021 have been deleted
Oh, huh, looks like something is generating huge amounts of userdb.log entries (see: /var/log/userdb*.log). Looks like AuthD is recording login attempts that are all garbage.
And it's just been happening for under a week, since the 2nd or so.
Yeah, something's hitting that port over and over. It's also produced many GB of driver logfile in the port 6000 DGD directory. I'm gonna delete some of that since it's just DGD logging receipt of garbage over and over.
I'll cut out some of the huge userdb logs as well. We still have plenty with garbage, but we don't need many GB worth of it.
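For the trimming, something like this is roughly the plan (the size is just an example): keep the most recent chunk, then truncate the original in place so whatever is still writing to it keeps a valid file descriptor instead of a deleted one.

    cd /var/log
    # Keep the last ~50 MB for reference, then empty the live log in place so
    # the writing process doesn't end up holding open a deleted file.
    tail -c 50M userdb.log > userdb.log.trimmed
    truncate -s 0 userdb.log
    gzip userdb.log.trimmed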
> Oh, huh, looks like something is generating huge amounts of userdb.log entries (see: /var/log/userdb*.log). Looks like AuthD is recording login attempts that are all garbage.
This is a side effect of exposing that port to the wild internet when previously it was firewalled behind skotos's private network.
Skotos's original infrastructure IIRC assumed that a central server would handle all "skotos account" logins
I've seen the same spam getting generated when I was working with @shannona
The garbage is coming from the internet.
I'd like that hole closed sooner rather than later if possible; it's a potential security problem if anyone manages to brute-force anything
It looks like it's coming in via the DGD server, not directly to the AuthD (good, that shouldn't be exposed to the internet.)
My mistake then, but yes, I remember exactly where this garbage is coming from: spambots on the internet knocking on the wrong door.
As it is I already have kotaka automatically siteban these IPs for 90 days
If the DGD server is forwarding them as-is, that could still be a security issue if they aren't being screened.
I'm not sure how we'd fully close the hole, since we need something available that can accept an internet login. We should probably stop logging every failed login, though.
> I'm not sure how we'd fully close the hole, since we need something available that can accept an internet login. We should probably stop logging every failed login, though.
I think it's still valuable to log that they're happening
BUT
I think we should definitely rate-limit them, maybe only log one failure every 5 minutes.
On the first one, raise a masking flag and set a callout for 300 seconds; when it expires, drop the flag again.
You also helped me realize this is a potential denial-of-service vector.
Definitely go ahead and stop the log spam; a hostile attacker may deliberately crapflood us for the sole purpose of hogging all our disk space with junk
Right now DGD tries the login with AuthD. Both AuthD and DGD log the attempt. So that's how all that space is getting used up.
We could reduce the volume massively by just kicking out everything that doesn't look at all like a username. For instance, I see many attempts with the exact same noise string here, "n� ���a3� ��u���3�� p�!��p�?� �^����x�q�?�6��".
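As a rough estimate of how much that would save, something like this should count the logged attempts that contain bytes which could never appear in a real username (log path and format assumed from above, so treat it as a sketch):

    # Count log lines containing anything outside printable ASCII -- a crude
    # proxy for "doesn't look at all like a username".
    grep -vacP '^[\x20-\x7E]*$' /var/log/userdb.log
    # Total lines, for comparison:
    wc -l /var/log/userdb.log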
And either they've stopped or they come in batches. I've been sitting here for a bit with "tail -f userdb.log" open, and nothing new has come in. Yeah, looks like the last attempts were at around 2pm my time (it's 4pm right now.) But the rate seems like it's dropped a lot, at least for now.
That looks a lot like the same kind of spam that's tripping my automatic siteban script in kotaka.
It does look like they managed to get DGD to pass garbage to AuthD somehow. It's logging a bunch of "bad command" records.
I'd have to disagree with your broadcast message there - a denial-of-service isn't great, but much better than exposing user data ;-)
It's also very, very hard to prevent all possible denial-of-service attacks. But we get to play on easy mode, since a denial-of-service on The Gables doesn't cost us money, so we're not likely to pay a ransom.
Not for us personally but it helps skotoslib be more robust for downstream users installing it
That was the whole point of putting skotos 2.0.0.0.0.0.0.0-alpha99 on github in the first place
and reread the broadcast, I never said a DoS was a good thing :P
Hm. Doesn't look like we rule out any particular characters from existing usernames. I should check how many we have that currently use unexpected characters.
Hm. We currently have 41 users in the DB, and the only even-slightly-unusual character in a name is an underscore. That suggests that 1) the brute forcing hasn't done anything too awful as yet and 2) this would be a fantastic time to add restrictions on what characters go in valid usernames, at least as far as The Gables is concerned.
And I suspect you don't see a lot of folks upgrading their thin-auth installations, though it would be nice to be wrong about that.
The database also sets the character set to utf8, so we could also disqualify any name and/or pw that isn't valid utf8.
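For the record, this is roughly how I'm checking the existing names (assuming the userdb lives in MySQL; the table and column names here are placeholders, so adjust to the real schema):

    # Names containing anything other than letters, digits, or underscore
    # ("users"/"name" are placeholder table/column names):
    mysql -N -e 'SELECT name FROM users' userdb | grep -P '[^A-Za-z0-9_]'
    # Names that aren't valid UTF-8 (run under a UTF-8 locale; -x means the
    # whole line must match '.*', so invalid byte sequences get printed):
    mysql -N -e 'SELECT name FROM users' userdb | grep -naxv '.*'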
Hm. Definitely saw a bunch of garbage logins through the DGD logfiles. Not sure if this is going through thin-auth, DGD or both.
At least some of this is coming through thin-auth's login.php. While it does not log invalid login attempts (good), it's generating a couple of PHP warnings every time it hits the page, so we still get a giant logfile (/var/log/apache2/login-error).
Though 'giant' is relative. The biggest one I see is only 1.1 MB, so far less than we're getting for some of the other services.
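For anyone comparing later, this is roughly how I'm sizing them up (paths are just the ones mentioned in this thread):

    # Largest log locations, human-readable, biggest last:
    du -sh /var/log/apache2/* /var/log/userdb*.log /var/skotos/6000/skoot/log 2>/dev/null | sort -h | tail -n 15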
Haven't seen a lot more hammering on the doors. I have a thin-auth patch I'm testing locally to reject non-UTF-8 usernames as garbage, and we already throw out anything under 4 chars or over 30. But I think short-term, the danger/annoyance is probably past.
I should also make a DGD patch. I think some of those log entries have to be from them hitting the DGD port.
Here we go, checking skoot/log/driver.log.1662594328, there are a lot of these:
Sep 7 19:42:50 ** error:BAD INPUT: " ERR BAD COMMAND ([Sep)"
error: /kernel/obj/binary#8011
error: 89 /kernel/obj/binary receive_message
error: 203 /kernel/lib/connection receive_message
error: /usr/UserAPI/obj/authd_tcp#8167
error: 39 /usr/UserAPI/obj/authd_tcp receive_message
error: /usr/UserAPI/sys/authd_port
error: 112 - receive_message
error: /usr/UserAPI/sys/authd
error: 142 - receive_message
Sep 7 19:42:50 ** error:BAD INPUT: " ERR BAD COMMAND ([Sep)"
error: /kernel/obj/binary#8011
error: 89 /kernel/obj/binary receive_message
error: 203 /kernel/lib/connection receive_message
error: /usr/UserAPI/obj/authd_tcp#8167
error: 39 /usr/UserAPI/obj/authd_tcp receive_message
error: /usr/UserAPI/sys/authd_port
error: 112 - receive_message
error: /usr/UserAPI/sys/authd
error: 142 - receive_message
Just in general, du -h thinks there's 14GB of stuff in there and I don't see nearly the size of files in there that I'd expect for that. So some process probably has deleted files still open and being written to, or something along those lines.
Also, all the recent driver.log files are 0-size, so that's probably not right either. And it looks like that started 7th Sept, and there's nothing later than 7th Sept. Hrm.
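Two quick checks that might explain the mismatch (sketch only; the path is the one from above):

    # What du is actually counting, largest entries last:
    du -ah /var/skotos/6000/skoot/log 2>/dev/null | sort -h | tail -n 20
    # Space held by files that were deleted but are still open somewhere --
    # these count against df but won't show up in a directory listing:
    lsof -nP +L1 2>/dev/null | grep /var/skotos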
I can get in fine, though, both via web interface and wiz port.
> I should also make a DGD patch. I think some of those log entries have to be from them hitting the DGD port.
I STRONGLY advise against patching DGD itself; that will break downstream compatibility for our userbase.
DGD is designed to be flexible, and this kind of handling is usually the responsibility of LPC code, in this case skotoslib and/or the included kernel library.
Your "neo archaeology" column and/or your previous commentary on the DGD mailing list would bear this out as well.
Instead of messing with DGD itself, you should probably trace which actual port is being hit and, with that, which "handler" is taking care of connections on that port. The log messages are from send_message being used in the driver object somehow, and you can probably trace the logging back.
In this case, since it's an error message with a stack trace, you're more likely to find the offending traffic and, with it, the handling code
As for the logs, if you run out of space it's likely that the driver log is going to be corrupted, because it's a file-based redirection of DGD's stderr.
I'll be rebooting the server shortly to reset everything.
In this case the "DGD" port is actually either the klib's built-in admin port or the skotoslib version with added bells and whistles
DGD code itself isn't handling the connections, merely passing them up the food chain to the LPC layer.
I say again, please DO NOT tamper with DGD's source code
No, I just meant patching code written in DGD, much as if I said "a Ruby patch" to mean a patch in Ruby code. This is by contrast with thin-auth, which is in PHP.
Ok cool; taken literally, it meant something completely different.
Got another "out of space" error while doing routine maintenance
An "arm's length" du fingered /var/skotos/6000/skoot/log as the culprit
I purged all "driver.log" files in order of age, oldest first, until enough space was freed up by the following command:
rm -rf driver.log.1662*
Disk usage is now back down to 87 percent
One of the log files was 23200284672 bytes long
23,200,284,672 - 23G
I think we found the dragon turd on the street ^.^
The 23G log file has now been deleted
Correction: that's the size of the log that hasn't been deleted, but enough space was recovered by deleting older logs that there must have been another space hog buried in the past.
Since this was an emergency deletion I won't be tampering with any more files and I only cleaned these up because I was able to reverse engineer the timestamp suffixes.
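For next time, something like this would do the same oldest-first purge without reverse engineering the timestamp suffixes by hand (sketch only; the 90% threshold and path are examples, and it assumes the filenames have no spaces):

    cd /var/skotos/6000/skoot/log || exit 1
    for f in $(ls -1tr driver.log.*); do
        # Stop once the filesystem holding the logs is under 90% full.
        usage=$(df --output=pcent . | tail -n 1 | tr -dc '0-9')
        [ "$usage" -lt 90 ] && break
        rm -f "$f"
    done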
Just deleted all empty log files and just now removed another whopper dgd log
23200284672 bytes long
The dgd error log is filling up with a lot of this:
Oct 9 15:20:07 ** error:BAD INPUT: " ERR BAD COMMAND ([Oct)"
error: /kernel/obj/binary#22445
error: 89 /kernel/obj/binary receive_message
error: 203 /kernel/lib/connection receive_message
error: /usr/UserAPI/obj/authd_tcp#8183
error: 39 /usr/UserAPI/obj/authd_tcp receive_message
error: /usr/UserAPI/sys/authd_port
error: 112 - receive_message
error: /usr/UserAPI/sys/authd
error: 142 - receive_message
Sorry for falling behind with maintenance, RL has been cruel with multiple deaths in the family lately
However, I discovered that the server hosting Gables had run out of disk space, when an attempt to perform a routine update started spamming me with "no space left" errors
A quick df confirmed the issue
I did an emergency delete of all snapshot backups from 2020 to free up some space, and I'm rerunning the maintenance script now from scratch
But be warned: we may want to consider rethinking our backup strategy, with automatic or periodic purges of old files and/or upgrading the VM's storage
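As a starting point for the automatic purges, a retention job along these lines might work; the backup path, file patterns, and ages below are placeholders, not settled policy:

    # /etc/cron.d/gables-retention (sketch): drop rotated DGD driver logs after
    # 30 days and snapshot backups after 180 days. Adjust paths and ages.
    15 3 * * * root find /var/skotos/6000/skoot/log -name 'driver.log.*' -mtime +30 -delete
    30 3 * * * root find /path/to/snapshot/backups -name 'snapshot*' -mtime +180 -delete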