shentino opened this issue 2 years ago (status: Open)
@ChristopherA @noahgibbs @saraht45
Thank you for checking this! I'll try to have a look this evening, and update this thread when I do.
We're lucky the DGD instance didn't crash; running out of disk space while writing snapshot information would have been a fatal error.
Good catch @shentino
We're out of space again, I'm deleting more backups
All snapshot backups from 2021 have been deleted
Oh, huh, looks like something is generating huge amounts of userdb.log entries (see: /var/log/userdb*.log). Looks like AuthD is recording login attempts that are all garbage.
And it's just been happening for under a week, since the 2nd or so.
Yeah, something's hitting that port over and over. It's also produced many GB of driver logfile in the port 6000 DGD directory. I'm gonna delete some of that since it's just DGD logging receipt of garbage over and over.
I'll cut out some of the huge userdb logs as well. We still have plenty with garbage, but we don't need many GB worth of it.
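For the trimming, something like this is roughly the plan (the size is just an example): keep the most recent chunk, then truncate the original in place so whatever is still writing to it keeps a valid file descriptor instead of a deleted one.

    cd /var/log
    # Keep the last ~50 MB for reference, then empty the live log in place so
    # the writing process doesn't end up holding open a deleted file.
    tail -c 50M userdb.log > userdb.log.trimmed
    truncate -s 0 userdb.log
    gzip userdb.log.trimmed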
> Oh, huh, looks like something is generating huge amounts of userdb.log entries (see: /var/log/userdb*.log). Looks like AuthD is recording login attempts that are all garbage.
This is a side effect of exposing that port to the wild internet when previously it was firewalled behind skotos's private network.
Skotos's original infrastructure IIRC assumed that a central server would handle all "skotos account" logins
I've seen the same spam getting generated when I was working with @shannona
The garbage is coming from the internet.
I'd like that hole closed sooner rather than later if possible; it's a potential security problem if anyone manages to brute-force anything
It looks like it's coming in via the DGD server, not directly to the AuthD (good, that shouldn't be exposed to the internet.)
My mistake then, but yes, I remember exactly where this garbage is coming from: spambots on the internet knocking on the wrong door.
As it is I already have kotaka automatically siteban these IPs for 90 days
If the DGD server is forwarding them as-is, that could still be a security issue if they aren't being screened.
I'm not sure how we'd fully close the hole, since we need something available that can accept an internet login. We should probably stop logging every failed login, though.
> I'm not sure how we'd fully close the hole, since we need something available that can accept an internet login. We should probably stop logging every failed login, though.
I think it's still valuable to log that they're happening
BUT
I think we should definitely rate-limit them, maybe only log one failure every 5 minutes.
On the first one, raise a masking flag and set a callout for 300 seconds; when it expires, drop the flag again.
You also helped me realize this is a potential denial-of-service vector.
Definitely go ahead and stop the log spam; a hostile attacker may deliberately crapflood us for the sole purpose of hogging all our disk space with junk
Right now DGD tries the login with AuthD. Both AuthD and DGD log the attempt. So that's how all that space is getting used up.
We could reduce the volume massively by just kicking out everything that doesn't look at all like a username. For instance, I see many attempts with the exact same noise string here, "n� ���a3� ��u���3�� p�!��p�?� �^����x�q�?�6��".
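As a rough estimate of how much that would save, something like this should count the logged attempts that contain bytes which could never appear in a real username (log path and format assumed from above, so treat it as a sketch):

    # Count log lines containing anything outside printable ASCII -- a crude
    # proxy for "doesn't look at all like a username".
    grep -vacP '^[\x20-\x7E]*$' /var/log/userdb.log
    # Total lines, for comparison:
    wc -l /var/log/userdb.log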
And either they've stopped or they come in batches. I've been sitting here for a bit with "tail -f userdb.log" open, and nothing new has come in. Yeah, looks like the last attempts were at around 2pm my time (it's 4pm right now.) But the rate seems like it's dropped a lot, at least for now.
That looks a lot like the same kind of spam that's tripping my automatic siteban script in kotaka.
It does look like they managed to get DGD to pass garbage to AuthD somehow. It's logging a bunch of "bad command" records.
I'd have to disagree with your broadcast message there - a denial-of-service isn't great, but much better than exposing user data ;-)
It's also very, very hard to prevent all possible denial-of-service attacks. But we get to play on easy mode, since a denial-of-service on The Gables doesn't cost us money, so we're not likely to pay a ransom.
Not for us personally but it helps skotoslib be more robust for downstream users installing it
That was the whole point of putting skotos 2.0.0.0.0.0.0.0-alpha99 on github in the first place
and reread the broadcast, I never said a DoS was a good thing :P
Hm. Doesn't look like we rule out any particular characters from existing usernames. I should check how many we have that currently use unexpected characters.
Hm. We currently have 41 users in the DB, and the only even-slightly-unusual character in a name is an underscore. That suggests that 1) the brute forcing hasn't done anything too awful as yet and 2) this would be a fantastic time to add restrictions on what characters go in valid usernames, at least as far as The Gables is concerned.
And I suspect you don't see a lot of folks upgrading their thin-auth installations, though it would be nice to be wrong about that.
The database also sets the character set to utf8, so we could also disqualify any name and/or pw that isn't valid utf8.
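For the record, this is roughly how I'm checking the existing names (assuming the userdb lives in MySQL; the table and column names here are placeholders, so adjust to the real schema):

    # Names containing anything other than letters, digits, or underscore
    # ("users"/"name" are placeholder table/column names):
    mysql -N -e 'SELECT name FROM users' userdb | grep -P '[^A-Za-z0-9_]'
    # Names that aren't valid UTF-8 (run under a UTF-8 locale; -x means the
    # whole line must match '.*', so invalid byte sequences get printed):
    mysql -N -e 'SELECT name FROM users' userdb | grep -naxv '.*'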
Hm. Definitely saw a bunch of garbage logins through the DGD logfiles. Not sure if this is going through thin-auth, DGD or both.
At least some of this is coming through thin-auth's login.php. While it does not log invalid login attempts (good), it's generating a couple of PHP warnings every time it hits the page, so we still get a giant logfile (/var/log/apache2/login-error).
Though 'giant' is relative. The biggest one I see is only 1.1 MB, so far less than we're getting for some of the other services.
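For anyone comparing later, this is roughly how I'm sizing them up (paths are just the ones mentioned in this thread):

    # Largest log locations, human-readable, biggest last:
    du -sh /var/log/apache2/* /var/log/userdb*.log /var/skotos/6000/skoot/log 2>/dev/null | sort -h | tail -n 15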
Haven't seen a lot more hammering on the doors. I have a thin-auth patch I'm testing locally to reject non-UTF-8 usernames as garbage, and we already throw out anything under 4 chars or over 30. But I think short-term, the danger/annoyance is probably past.
I should also make a DGD patch. I think some of those log entries have to be from them hitting the DGD port.
Here we go, checking skoot/log/driver.log.1662594328, there are a lot of these:
Sep 7 19:42:50 ** error:BAD INPUT: " ERR BAD COMMAND ([Sep)"
error: /kernel/obj/binary#8011
error: 89 /kernel/obj/binary receive_message
error: 203 /kernel/lib/connection receive_message
error: /usr/UserAPI/obj/authd_tcp#8167
error: 39 /usr/UserAPI/obj/authd_tcp receive_message
error: /usr/UserAPI/sys/authd_port
error: 112 - receive_message
error: /usr/UserAPI/sys/authd
error: 142 - receive_message
Sep 7 19:42:50 ** error:BAD INPUT: " ERR BAD COMMAND ([Sep)"
error: /kernel/obj/binary#8011
error: 89 /kernel/obj/binary receive_message
error: 203 /kernel/lib/connection receive_message
error: /usr/UserAPI/obj/authd_tcp#8167
error: 39 /usr/UserAPI/obj/authd_tcp receive_message
error: /usr/UserAPI/sys/authd_port
error: 112 - receive_message
error: /usr/UserAPI/sys/authd
error: 142 - receive_message
Just in general, du -h thinks there's 14GB of stuff in there and I don't see nearly the size of files in there that I'd expect for that. So some process probably has deleted files still open and being written to, or something along those lines.
Also, all the recent driver.log files are 0-size, so that's probably not right either. And it looks like that started 7th Sept, and there's nothing later than 7th Sept. Hrm.
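Two quick checks that might explain the mismatch (sketch only; the path is the one from above):

    # What du is actually counting, largest entries last:
    du -ah /var/skotos/6000/skoot/log 2>/dev/null | sort -h | tail -n 20
    # Space held by files that were deleted but are still open somewhere --
    # these count against df but won't show up in a directory listing:
    lsof -nP +L1 2>/dev/null | grep /var/skotos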
I can get in fine, though, both via web interface and wiz port.
> I should also make a DGD patch. I think some of those log entries have to be from them hitting the DGD port.
I STRONGLY advise against patching DGD itself; that will break downstream compatibility for our userbase.
DGD is designed to be flexible, and this kind of handling is usually the responsibility of LPC code, in this case skotoslib and/or the included kernel library.
Your "neo archaeology" column and/or your previous commentary on the DGD mailing list would bear this out as well.
Instead of messing with DGD itself, you should probably trace which actual port is being hit and, with that, which "handler" is taking care of connections on that port. The log messages are from send_message being used in the driver object somehow, and you can probably trace the logging back.
In this case, since it's an error message with a stack trace, you're more likely to find the offending traffic and, with it, the handling code
As for the logs, if you run out of space it's likely that the driver log is going to be corrupted, because it's a file-based redirection of DGD's stderr.
I'll be rebooting the server shortly to reset everything.
In this case the "DGD" port is actually either the klib's built-in admin port or the skotoslib version with added bells and whistles
DGD code itself isn't handling the connections, merely passing them up the food chain to the LPC layer.
I say again, please DO NOT tamper with DGD's source code
No, I just meant patching code written in DGD, much as if I said "a Ruby patch" to mean a patch in Ruby code. This is by contrast with thin-auth, which is in PHP.
Ok cool; taken literally, it meant something completely different.
Got another "out of space" error while doing routine maintenance
An "arm's length" du fingered /var/skotos/6000/skoot/log as the culprit
I purged all "driver.log" files in order of age, oldest first, until enough space was freed up by the following command:
rm -rf driver.log.1662*
Disk usage is now back down to 87 percent
One of the log files was 23200284672 bytes long
23,200,284,672 - 23G
I think we found the dragon turd on the street ^.^
The 23G log file has now been deleted
Correction: that's the size of the log that hasn't been deleted, but enough space was recovered by deleting older logs that there must have been another space hog buried in the past.
Since this was an emergency deletion I won't be tampering with any more files and I only cleaned these up because I was able to reverse engineer the timestamp suffixes.
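For next time, something like this would do the same oldest-first purge without reverse engineering the timestamp suffixes by hand (sketch only; the 90% threshold and path are examples, and it assumes the filenames have no spaces):

    cd /var/skotos/6000/skoot/log || exit 1
    for f in $(ls -1tr driver.log.*); do
        # Stop once the filesystem holding the logs is under 90% full.
        usage=$(df --output=pcent . | tail -n 1 | tr -dc '0-9')
        [ "$usage" -lt 90 ] && break
        rm -f "$f"
    done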
Just deleted all empty log files and just now removed another whopper dgd log
23200284672 bytes long
The dgd error log is filling up with a lot of this:
Oct 9 15:20:07 ** error:BAD INPUT: " ERR BAD COMMAND ([Oct)"
error: /kernel/obj/binary#22445
error: 89 /kernel/obj/binary receive_message
error: 203 /kernel/lib/connection receive_message
error: /usr/UserAPI/obj/authd_tcp#8183
error: 39 /usr/UserAPI/obj/authd_tcp receive_message
error: /usr/UserAPI/sys/authd_port
error: 112 - receive_message
error: /usr/UserAPI/sys/authd
error: 142 - receive_message
Sorry for falling behind with maintenance, RL has been cruel with multiple deaths in the family lately
However, I discovered that the server hosting Gables had run out of disk space, when an attempt to perform a routine update started spamming me with "no space left" errors
A quick df confirmed the issue
I did an emergency delete of all snapshot backups from 2020 to free up some space, and I'm rerunning the maintenance script now from scratch
But be warned: we may want to consider rethinking our backup strategy, with automatic or periodic purges of old files and/or upgrading the VM's storage
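As a starting point for the automatic purges, a retention job along these lines might work; the backup path, file patterns, and ages below are placeholders, not settled policy:

    # /etc/cron.d/gables-retention (sketch): drop rotated DGD driver logs after
    # 30 days and snapshot backups after 180 days. Adjust paths and ages.
    15 3 * * * root find /var/skotos/6000/skoot/log -name 'driver.log.*' -mtime +30 -delete
    30 3 * * * root find /path/to/snapshot/backups -name 'snapshot*' -mtime +180 -delete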