Enkidu-6 / tor-ddos

iptables rules for Tor relay operators to mitigate ddos
https://enkidu-6.github.io/tor-ddos/

many overload messages #2

Closed ToterEngel closed 1 year ago

ToterEngel commented 2 years ago

I think the script is well done. It is clear enough that even a noob can understand it.

I have now tested these rules for a day, but with half the blocking time.

The number of incoming connections was also almost halved. However, I still have a lot of overload messages in the log.

Are there any other solutions to further optimize it?

Enkidu-6 commented 2 years ago

The current DDoS attack is primarily designed to make your system shut down, either by pushing RAM usage to the maximum or by exhausting your available ports for outgoing connections through many concurrent connections. The NTor drops are a side effect of Tor not being able to handle them.

The only way to get rid of the NTor drops is to increase the number of CPUs until Tor is patched to handle the connections better. I have been running and tweaking these rules for quite a while now. I believe the 12 hour timeout is a good balance between an aggressive and a lax approach. Each time an IP is released from your block list, you're giving it another chance to make another two connections.

Not sure if you also applied the tweaks I mentioned for your sysctl.conf. But all of them combined will allow your system to run for a long time without having to reboot. It will run at a steady RAM and CPU usage with occasional RAM spikes which will soon recover. Ignore the NTor drops and let the system run. Once you get the HSDir flag, which takes 4 days, the attacks will noticeably increase.

My relays have been running for the past 30 days with very steady RAM usage and they still show as green, although the attacks have severely increased over the past few days. The number of blocked IP addresses has doubled and my relays have shown as overloaded 3 times in the past 4 days. But they go back to green within 6 to 12 hours.

ToterEngel commented 2 years ago

I had accepted all changes in the sysctl.conf. For weeks I have also been using the Tor internal DDOS protection with the following config:

DoSConnectionEnabled 1
DoSConnectionMaxConcurrentCount 2
DoSCircuitCreationEnabled 1
DoSCircuitCreationMinConnections 2
DoSCircuitCreationRate 2
DoSCircuitCreationDefenseTimePeriod 3600
DoSRefuseSingleHopClientRendezvous 1

However, this is not very efficient.

Do you have any experience that the network adapter can still be optimized? An Intel I219-LM is installed on my Hetzner server.

Enkidu-6 commented 2 years ago

The NTor drop as I mentioned is a problem with Tor not being able to handle it. It's not that it won't receive the packets, it's the fact that it receives them but it can't handle it. The only way to mitigate that to a certain point is to increase the number of CPUs.

The iptables rules are designed to make sure your server will run for a long time and at a manageable steady RAM and CPU usage but they can't solve the application shortcomings.

Search for "Enabling Multi-Queue on Network Devices", but chances are it's already enabled. You can check with ethtool -l eth0 (or whatever your network interface is).
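
One way to read the ethtool -l output is to compare the "Combined" value under the pre-set maximums with the one under the current settings. The following is a minimal sketch, assuming the usual two-section layout of that output; the function name combined_channels is made up for illustration.

```shell
# Print "<max> <current>" Combined channel counts from `ethtool -l` output on stdin.
# Assumes the usual layout: a "Pre-set maximums:" section followed by
# "Current hardware settings:". combined_channels is a hypothetical helper name.
combined_channels() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        /^Combined:/                  { value[section] = $2 }
        END { print value["max"], value["cur"] }
    '
}
```

Usage would be something like ethtool -l eth0 | combined_channels. If the current value is lower than the maximum, ethtool -L eth0 combined N may raise it, depending on the driver.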

Enkidu-6 commented 2 years ago

@ToterEngel the attacks seem to have calmed down a bit. How's your system performing now?

ToterEngel commented 2 years ago

The CPU load has dropped a bit since yesterday. Now the blocked IPs have also been reduced from around 1800 to 1000. The blocked relays also level off between 25-30. (over 70 yesterday)

However, the relay is still displayed as overloaded. So something is still running in the background.

Enkidu-6 commented 2 years ago

What are the specs for your server, e.g. RAM, CPU, MaxAdvertisedBandwidth? Also, looking in your logs, what is the percentage of your NTor drops?

ToterEngel commented 2 years ago

I have an i7-7700 with 64GB of RAM. The targeted bandwidth is currently running at 60mbit. However, a reduction to 40mbit had made no difference in the long term.

The overload is at least only slightly present. 2-3 days ago it was 2-5% currently: Ntor dropped (288125) fraction 0.5897% is above threshold of 0.5000%
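
To track that percentage over time, the count and fraction can be pulled out of the heartbeat lines. This is a rough sketch that assumes the log wording matches the line quoted above; ntor_drops is a hypothetical helper name.

```shell
# Extract "<dropped_count> <fraction_percent>" from Tor log lines on stdin.
# Assumes the wording "Ntor dropped (N) fraction X%" as quoted in this thread.
# ntor_drops is a hypothetical helper name.
ntor_drops() {
    sed -n 's/.*[Nn][Tt]or dropped (\([0-9][0-9]*\)) fraction \([0-9.][0-9.]*\)%.*/\1 \2/p'
}
```

Feeding it the Tor notices log (the path depends on your distribution and torrc) prints one "count fraction" pair per matching heartbeat.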

cybermonkee commented 1 year ago

I find my Ntor stats vary wildly - sometimes it is a tiny amount and sometimes it is double figures. I have a N5105 with 8GB RAM dedicated, and a 200mbit connection dedicated for this purpose. But you are right, as soon as I get the HSDir flag stuff goes nuts DDoS wise... I think removing the Dir port might have been to spread the load amongst all relays to try and mitigate.

I was worried that my CPU was not fast enough but a N5105 with AESNI and 8GB RAM should be plenty... I mean this thing can run Graphical Windows!!! - so a tiny Linux Kernel, and some network routing and TLS offloading should be easy - right?

ToterEngel commented 1 year ago

It varies a lot for me too. In the last few days, however, the nature of the attack has changed somewhat.

The number of blocked relays has dropped from 30-40 to 12. But the blocked IPs today increased from 1000-1500 to over 5000. Here it doesn't matter to me whether the HSDIR flag is present or not in my observations.

cybermonkee commented 1 year ago

I am currently blocking 85 relays... on the tor-ddos hash I have 3061 blocked addresses - it's creeping up... I have reduced my bandwidth rate and burst a bit to see what happens.

Superpaul209 commented 1 year ago

> I was worried that my CPU was not fast enough but a N5105 with AESNI and 8GB RAM should be plenty... I mean this thing can run Graphical Windows!!! - so a tiny Linux Kernel, and some network routing and TLS offloading should be easy - right?

My relay has a CPU without AESNI (it's a CPU from 2008) and it always gets the overloaded status... it has 4 cores and 4 GB RAM. Hopefully the anti-DDoS script will help to avoid those issues.

cybermonkee commented 1 year ago

What bandwidth are you donating?

Superpaul209 commented 1 year ago

> What bandwidth are you donating?

I'm donating 8 MB/s maximum but it doesn't hit that much bandwidth. It's 5 or 6 MB/s on average.

cybermonkee commented 1 year ago

I am offering 12MBytes per second with a burst of 18Mbytes.

I do think that for the level you are offering you should have an AESNI-capable CPU - it does make a big difference.

Enkidu-6 commented 1 year ago

The attacks increased since yesterday and going full force now, that's why the blocked numbers have increased. My blocked IPs generally are in the vicinity of 600 with about 20-25 relays in there. I just checked one of my relays and I have 4096 guests. It will pass.

My two relays are assigned 12 VCPU each and with the MaxAdvertisedBandwidth of 20 MiB (160 mbps) each. They've been running for almost 12 days with green status but one of them just turned yellow a few hours ago. The other one is still Green. The overloaded one generally goes back to Green within one or two heartbeats. Rarely longer.

Enkidu-6 commented 1 year ago

14 hours later. The Blocked IPs are down to 990 from over 4000. Both Relays are Green. One is still dealing with Ddos and the other one not so much. I removed relays from the block list on the calm one and keeping them in the list on the one under attack.

4 relays showed up in the list again almost immediately.

cybermonkee commented 1 year ago

My blocked list went down to below 500 earlier today - it is now back up to 1676 - but out of those only 54 match the relay list. On the persec list I have noticed that a few of them are usually registered to Russia.

ToterEngel commented 1 year ago

Over the last hour, my CPU load and the overall system load have dropped significantly. Only 15 relays left and ~4500 IPs banned. :-D

Wouldn't it make sense in terms of performance to switch the firewall to XDP, or does that make no difference with so few filter rules?

Enkidu-6 commented 1 year ago

@cybermonkee If your ORPort is a known port like 443, the persec list catches a lot of port scans and may not be related to the DDoS.

@ToterEngel iptables and mangle rules are directly processed at the kernel level. No middle man. Doesn't get faster and more efficient than that.

cybermonkee commented 1 year ago

Persec2 and tor2-ddos appear to be empty - is that right? I am running two relays on one host. One relay is on port 443 and the other is on 9051.

Chain PREROUTING (policy ACCEPT)
target  prot opt source    destination
ACCEPT  tcp  --  anywhere  anywhere  match-set allow-list src
ACCEPT  tcp  --  anywhere  anywhere  match-set allow-list src
        tcp  --  anywhere  anywhere  tcp dpt:https recent: SET name: tor-ddos side: source mask: 255.255.255.255
        tcp  --  anywhere  anywhere  tcp dpt:9051 recent: SET name: tor2-ddos side: source mask: 255.255.255.255
SET     tcp  --  anywhere  anywhere  tcp dpt:https flags:FIN,SYN,RST,ACK/SYN ctstate NEW limit: above 3/sec burst 4 mode srcip htable-expire 3500 add-set persec src
SET     tcp  --  anywhere  anywhere  tcp dpt:9051 flags:FIN,SYN,RST,ACK/SYN ctstate NEW limit: above 3/sec burst 4 mode srcip htable-expire 3500 add-set persec2 src
SET     tcp  --  anywhere  anywhere  tcp dpt:https #conn src/32 > 2 add-set tor-ddos src
SET     tcp  --  anywhere  anywhere  tcp dpt:9051 #conn src/32 > 2 add-set tor2-ddos src
DROP    tcp  --  anywhere  anywhere  match-set persec src
DROP    tcp  --  anywhere  anywhere  match-set persec2 src
DROP    tcp  --  anywhere  anywhere  match-set tor-ddos src
DROP    tcp  --  anywhere  anywhere  match-set tor2-ddos src
ACCEPT  tcp  --  anywhere  anywhere  tcp dpt:https
ACCEPT  tcp  --  anywhere  anywhere  tcp dpt:9051

Yields:

Name: tor-ddos
Type: hash:ip
Revision: 4
Header: family inet hashsize 4096 maxelem 65536 timeout 43200
Size in memory: 275216
References: 2
Number of entries: 2319
Members:
20.168.30.28 timeout 22542
20.117.98.36 timeout 27874
132.226.207.189 timeout 11969
52.162.250.21 timeout 26253
13.89.0.172 timeout 26939
20.250.20.44 timeout 41541
20.3.99.115 timeout 43075
...... ....

Name: persec
Type: hash:ip
Revision: 4
Header: family inet hashsize 4096 maxelem 65536 timeout 3600
Size in memory: 2384
References: 2
Number of entries: 2
Members:
45.141.84.171 timeout 3176
135.181.110.168 timeout 173

Name: tor2-ddos
Type: hash:ip
Revision: 4
Header: family inet hashsize 4096 maxelem 65536 timeout 43200
Size in memory: 272
References: 2
Number of entries: 0
Members:

Name: persec2
Type: hash:ip
Revision: 4
Header: family inet hashsize 4096 maxelem 65536 timeout 3600
Size in memory: 272
References: 2
Number of entries: 0
Members:
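
For a quick overview of how full each set is, the "Name:" and "Number of entries:" header lines are enough. A minimal sketch that parses ipset list output from stdin, assuming one header field per line as ipset prints it; set_sizes is a hypothetical helper name.

```shell
# Print "<set name> <entry count>" for every set in `ipset list` output on stdin.
# set_sizes is a hypothetical helper name.
set_sizes() {
    awk '
        /^Name:/              { name = $2 }
        /^Number of entries:/ { print name, $4 }
    '
}
```

Something like ipset list -t | set_sizes (the -t option prints only the headers) would then show at a glance that tor2-ddos and persec2 are empty.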

Enkidu-6 commented 1 year ago

IPV6 lists don't get populated as much. Almost all attacks come from IPV4 addresses. In any case, you've only posted your iptables rules so I can't tell. Type: ip6tables -S -t mangle. If you see the rules, you're good.

cybermonkee commented 1 year ago

> IPV6 lists don't get populated as much. Almost all attacks come from IPV4 addresses. In any case, you've only posted your iptables rules so I can't tell. Type: ip6tables -S -t mangle. If you see the rules, you're good.

Doh - I found out what I have done wrong... doh... Thanks for your help.

cybermonkee commented 1 year ago

I have been running this configuration for about 10 days now - and it has helped enormously. I have not had any overload messages. I made a small adaptation to the scripts to collect the relay data directly and to count the overloaded relays so that I can tell if there is a DDoS attack on.

curl -s 'https://onionoo.torproject.org/details?search=running:true' -o - | jq -cr '.relays[].overload_general_timestamp' | sort -nr | sed '/null/d' | awk '{ print substr( $0, 1, length($0)-3 ) }' | wc -l

The number of overloaded relays, and the number of relays on the blocked lists has dropped significantly, so I think some of the DDoS countermeasures in tor relay have started to have an effect too.

Enkidu-6 commented 1 year ago

The relay list on onionoo updates once an hour at the top of the hour but it takes another 15 minutes or so to show up, so pulling the list more frequently will just give you the same old list. That is why I modified the compare and remove scripts to just use your current list if it's newer than 60 minutes.
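
The "use the current list if it's newer than 60 minutes" idea can be sketched as below. This is not the code from the compare and remove scripts, just an illustration; is_fresh is a hypothetical helper name, and find -mmin is a GNU/BSD extension, not strict POSIX.

```shell
# Succeed (exit status 0) if the file in $1 exists and was modified
# less than $2 minutes ago. is_fresh is a hypothetical helper name.
# Relies on `find -mmin`, available in GNU and BSD find.
is_fresh() {
    [ -f "$1" ] && [ -n "$(find "$1" -mmin "-$2")" ]
}
```

A script could then skip the download whenever is_fresh "$list" 60 succeeds, where $list stands for wherever the relay list is cached locally.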

As for the number of relays in the block list, they vary greatly depending on your Guard probability percentage. As your Guard probability goes up and your relay probability goes down you'll see fewer relays in your block list and vice versa.

I also took your suggestion and set up a cron to remove the relays once a minute since the new modified remove.sh version won't put any load on the list server.

The best way to notice a DDoS attack is to monitor your RAM usage. Once you see a spike in RAM you can check your block list and watch it grow in front of your eyes.
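
A simple way to put a number on RAM usage is to compute it from /proc/meminfo. A minimal sketch, assuming a Linux kernel that reports MemAvailable; mem_used_percent is a hypothetical helper name.

```shell
# Print used memory as a whole-number percentage, reading
# /proc/meminfo-style text from stdin.
# Assumes the MemTotal and MemAvailable fields are present.
# mem_used_percent is a hypothetical helper name.
mem_used_percent() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }
    '
}
```

Running mem_used_percent < /proc/meminfo from cron and logging the result next to the block list size would make the spikes easy to line up.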

I'm testing some modifications to the script so I can use less of conntrack and perhaps more of hashlimit as Conntrack is more memory intensive. So keep an eye on the repository for newer versions.

cybermonkee commented 1 year ago

I added the:

curl -s 'https://onionoo.torproject.org/details?search=running:true' -o - | jq -cr '.relays[].overload_general_timestamp' | sort -nr | sed '/null/d' | awk '{ print substr( $0, 1, length($0)-3 ) }' | wc -l

So I could monitor the overload status of the whole network.

Enkidu-6 commented 1 year ago

That list would be too old. From the time a server is overloaded until it actually shows as overloaded in the list may take about 6 hours or even longer. So by the time you find out, it's yesterday's news.

Today the attack was very severe, at least for me. For the first time ever the automatic DDoS protection by my provider was triggered. I received an email telling me my two relays are under attack and they're mitigating it. I checked the block list and sure enough I had 4000 IP addresses in there before the Anti DDoS by my provider was even triggered. The servers were running just fine, even before their mitigation was triggered. Just a big spike in RAM which went back to normal in about 5 to 10 minutes. No NTor drops and no overload. The number of IP addresses in the block list has gone down too.

The most important benefit of these rules is that even if Tor is overloaded once in a while, it won't take a big toll on your server and you won't have to reboot, unless you have very little RAM to spare. It will run happily, your RAM will recover, and within a heartbeat or two Tor goes back to green.

Enkidu-6 commented 1 year ago

> I added the:
>
> curl -s 'https://onionoo.torproject.org/details?search=running:true' -o - | jq -cr '.relays[].overload_general_timestamp' | sort -nr | sed '/null/d' | awk '{ print substr( $0, 1, length($0)-3 ) }' | wc -l
>
> So I could monitor the overload status of the whole network.

If all you're looking for is the number of overloaded relays perhaps you can use something simpler:

curl -s 'https://onionoo.torproject.org/details?type=relay&running=true' | grep overload_general | wc -l

Regardless, thank you for thinking of this. I think a list of IP addresses of overloaded relays can be useful. Perhaps to check them against the block list to see how many of the relays in the block list are actually the overloaded ones. I think I'll add that list to my tor-relay-lists repository for anyone who might be interested.

Thank you.
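
A variant of the simpler count that does not depend on how the JSON is split across lines: count occurrences of the key rather than matching lines. This is only a sketch, and it assumes the overload_general_timestamp key appears only for relays that have reported an overload; count_overloaded is a hypothetical helper name.

```shell
# Count occurrences of the overload_general_timestamp key in Onionoo
# details JSON read from stdin, regardless of line breaks.
# Assumes the key is present only for overloaded relays.
# count_overloaded is a hypothetical helper name; grep -o is a GNU/BSD extension.
count_overloaded() {
    grep -o '"overload_general_timestamp"' | wc -l | tr -d ' '
}
```

It would be used as curl -s 'https://onionoo.torproject.org/details?type=relay&running=true' | count_overloaded. Adding &fields=overload_general_timestamp to the query should shrink the download considerably, if I read the Onionoo documentation correctly.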

ToterEngel commented 1 year ago

I'm going to speak up again. It seems that there have been no major attacks on the network for a good week now. Almost all relays are currently green in the lists. For me, this calmed down so much that I was even able to start a second relay for the first time. Almost no NTor dropping at the moment.

Relay 1 (Guard): Blocked IPs: 4500 Connections: 6000 inbound / 2700 outbound

Relay 2 (middle): blocked IPs: 150 Connections: 2900 inbound / 3000 outbound

cybermonkee commented 1 year ago

> I'm going to speak up again. It seems that there have been no major attacks on the network for a good week now. Almost all relays are currently green in the lists. For me, this calmed down so much that I was even able to start a second relay for the first time. Almost no NTor dropping at the moment.
>
> Relay 1 (Guard): Blocked IPs: 4500 Connections: 6000 inbound / 2700 outbound
>
> Relay 2 (middle): blocked IPs: 150 Connections: 2900 inbound / 3000 outbound

I think this is why I have started to count the overload, to see what the trend is across all the relays.

I have two relays

Relay 1 (Guard): Blocked IPs :1835 Connections: 2335 inbound / 3026 outbound

Relay 2 (Guard): Blocked IPs: 3199 Connections: 2938 inbound / 3127 outbound

Enkidu-6 commented 1 year ago

> Relay 1 (Guard): Blocked IPs: 4500 Connections: 6000 inbound / 2700 outbound
>
> Relay 2 (middle): blocked IPs: 150 Connections: 2900 inbound / 3000 outbound

The fact that you have 4500 IP addresses in your block list shows that there's some sort of attack going on. It's just that you are not feeling it because you're blocking most of it. I think the best test would be to remove the iptables rules and see if your system runs smoothly.

ToterEngel commented 1 year ago

I just tested that. After 10 minutes the guard relay had risen to: Connections (62357 inbound, 2377 outbound)

This is the first time I've seen the following message in the log...

"Tor's file descriptor usage is at 100%. If you run out Tor will be unable to continue functioning."

But with a good 10TB of traffic a day, no wonder :-D

Enkidu-6 commented 1 year ago

It's different at different times. Right now, there's not much going on. I flushed my block list and in 15 minutes it only gathered 60 IP addresses. But once the attacks start, which is probably in a few hours, I easily gather 3500 - 4000 in a matter of 10 minutes.

Enkidu-6 commented 1 year ago

As I expected, the DDoS started a few minutes ago. I had removed the rules on one server as a test and was watching my other server. Once I noticed the DDoS on my other server, I turned on the filter. A completely empty block list gathered 3500 IP addresses in 5 seconds.

Enkidu-6 commented 1 year ago

Hey guys, before I close this issue I wanted to ask you how things are going with the scripts, especially with the new version. Any feedback is appreciated. How's the status of the NTor drops if any?

cybermonkee commented 1 year ago

I had a few dropped NTors yesterday on my first relay, but this relay had lost its guard and stable flags - I think because the previous script was a bit aggressive perhaps?

One thing I have noticed is that the IP addresses of the authority relays do not seem quite right. Therefore, I run an additional script that gets the relays, and then loads them into the allowed list. This may have also been a reason for my relay dropping flags.

I also added an additional allowed-relays list, and a script that puts the IPs from the blocked lists that match the relay lists into it.

Enkidu-6 commented 1 year ago

Thanks for the feedback about the authorities. It appears that moria1's address has changed. I checked my records and it's definitely not a typo. It's changed from 128.31.0.34 to 128.31.0.39. Any other IP address in there that I haven't noticed? I'll make modifications to the scripts. In the meanwhile you can save the ipset, change the IP and restore it:

ipset save allow-list -f /var/tmp/ipset.allow-list

Then edit and modify the IP address and then restore:

ipset restore -exist -f /var/tmp/ipset.allow-list
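
The "edit and modify the IP address" step could also be scripted. A minimal sketch that swaps one address for another in ipset save output, matching whole fields only so that a longer address containing the old one is left alone; swap_ip is a hypothetical helper name.

```shell
# Replace every field equal to the old address ($1) with the new one ($2)
# in `ipset save` output read from stdin. Whole-field match only.
# swap_ip is a hypothetical helper name.
swap_ip() {
    awk -v old="$1" -v new="$2" '
        { for (i = 1; i <= NF; i++) if ($i == old) $i = new; print }
    '
}
```

For example: swap_ip 128.31.0.34 128.31.0.39 < /var/tmp/ipset.allow-list > /var/tmp/ipset.allow-list.new, then restore from the new file. As far as I can tell, restoring with -exist adds the new address but does not remove the old one from a set that is already loaded; ipset del allow-list 128.31.0.34 would take care of that.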

As for losing your flags, I really doubt it would be due to the scripts. Even losing one authority won't stop the rest of them from reaching consensus. The previous script was allowing only two connections and the new one allows 4, which means you'll get far fewer dual-or relays in the block list. In either of those cases I've never lost a flag for either of my relays.

Have you made any fundamental modifications to the iptables rules?

As for the Ntor drop, I've found that increasing the NumCPUs value will fix that. Add this line to your torrc and restart Tor:

NumCPUs 16

I suggest a minimum of 12. By the way NumCPUs is kind of misleading. It has nothing to do with the number of CPUs you have. It just tells Tor to create that many workers to crunch the numbers. You can have 4 actual CPUs but 12 worker threads. As long as you make sure you have available workers for Tor to use, it will continue processing them. You'll get the NTor drops when all workers are busy processing.
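
To see what a relay is currently configured with before changing it, the value can be read back from torrc. A rough sketch, assuming a plain torrc without included files and that the last occurrence of an option is the one that counts; torrc_value is a hypothetical helper name.

```shell
# Print the value of a torrc option (name in $1, case-insensitive)
# from torrc text on stdin. Commented-out lines are ignored because
# their first field starts with "#". Prints nothing if the option is unset.
# Assumes the last occurrence wins. torrc_value is a hypothetical helper name.
torrc_value() {
    awk -v key="$1" '
        tolower($1) == tolower(key) { value = $2 }
        END { if (value != "") print value }
    '
}
```

For example, torrc_value NumCPUs < /etc/tor/torrc (the path may differ on your system) next to the output of nproc shows the configured workers versus the actual cores.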

cybermonkee commented 1 year ago

No major fundamental rule changes. The only real change I made was to add an additional allow list, and a corresponding iptables rule, for relays that I have identified as snagged. I make this list time bound.

ipset create -exist allowed-relays hash:ip family inet hashsize 4096 timeout 21600
iptables -t mangle -I PREROUTING -p tcp -m set --match-set allowed-relays src -j ACCEPT

And then I just have a cron job that gets the output from the compare scripts and loads it into the allowed-relays list - it's rare that I have many, but where possible I don't want to block relays.
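
The loading step of such a cron job might look like the sketch below. It assumes the compare scripts emit one IPv4 address per line, which is a guess about their output format; to_restore_lines is a hypothetical helper name.

```shell
# Turn a list of IPv4 addresses on stdin (one per line, assumed format)
# into "add <set> <ip>" lines suitable for `ipset restore`.
# Lines that do not look like an IPv4 address are dropped, duplicates removed.
# to_restore_lines is a hypothetical helper name; $1 is the set name.
to_restore_lines() {
    grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | sort -u | sed "s/^/add $1 /"
}
```

Piping the result into ipset restore -exist adds the addresses in one go, and each entry picks up the set's default timeout (21600 seconds in the create command above).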

I will take a look at the NumCPUs setting, I was reading about that before and found confusing information about it - thanks for the tip.

Enkidu-6 commented 1 year ago

Also check the order of your rules. The -I should have taken care of it but the allow lists must come before the hashlimit and conntrack rules. Just do :

iptables -L -t mangle --line-numbers

to make sure
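
The ordering check can be automated along these lines. A minimal sketch that reads the --line-numbers listing from stdin and reports whether every allow-list ACCEPT rule in PREROUTING sits above the first DROP or SET rule; check_allow_order is a hypothetical helper name and the "match-set allow" pattern assumes the set names used in this thread.

```shell
# Read `iptables -L -t mangle --line-numbers` output from stdin and print
# "ok" if all ACCEPT rules matching an allow set in PREROUTING come before
# the first DROP or SET rule, otherwise "misordered".
# check_allow_order is a hypothetical helper name.
check_allow_order() {
    awk '
        /^Chain / { chain = $2 }
        chain == "PREROUTING" && $1 ~ /^[0-9]+$/ {
            if ($2 == "ACCEPT" && /match-set allow/) last_allow = $1 + 0
            else if (($2 == "DROP" || $2 == "SET") && !first_filter) first_filter = $1 + 0
        }
        END {
            if (last_allow && (!first_filter || last_allow < first_filter)) print "ok"
            else print "misordered"
        }
    '
}
```

Usage: iptables -L -t mangle --line-numbers | check_allow_order.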

cybermonkee commented 1 year ago

Thanks.... I think it is right... :-/

Chain PREROUTING (policy ACCEPT)
num  target  prot opt source    destination
1    ACCEPT  tcp  --  anywhere  anywhere  match-set allowed-relays src
2    ACCEPT  tcp  --  anywhere  anywhere  match-set allow-list src
3    DROP    tcp  --  anywhere  anywhere  tcp dpt:https flags:FIN,SYN,RST,ACK/SYN limit: above 1/min burst 5 mode srcip
4    DROP    tcp  --  anywhere  anywhere  tcp dpt:9001 flags:FIN,SYN,RST,ACK/SYN limit: above 1/min burst 5 mode srcip
5            tcp  --  anywhere  anywhere  tcp dpt:https recent: SET name: tor-ddos side: source mask: 255.255.255.255
6            tcp  --  anywhere  anywhere  tcp dpt:9001 recent: SET name: tor2-ddos side: source mask: 255.255.255.255
7    SET     tcp  --  anywhere  anywhere  tcp dpt:https #conn src/32 > 4 add-set tor-ddos src
8    SET     tcp  --  anywhere  anywhere  tcp dpt:9001 #conn src/32 > 4 add-set tor2-ddos src
9    DROP    tcp  --  anywhere  anywhere  tcp dpt:https flags:FIN,SYN,RST,ACK/SYN #conn src/32 > 4
10   DROP    tcp  --  anywhere  anywhere  tcp dpt:9001 flags:FIN,SYN,RST,ACK/SYN #conn src/32 > 4
11   DROP    tcp  --  anywhere  anywhere  match-set tor-ddos src
12   DROP    tcp  --  anywhere  anywhere  match-set tor2-ddos src
13   ACCEPT  tcp  --  anywhere  anywhere  tcp dpt:https
14   ACCEPT  tcp  --  anywhere  anywhere  tcp dpt:9001

Chain INPUT (policy ACCEPT)
num  target  prot opt source    destination

Chain FORWARD (policy ACCEPT)
num  target  prot opt source    destination

Chain OUTPUT (policy ACCEPT)
num  target  prot opt source    destination

Chain POSTROUTING (policy ACCEPT)
num  target  prot opt source    destination

cybermonkee commented 1 year ago

I use this to get the IP addresses for the authorities:

curl -s 'https://onionoo.torproject.org/summary?search=flag:authority' -o - | jq -cr '.relays[].a[0]' | grep -v null | tr -d '][' | sort -n

Enkidu-6 commented 1 year ago

I have a few lists in my other repository that you can pull as well:

https://github.com/Enkidu-6/tor-relay-lists

Enkidu-6 commented 1 year ago

By the way if the updated time and date lines in any of the lists interfere with your scripts you can always pull them with an sed:

| sed -e '1,3d'

ToterEngel commented 1 year ago

So yesterday my 2nd guard relay was shown as overloaded for a few hours. However, until today no Ntor drop was shown in the logs.

I just checked and now there is "Ntor dropped (1165545) fraction 17.4642%" -> 3100 blocked IPs

My 1st guard relay even: "Ntor dropped (6486901) fraction 45.4110%" -> 1600 blocked IPs

So there is still some work to do to optimize the script ^^

Enkidu-6 commented 1 year ago

Have you tried increasing the number of CPUs? As I mentioned before, that pretty much is the only way to get rid of the NTor drops. Try adding

NumCPUs 16

to your torrc file if you haven't done so before.

Enkidu-6 commented 1 year ago

Also please let me know if you're using the new scripts or the old ones

ToterEngel commented 1 year ago

So far I've been using NumCPUs 12 on my i7. I took over every change to the scripts in a timely manner. So the current version of the script runs.

Enkidu-6 commented 1 year ago

NumCPUs 12 is pretty much the magic number for my relays, but I'm running them at Max Advertised Bandwidth of 20 MiB. If your Max is higher, you should choose a higher number.

ToterEngel commented 1 year ago

Yes, I have currently limited the bandwidth to 45 mb. Without attacks, it still ran cleanly even with >70mb.

Enkidu-6 commented 1 year ago

45 megabits as in 5.6 MiB? Your 8 VCPU processor should easily be able to handle that. I have a feeling there's some other kind of problem you're dealing with which I can't pinpoint. The last time I rebooted for updates, the system was running for about 25 days without a single NTor drop in logs and as I said I'm running them at max Advertised Bandwidth of 160 mbits or 20 MiB.

Throttling the connections and allowing only 4 connections per IP is the best reasonable compromise we can make without putting too many legitimate IPs in the block list. The same goes for the 12 hour limit on the block list. A shorter expiry will put more pressure on your system, as the IPs will get released earlier and your system has to deal with them all over again.

ToterEngel commented 1 year ago

I meant 45 Mbytes -> 360 MBit and with 2 relays on the machine, the Gbit port is usually well utilized. I let through between 7 and 10 TB of traffic at Hetzner every day.

But as previously written, without attacks on the network, it also runs with more bandwidth without drops. Now, however, the nature of the attack seems to have changed somewhat. I have never had such high drop values in the past.
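
Since the megabit/megabyte mix-up comes up in this thread, here is the arithmetic as a small sketch. It uses decimal units (1 byte = 8 bits, 1 TB = 1,000,000 MB) and ignores the MB versus MiB distinction; the function names are made up for illustration.

```shell
# Bandwidth unit conversions in decimal units. All names are hypothetical helpers.
# Mbit/s -> MB/s
mbit_to_mbyte() { awk -v x="$1" 'BEGIN { printf "%.3f\n", x / 8 }'; }
# MB/s -> Mbit/s
mbyte_to_mbit() { awk -v x="$1" 'BEGIN { printf "%.0f\n", x * 8 }'; }
# MB/s sustained for 24 hours -> TB per day (one direction only)
tb_per_day()    { awk -v x="$1" 'BEGIN { printf "%.3f\n", x * 86400 / 1000000 }'; }
```

mbit_to_mbyte 45 gives 5.625 and mbit_to_mbyte 160 gives 20.000, while mbyte_to_mbit 45 gives 360. A sustained 45 MB/s works out to tb_per_day 45 = 3.888 TB per day in one direction.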