Open jgneff opened 2 years ago
I noticed some OSU server connection timeouts this weekend, e.g. https://github.com/apache/netbeans/actions/runs/3349423376/jobs/5549411354

I couldn't find the cause, but it led me to a bug in how our workflow invalidates the cache #4886. I suppose you tested your setup this weekend? :)
java.io.IOException: Could not connect to https://netbeans.osuosl.org/binaries/4B4DCA62F8C4A1954AE6D286955C36CC50B8CC3A-exechlp-1.2.zip within 15000 milliseconds
I suppose you tested your setup this weekend? :)
That could have been me, based on the timestamp (Fri 28 Oct 2022 05:41:42 PM PDT). That's the part I don't know: whether the OSU Web servers mitigate such attacks only through a RequestReadTimeout directive, or whether they also limit the maximum number of connections from a single IP address. See the section "How is a Slowloris attack mitigated?" on the Cloudflare page about Slowloris.
If they do both, then the DoS attack is really just self-inflicted with a limited impact on other users of the Web server. I suspect, though, that they're using only the request-read header timeout, which means other users could encounter problems, too.
The way it works here is that we download everything into a cache. I think most CI builds shouldn't ping the server at all (assuming everything works as expected) since the cache is shared. Local builds work the same way: devs only download the libs during the first build; subsequent builds download only the delta, if there is any. (edit: basically like Maven)
I think most CI builds shouldn't ping the server at all (assuming everything works as expected) since the cache is shared.
Right. I hit this bug because the Launchpad build farm runs each build in a transient container created from trusted images to ensure a clean and isolated build environment. It starts every build entirely from scratch.
Hi, I'm from the OSUOSL as mentioned on this issue. I see that @jgneff created a ticket on our support system that referenced this issue. I wanted to give some background on how your mirrors are set up in case that impacts how you fix this.
We use mod_limitipconn with MaxConnPerIP 20 set, and we have reqtimeout_module enabled with the following settings:
RequestReadTimeout header=20-40,minrate=500
RequestReadTimeout body=10,minrate=500
Hopefully this helps you out! Let me know if you need anything else from us.
I noticed some OSU server connection timeouts this weekend, e.g. https://github.com/apache/netbeans/actions/runs/3349423376/jobs/5549411354

I couldn't find the cause, but it led me to a bug in how our workflow invalidates the cache #4886. I suppose you tested your setup this weekend? :)
FWIW, this seems to line up with a DDoS against our DNS servers that we were dealing with at the time (which unfortunately also happened this morning).
@ramereth Thank you for commenting, Lance. That answers some of my lingering questions.
We use mod_limitipconn with MaxConnPerIP 20 set
That answers the question in my previous comment, and indicates this really is just a self-inflicted denial-of-service attack affecting only the person running the build (me!). It also explains why the server responds with status code 503, as described in the README file of mod_limitipconn.
The NetBeans build makes over 95 connections through the Squid proxy server to netbeans.osuosl.org, so I'm surprised that it sometimes works. Perhaps Squid is multiplexing those onto a smaller set of forwarding connections, or holding off on connecting until it receives a request header. I'll look into it.
RequestReadTimeout header=20-40,minrate=500
That confirms the 20-second timeout I'm seeing on the unused connections which send no request headers.
The NetBeans build makes over 95 connections through the Squid proxy server to netbeans.osuosl.org, so I'm surprised that it sometimes works. Perhaps Squid is multiplexing those onto a smaller set of forwarding connections, or holding off on connecting until it receives a request header. I'll look into it.
@jgneff this might be because you're hitting one of the other two servers that are in the DNS rotation. If you haven't hit those as much, it would likely continue working.
If you'd like us to make any changes on our end to make this better, please let me know. I'm certainly willing to make a change if it makes sense and doesn't impact our service.
The NetBeans build makes over 95 connections through the Squid proxy server to netbeans.osuosl.org, so I'm surprised that it sometimes works. ... I'll look into it.
Here's what I found. The NetBeans build avoids the connection limit of mod_limitipconn by not sending any request headers at all. That module hooks in too late in the request processing phase to enforce its limit on such unused, idle connections. Looking at the source code: mod_reqtimeout is ap_hook_process_connection, called during connection processing, while mod_limitipconn is ap_hook_quick_handler, called later during request processing when the request headers are read and parsed.

My experiments confirm this. I can make 100 connections to netbeans.osuosl.org through the proxy server, perform the TLS handshake, and as long as no request headers are sent, they'll just sit there for 20 seconds until they are closed by mod_reqtimeout. If, on the other hand, I make 40 connections and send the request headers immediately, only 8 of them are successful. The other 32 return "503 Service Unavailable" due to the per-IP connection limit of mod_limitipconn.
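For readers who want to see this behavior for themselves, here is a rough sketch of the idle-connection experiment, not the exact commands used above. It assumes a local Squid proxy at 127.0.0.1:3128 and OpenSSL 1.1.0 or later for the -proxy option; keep the count small to avoid burdening the server.

#!/bin/bash
# Open a few idle TLS connections to netbeans.osuosl.org through the proxy
# and time how long the server lets them sit before closing them.
# The proxy address is an assumption; adjust it for your environment.
count=${1:-5}
time {
    for i in $(seq 1 "$count"); do
        # -quiet implies -ign_eof, so each connection stays open after the
        # TLS handshake even though no request headers are ever sent.
        openssl s_client -proxy 127.0.0.1:3128 \
            -connect netbeans.osuosl.org:443 \
            -servername netbeans.osuosl.org -quiet </dev/null &
    done
    # Each client exits only when the server closes its connection, which
    # should take about 20 seconds with the RequestReadTimeout shown earlier.
    wait
}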
There's an experimental mod_noloris from Apache that looks interesting and appears to enforce the per-IP connection limit earlier. See also mod_antiloris and a good write-up called "Slowloris And Mitigations For Apache".
@jgneff how would you like to proceed on this?
@ramereth Thanks for asking. I have not found any problems that would require a change in the Web server on your end of the connection. On the contrary, when testing with a fix or a workaround, I have yet to encounter any errors with netbeans.osuosl.org at all. So thank you and your team for such a reliable archive!

I have been working on a more general fix for the past couple of weeks. I plan to submit it as a new pull request and close the current one. So far in my testing, the fix is working well and makes a predictable set of just 10 connections to netbeans.osuosl.org and two connections to repo1.maven.org while downloading all of the external binaries.
I removed this issue from the NB17 milestone. The linked PR #4206 explicitly requests not to be merged yet.
Apache NetBeans version
Apache NetBeans 16 release candidate
What happened
Building NetBeans in an environment that defines both proxy variables causes a brief denial-of-service (DoS) attack on one of the Web servers hosted by the Oregon State University (OSU) Open Source Lab. Below is a typical example of the variables being defined and exported to the environment:
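(The proxy host and port below are placeholders for illustration; the actual values depend on the build environment's proxy server.)

# Placeholder proxy address; use the address of your own proxy server.
export http_proxy=http://proxy.example.com:3128
export https_proxy=http://proxy.example.com:3128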
The same build also causes an attack on the Sonatype Maven Central Repository, although the files in Maven Central are hosted by the Fastly Content Delivery Network (CDN), allowing it to continue serving its content regardless.
The attack is a timeout-exploiting connection flood, which works by establishing pending connections with the target server. It is similar to the Slowloris attack, but less effective because it doesn't avoid timeouts through the use of partial requests. Instead, the NetBeans build opens hundreds of connections to the target Web servers and never sends any request headers at all. The connections are closed only when they time out on the server side or when the process that created them terminates on the build side.
Specifically, the build opens 469 unused connections to repo1.maven.org (an alias for sonatype.map.fastly.net), sends no request headers, and leaves them open. The connections eventually time out on the server side, but that can take up to 20 seconds. This attack is unsuccessful in my experience, likely due to the Fastly CDN.

The build opens only 95 unused connections to netbeans.osuosl.org (an alias for ftp.osuosl.org). Because the Open Source Lab hosts its files directly, though, the attack is usually successful in exhausting all request handlers. The Web server then returns an HTTP response status code of 503, which terminates the build. Even when the Web server is able to handle the load and the build is successful, it can take up to 20 seconds for the superfluous connections to time out on the server side.

An unaware developer can assume that the build failure is a transient error and repeat the build until it's successful, as I did before uncovering the source of the problem. That has the unfortunate effect of turning a brief, one-time attack into a dozen or more repeated attacks throughout the day.
Furthermore, the build attempts to make 564 direct connections to the remote Web servers even when a proxy server is defined and working, which results in a waste of resources on the build machine itself. The build also creates more than 1,692 unnecessary operating system threads, even when no proxy servers are defined, and leaves them waiting in the system until the build completes.
How to reproduce
There are three ways to reproduce the problem:

1. run a remote build on Launchpad,
2. run the netbeans-proxies test program, or
3. run the download-all-extbins build task locally.
The second and third methods require setting up a local firewall and proxy server.
Launchpad
One way to reproduce the problem is to run a remote build on Launchpad, which has a strict firewall and permits outbound connections only through its proxy server. Launchpad runs the build in an LXD container.
That's how I discovered the problem, but this method provides no diagnostic information other than the 503 response code when the build fails. To find out what's really going on, you need to reproduce it locally.
netbeans-proxies
I wrote a simple program, called netbeans-proxies, that safely illustrates the problem without creating a burden on the target server. The program downloads just 14 kilobytes in five files, whereas a clean build of NetBeans downloads at least 754 megabytes in 564 files.
The program makes it easy to run, test, and debug the NetBeans build task and even step through its code one statement at a time. See the GitHub repository jgneff/netbeans-proxies for details on setting up its environment and running the tests.
download-all-extbins
To reproduce the problem using the actual NetBeans build, run the build on the same system that you set up for the netbeans-proxies program above. Disable the firewall long enough to clone the NetBeans repository:
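For example, the minimal form of that step would be the following (the branch or tag to build is not specified here):

git clone https://github.com/apache/netbeans.git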
Then double check that the firewall is active, save a backup of the original repository, set the proxy environment variables, and run just the downloading task as follows:
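A sketch of those steps, assuming a ufw firewall, the placeholder proxy address from above, and an Ant target matching this section's heading; the exact commands may differ in your setup:

sudo ufw status                    # confirm the firewall is active (assumes ufw)
cp -a netbeans netbeans.orig       # save a backup of the original repository
export http_proxy=http://proxy.example.com:3128
export https_proxy=http://proxy.example.com:3128
cd netbeans
ant download-all-extbins           # assumed invocation of the downloading task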
Before running the build a second time, you'll need to remove the cached files and start with a fresh copy of the repository, thereby removing the files that were downloaded into its subdirectories. For example:
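For instance, assuming the downloaded binaries are cached under ~/.hgexternalcache (the cache location may differ in your setup) and the backup copy was saved as netbeans.orig:

rm -rf ~/.hgexternalcache          # assumed location of the binaries cache
rm -rf netbeans
cp -a netbeans.orig netbeans       # restore the pristine repository copy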
If the build fails, you'll see an error message reporting the 503 response from netbeans.osuosl.org.
Did this work correctly in an earlier version?
No / Don't know
Operating System
Ubuntu 20.04.5 LTS (Focal Fossa)
JDK
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
Apache NetBeans packaging
Own source build
Anything else
Although the DoS attacks occur with every build when both proxy variables are defined, the build itself fails for me in about three out of every four runs.
Web servers
The NetBeans build downloads its external binaries from two Web servers: repo1.maven.org and netbeans.osuosl.org.

Whether the build works or fails may depend partly on which address you receive for the Open Source Lab Web server.
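You can check which addresses your resolver returns with, for example:

host netbeans.osuosl.org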
One set of addresses is owned by the University of Oregon in Eugene, Oregon, while the other two sets of addresses are owned by TDS TELECOM in Madison, Wisconsin.
The Maven Central Repository is hosted behind the Fastly CDN, which seems capable of handling the connection flood.
Workaround
There is a partial workaround for the problem: simply unset one of the proxy environment variables, like so:
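(Which of the two variables to unset is not specified here; either one of them serves as the example.)

unset https_proxy    # or: unset http_proxy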
Even with this workaround, though, the build still tries to make hundreds of direct connections to the remote Web servers, but those are presumably blocked by the firewall.
Access logs
Below are the Squid access log files that I recorded from three full builds of NetBeans:
The complete log files are included below. The hundreds of superfluous connections can be identified by those to netbeans.osuosl.org that transferred only 176 bytes and those to repo1.maven.org that transferred only 180 bytes. There are other unused connections in the log files, but those are the easiest to identify.

The exchange on the unused connections starts with an outgoing request to the proxy server:
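In essence, this is a standard HTTP CONNECT tunneling request, something like the following (the exact headers sent by the build's HTTP client may differ):

CONNECT netbeans.osuosl.org:443 HTTP/1.1
Host: netbeans.osuosl.org:443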
followed by the response from the proxy server:
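For Squid, that response is typically just a status line such as:

HTTP/1.1 200 Connection established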
followed by what appears to be an abbreviated TLS handshake, after which the connection is idle until it's closed by the remote Web server due to a timeout.
The first log file can be included inline, but the other two are too big for an issue comment and must be included as an attachment.
access-bypass.log
23 connections, all of them good:
access-failed.log
566 connections, at least 543 of them unused:
access-failed.log
access-worked.log
573 connections, at least 541 of them unused:
access-worked.log
Are you willing to submit a pull request?
Yes
Code of Conduct
Yes