Closed weiznich closed 2 years ago
This affected crates.io so I've marked this as high priority. cc @pietroalbini
There is relevant discussion at https://gitter.im/diesel-rs/diesel?at=61ddafba9b470f38975fb9a2
As far as I can tell, the issue is that libpq never finishes our connection test query here: https://github.com/diesel-rs/diesel/blob/09f8bcd78e3fb634ed69c66f46695b821a5b3822/diesel/src/r2d2.rs#L90
I'm really unsure how and where we should fix this. Libpq does not seem to offer any useful option for timeouts here (at least not without using their async interface, which would require rewriting the whole postgres connection implementation). Possible ideas are:
- Using the `tcp_user_timeout` libpq URL parameter. It seems to work for the reproduction given above: running

  ```
  DATABASE_URL="postgres://localhost/diesel_test?tcp_user_timeout=2000" cargo run
  ```

  and dropping the packets via iptables while the test application is running causes a panic as expected.
Edit: Turns out that was a false positive due to other changes. Unfortunately those changes make the pool not detect broken connections at all. The described behaviour happened on the first use :disappointed:
Relevant r2d2 issue: https://github.com/sfackler/r2d2/issues/116
- Enforce the timeout in r2d2. That would likely require a separate thread for the connection check, which would slow things down drastically. It also raises the question of what should happen with a stuck thread there, as there is no real way to cancel an OS thread.
I've tried this and it seems like it does not work, as we are not able to move the connection into the thread at all. `is_valid` gets a `&mut PgConnection`, but `std::thread::spawn` requires `'static` to be able to access values. Seems like there is probably no easy way to solve this at all.
Thanks for looking into this! I will probably have more time to spend on investigating this either tomorrow or Friday.
I've opened #3017 with a potential fix, but I'm not too happy about that change.
Smaller program reproducing this bug:
```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;

const CONNECTION_URL: &str = "postgres://pietro:pietro@localhost/cratesio";

fn main() {
    let _ = packet_loss(false);
    if std::env::args().any(|arg| arg == "reset") {
        return;
    }
    let conn = PgConnection::establish(CONNECTION_URL).unwrap();
    assert!(example_query(&conn));
    packet_loss(true).unwrap();
    assert!(!example_query(&conn));
    packet_loss(false).unwrap();
    assert!(example_query(&conn));
}

fn example_query(conn: &PgConnection) -> bool {
    println!("running example query...");
    diesel::sql_query("SELECT 1;").execute(conn).is_ok()
}

fn packet_loss(enable: bool) -> Result<(), ()> {
    std::process::Command::new("iptables")
        .arg(if enable { "-A" } else { "-D" })
        .args(&["OUTPUT", "-p", "tcp", "-m", "tcp", "--dport", "5432", "-j", "DROP"])
        .status()
        .map_err(|_| ())?;
    println!("packet loss: {enable:?}");
    Ok(())
}
```
You can run it on a Linux system with root privileges:
```
sudo -E ~/.cargo/bin/cargo run
```
Note that the program will modify the firewall rules automatically. They will reset once the computer reboots, but you can also reset them manually by running:
```
sudo -E ~/.cargo/bin/cargo run reset
```
@pietroalbini Thanks for the minimized example. I think I've already found a potential solution with #3017. Can you provide some information about the needs of the crates.io team here:
I'll check with the rest of the team and get back to you in the coming days!
Just want to comment that we also use diesel's r2d2 integration with postgres and are affected by this issue as well! Would love to have a fix upstream.
@garbageslam Please don't comment on issues if you have nothing to add other than that this affects you as well and that you would like to see someone else fix it for you.
I had another look at this this week. Let me briefly summarize the results here:

- Rewrite the connection check to use `poll`/`select` instead of busy waiting on new input data. This works and seems to "fix" the issue, as we can abort after a fixed amount of time. The downside is that this requires quite heavy modifications to the implementation of `PgConnection` and forces us to re-implement larger parts of libpq itself. Additionally, libpq documents that you should call `PQcancel` to indicate to the server that we are no longer interested in the result. Obviously that will not be sent to the server if the network connection is gone, but we should nevertheless issue it, because users could hit that timeout for other reasons as well. Now the problem is that `PQcancel` is a blocking function, so if the network is gone it will just wait forever. That means we would be back to square one. I do not see any good way to call `PQcancel` only if we are sure that the network is gone, because we just don't know the reason for the timeout. So that seems like a dead end to me. I think that was also the reason why the postgres developers refused to put something like that into postgresql itself.
- Set `TCP_USER_TIMEOUT` directly on the underlying socket, which should override the kernel's default retransmission behaviour. To test that I've used `nix::TcpUserTimeout` to set it, and it seemed to solve the issue locally. At least the example above did fail after some time. That would be easy to implement, but it is only available on Linux.
- Use the corresponding feature of `libpq` again. They have a `tcp_user_timeout` option in their connection strings as well, and it seems like I even tested that earlier. I've tried it again, as manually setting this parameter on the socket itself did resolve the issue. So, long story short, setting `DATABASE_URL="postgres://localhost/diesel_test?tcp_user_timeout=200"` works for me, so I'm not sure what went wrong with that earlier on.

Given those facts I would prefer to just use the features libpq provides here, as that seems to work.
cc @pietroalbini
Sorry to jump in, but I believe we have been experiencing this problem with the Azure PostgreSQL service. We see the PG server side drop TCP sessions and our system goes into the TCP retry loop, which with default Linux kernel settings can take up to 15 minutes to complete even on a fast network. We were surprised that our apps were blocking on that timeout before the diesel/r2d2 timeouts were taking effect.
Setting TCP_USER_TIMEOUT to 200ms seems really aggressive, but that might be an example rather than a recommendation. TCP_USER_TIMEOUT takes precedence over other TCP timeouts like keepalive and TCP retry settings. TCP retries start at just over 200ms on a low-latency network, so if you've just lost a packet or two the system wouldn't have time to recover. If you set this really aggressively and your db connection went idle, you would probably end up killing the connection before a reasonable keepalive period would pass. If you're pooling, I think that would cause the connections to flap and might exhaust TCP connection limits on the server side. At the very least you would end up paying the overhead of establishing the TCP/TLS session a lot more often.
If the workaround is to "fail fast" at the TCP layer, we are accomplishing that by reducing the number of TCP retries from the default of 15 to 3 or 4 by setting `net.ipv4.tcp_retries2=4` at the system level. This isn't ideal since it's a system-wide setting that requires elevated privileges, and it's more annoying if you're using containers since it needs to be set in each container.
Here's a good article on linux tcp retries and how the timeout works.
Our wish is that the diesel/r2d2 wouldn't be blocked by the network IO and would move on to retry on a healthy connection after a timeout period so we wouldn't have to do system level changes to reduce the impact of a less than ideal network connection.
In any case, thanks for working on this! :+1:
Thanks for your comment here.
> We were surprised that our apps were blocking on that timeout before the diesel/r2d2 timeouts were taking effect.
Diesel and r2d2 are two different crates. r2d2 already has an issue for this: https://github.com/sfackler/r2d2/issues/116. The short answer is: They cannot do much there, as there is no way to stop in the middle of executing stuff.
> Setting TCP_USER_TIMEOUT to 200ms seems really aggressive, but that might be an example rather than a recommendation. TCP_USER_TIMEOUT takes precedence over other TCP timeouts like keepalive and TCP retry settings. TCP retries start at just over 200ms on a low-latency network, so if you've just lost a packet or two the system wouldn't have time to recover. If you set this really aggressively and your db connection went idle, you would probably end up killing the connection before a reasonable keepalive period would pass. If you're pooling, I think that would cause the connections to flap and might exhaust TCP connection limits on the server side. At the very least you would end up paying the overhead of establishing the TCP/TLS session a lot more often.
>
> If the workaround is to "fail fast" at the TCP layer, we are accomplishing that by reducing the number of TCP retries from the default of 15 to 3 or 4 by setting `net.ipv4.tcp_retries2=4` at the system level. This isn't ideal since it's a system-wide setting that requires elevated privileges, and it's more annoying if you're using containers since it needs to be set in each container.
>
> Here's a good article on linux tcp retries and how the timeout works.
To be clear: `tcp_user_timeout=200` is just an example. In the end you need to adjust that value to your needs, not just use it because I wrote that it works in my tests. I chose 200 just because I did not want to wait that long 🤷 . I've also verified now that this works just fine with larger values, even without changing anything at the operating system level.
> Our wish is that the diesel/r2d2 wouldn't be blocked by the network IO and would move on to retry on a healthy connection after a timeout period so we wouldn't have to do system level changes to reduce the impact of a less than ideal network connection.
To be clear here: there is really not much we can do. You cannot just abort operations at the operating system level; if something is blocking, it just blocks until it's done. That's how these things are designed. If you don't want that, you need to use an async approach. For diesel that would be `diesel-async`, which allows you to control such things in detail. And to be clear: there is exactly one reason I haven't closed this issue as "won't fix, works as designed" yet, namely that the crates.io folks require a solution with sync diesel for technical reasons.
> So, long story short, setting `DATABASE_URL="postgres://localhost/diesel_test?tcp_user_timeout=200"` works for me, so I'm not sure what went wrong with that earlier on.
I played around with `tcp_user_timeout`, and indeed it allows configuring the overall timeout to whatever value we find acceptable in production. That sounds great; if you want to close the issue, please do!

By the way, I also remember trying `tcp_user_timeout` back then in my investigation, and I apparently also did something wrong when I tested it. Somehow reassuring that I wasn't the only one to try it wrong :sweat_smile:
Setup
Versions
Feature Flags
Problem Description
Cargo.toml
main.rs:
Using the following reproduction steps produces an application freeze:

1. `DATABASE_URL=… cargo run`
2. `sudo iptables -A OUTPUT -p tcp --dport 5432 -j DROP`
What is the expected output?
I expect a panic due to the unwrap in line 199.
What is the actual output?
The output stops and the application seems to hang.
Steps to reproduce
Using the following reproduction steps produces an application freeze:

1. `DATABASE_URL=… cargo run`
2. `sudo iptables -A OUTPUT -p tcp --dport 5432 -j DROP`
Checklist