firezone / firezone

Enterprise-ready zero-trust access platform built on WireGuard®.
https://www.firezone.dev
Apache License 2.0

Flaky systemd-resolved DNS test #4921

Status: Closed (jamilbk closed this 1 week ago)

jamilbk commented 4 months ago

https://github.com/firezone/firezone/actions/runs/9006460494/job/24744332707

Seen this one happen a few times. Maybe a timing issue / race condition that a sleep could fix?

ReactorScram commented 4 months ago

Will look into it after #4899

I thought maybe systemctl start is returning before our tunnel interface is all the way up. When the test passes, the resolvectl dns tun-firezone call prints the sentinel: https://github.com/firezone/firezone/actions/runs/9006458384/job/24744284872#step:6:65

But systemctl start is supposed to wait for us to notify systemd that we're ready, and we only do that if we're running as an IPC service (not applicable) or right after we configure DNS.

Maybe the DNS configuration is secretly async on the inside. I'll add some debug logs as the next step
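If the DNS configuration really is async on the inside, one alternative to a fixed sleep in the test would be to poll for the expected state with a deadline. The following is only a minimal sketch of that idea, not code from the Firezone repo: wait_until is a hypothetical helper, and the check it polls (for example, whether resolvectl dns tun-firezone prints the sentinel) is left to the caller.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: polls `check` until it returns true or `timeout` elapses.
/// Returns true if the condition was met, false if the deadline passed first.
fn wait_until<F>(timeout: Duration, interval: Duration, mut check: F) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        // Always check at least once, even with a zero timeout.
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Back off briefly so we don't spin while the system settles.
        thread::sleep(interval);
    }
}
```

A test could call this with a timeout of a few seconds and fail with a clear message when it returns false, which would also distinguish "slow" from "never configured".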

ReactorScram commented 3 months ago

Happened again, but debug_exit didn't do what I needed: https://github.com/firezone/firezone/actions/runs/9009905220/job/24755105430

jamilbk commented 3 months ago

Fixed by #4962

ReactorScram commented 3 months ago

Still happening. https://github.com/firezone/firezone/actions/runs/9290693824/job/25567691838

PR #5111 will move the systemd notification up to the Client so maybe that will give us more control over it.

ReactorScram commented 2 months ago

Another replication: https://github.com/firezone/firezone/actions/runs/9469788060/job/26089691139. In this case DNS didn't come under our control, even though our logs indicate we thought it had: https://github.com/firezone/firezone/actions/runs/9469788060/job/26089691139#step:6:93
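Since the logs and the actual resolver state can disagree, one option would be to read the state back after configuring it instead of trusting our own log line. This is a rough sketch under assumptions: resolvectl_dns and dns_output_contains are hypothetical helper names, the expected server address is whatever sentinel the caller passes in, and the match is a plain whole-token comparison so it does not depend on the exact output layout of resolvectl.

```rust
use std::io;
use std::process::Command;

/// Hypothetical helper: runs `resolvectl dns <interface>` and returns its stdout.
fn resolvectl_dns(interface: &str) -> io::Result<String> {
    let out = Command::new("resolvectl").args(["dns", interface]).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

/// Returns true if `expected` appears as a whole whitespace-separated token
/// in `output`. Whole-token matching avoids treating 192.0.2.10 as 192.0.2.1.
fn dns_output_contains(output: &str, expected: &str) -> bool {
    output.split_whitespace().any(|token| token == expected)
}
```

Calling dns_output_contains(&resolvectl_dns("tun-firezone")?, sentinel) right after the DNS setup step would surface the mismatch in our own logs rather than only in the test.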

ReactorScram commented 1 month ago

Still happening - #5911

This call to sd_notify happens way too early: https://github.com/firezone/firezone/blob/bf693ad83f2fedf2380a843311c05538318b9598/rust/headless-client/src/standalone.rs#L186

That may not be the only root cause, but it's not helping anything.
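To illustrate the ordering problem: with a notify-type service, systemctl start returns once the service sends READY=1 to the socket named in NOTIFY_SOCKET, so anything done after that send can race with the test. Below is a minimal sketch of "configure DNS first, notify second". It is not the code in standalone.rs: notify_ready and start_up are hypothetical names, configure_dns stands in for whatever actually sets up DNS, and abstract-namespace sockets (paths starting with @) are not handled.

```rust
use std::io;
use std::os::unix::net::UnixDatagram;
use std::path::Path;

/// Sends `READY=1` as a single datagram to the notification socket.
/// Sketch only: does not handle abstract-namespace sockets ("@..." paths).
fn notify_ready(socket_path: &Path) -> io::Result<()> {
    let sock = UnixDatagram::unbound()?;
    sock.send_to(b"READY=1", socket_path)?;
    Ok(())
}

/// Hypothetical startup sequence: DNS must be fully configured before
/// we tell systemd we are ready. If DNS setup fails, we never notify.
fn start_up<F>(socket_path: Option<&Path>, configure_dns: F) -> io::Result<()>
where
    F: FnOnce() -> io::Result<()>,
{
    configure_dns()?;
    // `None` models running outside systemd (NOTIFY_SOCKET unset).
    if let Some(path) = socket_path {
        notify_ready(path)?;
    }
    Ok(())
}
```

In a real service the socket path would come from the NOTIFY_SOCKET environment variable; it is a parameter here so the ordering can be exercised without systemd.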

ReactorScram commented 1 month ago

https://github.com/firezone/firezone/actions/runs/10068439728/job/27834479145

ReactorScram commented 1 month ago

https://github.com/firezone/firezone/actions/runs/10080667883/job/27871200510

ReactorScram commented 1 month ago

That PR may not have main merged

jamilbk commented 1 month ago

Nah it's got the sleep :-(

https://github.com/firezone/firezone/actions/runs/10080667883/job/27871200510#step:5:19

thomaseizinger commented 1 week ago

Fixed? Not seen this one in a while.

ReactorScram commented 1 week ago

#6026 might have fixed it. It merged into main 2 days after Jamil's last comment, so that could explain it.

ReactorScram commented 1 week ago

Will close for now