actions / runner-images

GitHub Actions runner images

regression in integration tests for ubuntu runner images; tests pass locally but fail in GitHub Actions (followup of #6981) #7402

Closed: malandis closed this issue 4 months ago

malandis commented 1 year ago

Description

This is a follow-up to #6981

Summary of #6981

Our organization publishes networked client software in multiple languages. For our python client, we have observed a regression in integration tests when switching to ubuntu-22.04 runners. The integration tests pass consistently when using an ubuntu-20.04 runner, but are consistently flaky and fail when using ubuntu-22.04.

To contrast, when doing local development on ubuntu 22.04, the integration tests consistently pass. We have verified this in multiple environments, both in a desktop environment and on cloud VMs. We have never observed locally the behavior we see on the failed jobs.

@Alexey-Ayupov requested a network capture, and then reported:

The only problem I can see is in the DNS protocol communication.

He then escalated to the Azure support team.

New, related issues

We observe similar timeouts in languages other than Python. We experience the timeouts in Node.js on both ubuntu-20.04 and ubuntu-22.04; we also observe timeouts in Rust.

cc @Alexey-Ayupov

Platforms affected

Runner images affected

Image version and build link

Node.js failed tests for ubuntu-20.04
Node.js failed tests for ubuntu-22.04
Rust failed tests for ubuntu-22.04

Is it regression?

Yes. The linked pull requests succeed on macOS runners for Node.js and on ubuntu-20.04 runners for Rust.

Expected behavior

Integration tests should pass and not time out

Actual behavior

Integration tests that make network calls fail intermittently; the specific failing tests change from run to run. See the logs in the linked pull requests.

Repro steps

For Node.js, use an ubuntu-20.04 or ubuntu-22.04 runner image; for Rust, an ubuntu-22.04 runner image.

Rust network capture here

Node.js 20.04 network capture here

erik-bershel commented 1 year ago

Hey @malandis! May I ask you to provide repro steps both for Node and Rust?

ilia-shipitsin commented 1 year ago

@malandis, the timing looks like the test hit a timeout

image

maybe there's some resource bottleneck (probably on the runner side); can you try increasing the timeout to see whether it passes?

malandis commented 1 year ago

Hey @malandis! May I ask you to provide repro steps both for Node and Rust?

Node: Open a PR in this repo with the runner image set to ubuntu-20.04 or ubuntu-22.04. See this PR for an example. Note that the tests pass when we run locally in Ubuntu, remotely on an Ubuntu VM, or on GH with a macos runner.

Rust: open a pull request in this repo. See this PR for reference. The build-rust job in the CI workflow intermittently fails.

malandis commented 1 year ago

@malandis, the timing looks like the test hit a timeout

image

maybe there's some resource bottleneck (probably on the runner side); can you try increasing the timeout to see whether it passes?

@ilia-shipitsin I increased the test timeout first to 60s (from 30s) and then to 120s. Both had the same error. See here.

Alexey-Ayupov commented 1 year ago

Hello @malandis, sorry for the late response. I see a lot of dropped packets toward the IPv6 DNS server. It looks like your tests are partly trying to resolve DNS names over IPv6.

To test one of my theories, could you please help me? For the test, please run this command:

sed -i 's/::1 ip6-localhost ip6-loopback/::1 localhost ip6-localhost ip6-loopback/g' /etc/hosts

Please don't forget to make a backup of your /etc/hosts file.
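
For reference, a minimal sketch, assuming a stock Ubuntu /etc/hosts layout, that prints the current ::1 mapping and what the resolver returns for localhost, so the effect of the change can be checked before and after:

```python
import socket

# Show the IPv6 loopback entries currently present in /etc/hosts.
with open("/etc/hosts") as hosts:
    for line in hosts:
        if line.strip().startswith("::1"):
            print("hosts entry:", line.strip())

# Show which address families the resolver returns for "localhost".
for family, _, _, _, sockaddr in socket.getaddrinfo("localhost", 443, proto=socket.IPPROTO_TCP):
    print(socket.AddressFamily(family).name, "->", sockaddr[0])
```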

malandis commented 1 year ago

Please don't forget to make a backup of your /etc/hosts file.

@Alexey-Ayupov just to confirm, you want me to run this locally (there would be no need to backup the hosts file on a test runner, right)?

Alexey-Ayupov commented 1 year ago

Hi, you mentioned that the tests on your local machine run successfully. I want to check the behavior of the tests after changing the /etc/hosts file. In GitHub runners, we set up a loopback entry for IPv6.

malandis commented 1 year ago

Hello @malandis, sorry for the late response. I see a lot of dropped packets toward the IPv6 DNS server. It looks like your tests are partly trying to resolve DNS names over IPv6.

To test one of my theories, could you please help me? For the test, please run this command:

sed -i 's/::1 ip6-localhost ip6-loopback/::1 localhost ip6-localhost ip6-loopback/g' /etc/hosts

Please don't forget to make a backup of your /etc/hosts file.

I made this change and ran the tests multiple times. They still pass. I also started a fresh Ubuntu 22.04 VM, made the change there, and the tests pass (on multiple invocations).

Alexey-Ayupov commented 1 year ago

Thanks, I will check other possibilities.

malandis commented 1 year ago

I appreciate the diligence. Thank you!

vpolikarpov-akvelon commented 10 months ago

Hey @malandis. Do you still observe this behavior? I'm trying to reproduce this issue but haven't succeeded yet.

malandis commented 10 months ago

@vpolikarpov-akvelon yes, we still observe this behavior. I see your PR here. Because of our secrets management, you cannot open a PR from a fork. I have opened a PR here.

One important change: you had set the runner image to ubuntu-22.04 for the verify job. It is the test job that will fail on 22.04. I have made that change here.

cprice404 commented 10 months ago

@vpolikarpov-akvelon if it would make it easier for you to test/reproduce, we can give you an auth token that you could set up as a secret on your fork, and then it would be possible to run the tests on your PR. Just let us know if there is a good way to get the token to you and we can send it your way!

vpolikarpov-akvelon commented 10 months ago

Hey @cprice404 @malandis. I have some new information regarding this issue.

  1. Network dump analysis showed that in the failure case the test suite only connects to the control endpoint and never even starts connecting to the cache endpoint. It looks like something goes wrong in the communication with the control endpoint. Unfortunately, that traffic is over TLS, so I can't analyze it.

  2. I tried to execute the tests in Azure on a virtual machine created from the runner image. The issue was there until I assigned a public IP address to the network interface of the machine. With a public IP address everything works fine.

Taking these points together, I would assume that either something blocks requests on the API's side or communication breaks somewhere in the Azure or AWS network. Could you please check the configuration of your infrastructure in AWS (especially in terms of network security) and verify any changes made around the time you started observing this issue. Also, I would like you to examine the logs of your service backend: it would be useful to know what those failures look like from the backend's point of view (there should be something in the logs, since the test suite does connect to and communicate with the control endpoint).
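
One way to narrow this down from the runner, outside the SDK, is a bare TCP/TLS reachability check against both endpoints; the hostnames below are placeholders, not the real endpoints:

```python
import socket
import ssl
import time

# Placeholder hostnames; substitute the real control and cache endpoints from the SDK configuration.
ENDPOINTS = ["control.example.com", "cache.example.com"]

for host in ENDPOINTS:
    start = time.monotonic()
    try:
        # Plain TCP connect followed by a TLS handshake, with a short timeout.
        with socket.create_connection((host, 443), timeout=10) as sock:
            with ssl.create_default_context().wrap_socket(sock, server_hostname=host) as tls:
                print(f"{host}: {tls.version()} handshake in {time.monotonic() - start:.2f}s")
    except OSError as exc:
        print(f"{host}: failed after {time.monotonic() - start:.2f}s ({exc})")
```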

vpolikarpov-akvelon commented 10 months ago

Hey @cprice404 @malandis. Did you manage to find anything that may negatively affect communication with the control endpoint?

cprice404 commented 10 months ago

Hi @vpolikarpov-akvelon thanks so much for looking into this.

We will look at the logs more today and see if anything shows up, but in the past this has appeared to be related to DNS (see https://github.com/actions/runner-images/issues/6981#issuecomment-1415360651). If it were a server-side issue with the control endpoint, then wouldn't I expect the same issue to occur when running the tests on macOS?

verify any changes made around the time you started observing this issue.

For most of our repos, the moment we started observing it is when ubuntu-latest on GitHub switched from 20.04 to 22.04. It may be that the Node.js one is not the best one to use to diagnose, since it is the only one that fails on both 20.04 and 22.04. The Rust one that @malandis linked above may be more representative and smaller in scope.

We can also reproduce it in either of these repos: https://github.com/momentohq/client-sdk-go https://github.com/momentohq/client-sdk-python

I tried to execute the tests in Azure on a virtual machine created from the runner image. The issue was there until I assigned a public IP address to the network interface of the machine. With a public IP address everything works fine.

This is really interesting. Do you know how the DNS setup changes between those two configurations?

vpolikarpov-akvelon commented 10 months ago

I really doubt that it's related to DNS. There is indeed a negative response to the AAAA query in the network trace, but there is also a positive response to the A query for the same name, which means the name was resolved to an IPv4 address successfully.
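
For illustration, this difference is visible directly with getaddrinfo: a name with no AAAA record still resolves over IPv4, which is what the trace shows. The hostname below is a placeholder:

```python
import socket

HOST = "cache.example.com"  # placeholder for the endpoint the tests resolve

for family, label in ((socket.AF_INET, "A / IPv4"), (socket.AF_INET6, "AAAA / IPv6")):
    try:
        results = socket.getaddrinfo(HOST, 443, family=family, proto=socket.IPPROTO_TCP)
        print(label, "->", sorted({r[4][0] for r in results}))
    except socket.gaierror as exc:
        # A negative answer for one family does not prevent the other from succeeding.
        print(label, "-> no result:", exc)
```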

Also, I wouldn't rely on the fact that everything works fine on macOS runners. They live in an environment that differs a lot from the one that hosts the Ubuntu and Windows runners.

cprice404 commented 10 months ago

They live in an environment that differs a lot from the one that hosts the Ubuntu and Windows runners.

This is one of the data points that makes us think that it must be an environmental issue :)

For example, our python SDK tests currently work fine on github on ubuntu 20.04, but fail on 22.04. Would it be helpful for us to create a PR to reproduce that failure again to show another example?

vpolikarpov-akvelon commented 10 months ago

Would it be helpful for us to create a PR to reproduce that failure again to show another example?

Yeah, I think so. We are at a dead end now, so any information may be useful.

cprice404 commented 10 months ago

Done, here's a failed run of the python version:

https://github.com/momentohq/client-sdk-python/actions/runs/5591096534/jobs/10221706048?pr=357

cprice404 commented 10 months ago

This is probably the most representative run of what we frequently see with 22.04 that we do not see with 20.04:

https://github.com/momentohq/client-sdk-python/actions/runs/5591096534/jobs/10222519393

vpolikarpov-akvelon commented 10 months ago

@cprice404

I tried running the tests in the Python repo and caught some strange behavior: when I run all the tests, test_create_list_revoke_signing_keys often fails, but when I run only that one it always passes. What might be the reason for that?

cprice404 commented 10 months ago

I will look into that. I don't have a good guess off the top of my head, but it's possible that one is flaky because of how the test itself is defined. If you comment out that test, are you able to reproduce the other failures from the last link that I posted above?

I would be happy to update that PR to comment out that signing_key test if it would help.

vpolikarpov-akvelon commented 9 months ago

I tried switching the signing_key test off and everything started working flawlessly. It's a bit strange that this test fails when run among the others but always succeeds alone. Anyway, it looks like there is something wrong with it. Could it conflict with other tests somehow? And could that conflict be platform-dependent? What do you think?

vpolikarpov-akvelon commented 9 months ago

I experimented a bit more. Here is an interesting observation: create_list_revoke_signing_keys never fails alone, but it's enough to run it after list_caches_succeeds and list_caches_succeeds_even_if_cred_provider_has_been_printed to get a failure. It doesn't fail on the first run, though; one has to restart the workflow several times to see it failing. And from that moment on, it never succeeds unless it is given some time to "rest". Also, the first failure usually appears as a connection reset, while the later ones appear as the remote host being unreachable. It really does look like something is wrong either with the network or with the backend service, as if the test app is blocked for some reason by some kind of firewall or filter.
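
For what it's worth, that ordering can be forced deterministically; a sketch assuming pytest, with illustrative file paths and test names (pytest runs explicitly listed node IDs in the order they are given):

```python
# run_order_repro.py: replay the failing order described above.
import sys

import pytest

# Illustrative node IDs; point them at the real test modules and function names.
NODE_IDS = [
    "tests/test_caches.py::test_list_caches_succeeds",
    "tests/test_caches.py::test_list_caches_succeeds_even_if_cred_provider_has_been_printed",
    "tests/test_signing_keys.py::test_create_list_revoke_signing_keys",
]

# -x stops at the first failure so the interesting run is easy to spot in CI logs.
sys.exit(pytest.main(["-x", *NODE_IDS]))
```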

I know, if network rules are to blame then you may wonder why it works on Ubuntu 20. Well, it may be related to the external IP addresses of the runners, for example. A crazy assumption: the IP addresses of Ubuntu 22 runners may be trusted less than the IP addresses of Ubuntu 20 runners (by some network node on the way from the GitHub runners' network to your backend network). I doubt that we can really check this somehow. I can only suggest that you verify the network security configuration for your backend once more.

halgari commented 9 months ago

We are experiencing this same issue with dotnet 7 on ubuntu-latest. Locally and in local VMs running Ubuntu Desktop LTS we do not see this issue; however, in GitHub Actions, connections to our HTTP test server (running in-process on 127.0.0.1) hit this issue on almost every run.

vpolikarpov-akvelon commented 9 months ago

Hey @halgari. Could you open a separate issue for that and provide more information and repro steps?

cprice404 commented 9 months ago

@vpolikarpov-akvelon have you tried running this Python reproducer on an Azure node with a public IP address to see if it works in that configuration, as you said it did with the Node.js reproducer?

vpolikarpov-akvelon commented 9 months ago

No, I didn't, and I don't think it would help. If the behavior is the same as for the Node.js tests, we get one more piece of indirect evidence; if not, we won't learn anything new.

I still think it's something with the network and not with the runner images themselves. If this is really the case, then Azure or AWS support may help you, I think.

One more thought: you could resolve the runner's public IP address in your workflow and track connections from those addresses in your backend logs; maybe something interesting will turn up there.
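
A minimal sketch of that suggestion, assuming an external echo service such as api.ipify.org is reachable from the runner (the choice of service is an assumption):

```python
import json
import urllib.request

# Print the runner's public egress IP so it can be grepped for in backend logs later.
with urllib.request.urlopen("https://api.ipify.org?format=json", timeout=10) as resp:
    print("runner public IP:", json.load(resp)["ip"])
```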

vpolikarpov-akvelon commented 9 months ago

Hey @cprice404. Did you have a chance to analyze the backend logs and the network security configuration? I have a strong impression that this is something outside of our area of responsibility that we can't control, as there is a lot of indirect evidence indicating that the problem lies outside of the runner itself.

If you have any other ideas about what we can test within the runners, let me know and I'll be glad to help you with the investigation.

cprice404 commented 9 months ago

@vpolikarpov-akvelon thank you for your response, we deeply appreciate your willingness to continue checking in on this issue.

Our network security settings allow unrestricted access on 0.0.0.0 and we have never experienced this issue with any of our SDKs from outside of github, and we have never had a customer report running into anything like this. I'd be happy to share screenshots of the security group rules if that would be useful.

My teammate commented out those signing key tests in the python reproducer PR because they don't really match the normal pattern that we see on github across all of our programming languages. The tests still failed and now the error looks more like what we typically see:

https://github.com/momentohq/client-sdk-python/actions/runs/5815221224/job/15766468549?pr=357

When we get these Timeout / DEADLINE_EXCEEDED exceptions it basically always means that the request never made it to the server. I combed through the server logs from the time range of this run and I see no evidence that the requests ever got there.
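
For context on the failure shape: with gRPC's Python bindings, a per-call deadline that elapses surfaces as an RpcError with status DEADLINE_EXCEEDED. The self-contained sketch below shows the same 'nothing ever reaches the server' condition via channel_ready_future; the target is a placeholder, not the real control endpoint:

```python
import grpc

# Placeholder target; the real control endpoint comes from the SDK configuration.
channel = grpc.secure_channel("control.example.com:443", grpc.ssl_channel_credentials())

try:
    # If no packets ever reach the server, readiness never happens and this times out,
    # which is the same condition that shows up as DEADLINE_EXCEEDED on real calls.
    grpc.channel_ready_future(channel).result(timeout=60)
    print("channel became ready")
except grpc.FutureTimeoutError:
    print("channel never became ready within 60s")
finally:
    channel.close()
```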

I still don't understand how this could be a security configuration issue because if it was, I would expect all requests to fail from the problematic ubuntu 22.04 runners, as opposed to sporadic/intermittent requests. And I also would be surprised if Azure had configured different IP ranges for the 22.04 runners than the 20.04; have you found any documentation or seen any patterns that would lead you to believe that?

If you ran those python tests on an Azure runner with a public IP address and took a pcap, and we compared that to one from a failed run on a runner without a public IP address, do you think that might offer any clues?

cprice404 commented 9 months ago

Just to clarify re: security group rules - our service is public so we do not have any CIDR range restrictions.

vpolikarpov-akvelon commented 9 months ago

Speaking of network security, I'm thinking of some kind of score-based intrusion prevention system that may sit anywhere on the path between the runners and your service. That would explain the sporadic nature of the failures: sometimes the overall score is high enough for the request to be blocked and sometimes it is not.

And about the logs: could the log messages for previous/adjacent requests contain some useful information?

cprice404 commented 9 months ago

all of the adjacent requests in the logs are successful / behaving as expected.

Are there steps documented somewhere about how to reproduce a gh actions run directly? I feel like the best next steps I can think of here are to do some runs on Azure instances covering the matrix of these dimensions:

If we do runs on Azure with and without public IPs, but just do them directly on the azure instance without the gh actions environment, and they are always successful, and then we do runs with and without the gh actions environment and it passes with public IPs and fails with private IPs, then that would suggest that it is something that has to do with the combo of gh actions + private IPs, would it not?

vpolikarpov-akvelon commented 9 months ago

Hey. Bad news. I've got firm evidence that there is something wrong with the network but not the runners themselves. I tried tunneling requests through a third server and everything started working as expected.

Here is a link to the run: https://github.com/vpolikarpov-akvelon/client-sdk-python/actions/runs/5854627487. My setup is quite absurd (a SOCKS5 tunnel over SSH plus a local Privoxy instance that forwards requests to the tunnel), as I built it in a rough-and-ready manner.

I understand that this doesn't help you resolve your issue. I also understand that it's more important for you than for anyone else, as this issue may affect not only your test pipelines but also the pipelines of your customers. But since there is nothing wrong with the runner images, I don't know how we could help you resolve this.

As a workaround, I can only suggest configuring a proxy as I did, or running the tests on self-hosted runners.
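
A sketch of the proxy workaround at the process level, assuming the client's transport honors the conventional grpc_proxy / https_proxy environment variables and that a local forwarder (such as the Privoxy instance mentioned above, default port 8118) is already running; both are assumptions:

```python
import os
import subprocess

# Local forwarding proxy; the address and port are assumptions about the local setup.
proxy = "http://127.0.0.1:8118"
env = dict(os.environ, https_proxy=proxy, grpc_proxy=proxy)

# Run the test suite with the proxy variables set for that process tree only.
subprocess.run(["pytest", "tests/"], env=env, check=False)
```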

cprice404 commented 9 months ago

thanks for looking into it. So just to make sure I understand, you're saying that you think there's some networking issue somewhere between Azure's private-IP instances and AWS's load balancers that is causing it? And if you proxy through another host so that the networking is going from that host to AWS instead of from Azure to AWS, that avoids the networking configuration issue and resolves the problem?

vpolikarpov-akvelon commented 9 months ago

Yes, this is exactly what I observe. Unfortunately, we can't narrow the search and identify which node causes this issue. Since tunneling requests out of GitHub's infrastructure solves the problem, it seems logical that it's something on GitHub's side, but I wouldn't bet on it, as the issue may only appear where GitHub and AWS meet.

If you want to investigate it, and if you have time for it, you could look more thoroughly at how the proxy server affects the tests. In my test I used SSH tunneling from the runner to a third server, which not only makes the traffic arrive at AWS from a different public IP address, but also makes it look completely different on GitHub's side (SSH instead of TLS). So it only proves that there is nothing wrong with the runner software itself, but says nothing about the GitHub and AWS infrastructure. You could try to rule them out separately in order to identify which is to blame.

Anyway, as the problem lies outside the runners themselves, I'm closing this thread. Feel free to open a new one, or start a discussion in the corresponding section, if you have any other questions. Sorry for not being able to resolve your issue.

cprice404 commented 9 months ago

@vpolikarpov-akvelon can I ask you one more question about this?

Is there something different about how gh actions provision runner instances for ubuntu 20.04 vs. 22.04? If your theory about the networking is correct then it seems like there must be a difference there, right?

vpolikarpov-akvelon commented 9 months ago

You are welcome.

There is no difference in how GitHub provisions Ubuntu runners of different versions, but of course they are grouped into pools that may be 'physically' located differently. Also, please take into account that 22.04 runners currently match the 'latest' alias, so they are used more than 20.04, and that may also affect something.

cprice404 commented 9 months ago

@vpolikarpov-akvelon do your Azure runners all get created in a specific region?

vpolikarpov-akvelon commented 9 months ago

You mean GH runners, don't you? They are not actually running in vanilla Azure and there are pools in different regions for each flavor.

cprice404 commented 9 months ago

yes, I meant GH runners. Just wondering how the region gets chosen.

Also, I set up some Azure instances today, Ubuntu 20.04 and 22.04, with and without public IP. I got the same results we've been seeing on github runners. (22.04 with private IP is the only one that demonstrated the transient failures.)

Since github has a relationship with Azure and you are effectively a customer of their VM service in this case, are you able to open a support ticket with Azure directly to make sure they are aware of this issue?

chkimes commented 9 months ago

I received a link to this issue and, admittedly, I have done very little investigation. However after a brief skim of the behaviors described here, it feels like this is perhaps related to SNAT exhaustion? https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections

This would explain why private vs. public IPs make a difference, why "resting" the tests helps them succeed, and why a SOCKS5 proxy makes the problem go away (connections are made to the local proxy instead of leaving the machine).

Do the tests make a large number of outbound connections in a short period of time?
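
One way to answer that from inside a job is to snapshot the distinct outbound TCP flows that are open; a rough sketch, assuming the psutil package can be installed on the runner:

```python
import psutil

# Each distinct outbound flow holds a SNAT port while it (or its TIME_WAIT state) is alive.
flows = set()
for conn in psutil.net_connections(kind="tcp"):
    if conn.raddr and conn.raddr.ip not in ("127.0.0.1", "::1"):
        flows.add((conn.laddr.port, conn.raddr.ip, conn.raddr.port))

print(f"{len(flows)} distinct outbound TCP flows currently visible")
```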

cprice404 commented 9 months ago

Thanks a lot for that link.

The tests do make a fair number of connections, but it would be on the order of 10s to maybe a hundred. Is there something I should intuit from this table:

https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections#preallocatedports

that would give me an idea what the maximum number of ports available would be?

Also, I'm still a little lost as to why these SNAT limits would impact an Ubuntu 22.04 VM but not an Ubuntu 20.04 VM?

chkimes commented 9 months ago

We wouldn't expect failures to happen until you're in the range of thousands of connections. I don't have an immediate theory for why it works on one version and not the other, but I would not be surprised to learn that connection management behaviors change with different versions of the underlying software (e.g. connection pooling, HTTP library connection handling).
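
For illustration, the kind of client-side difference meant here, assuming the requests library: per-request calls may each open a fresh TCP/TLS connection (each one consuming a SNAT port), while a pooled session reuses a single connection.

```python
import requests

URL = "https://example.com/"  # placeholder endpoint

# Per-request usage: each call may set up and tear down its own TCP/TLS connection.
for _ in range(5):
    requests.get(URL, timeout=10)

# Pooled usage: one Session keeps connections alive and reuses them across calls.
with requests.Session() as session:
    for _ in range(5):
        session.get(URL, timeout=10)
```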

Do we have packet captures of both succeeding and failing runs? That would help immensely in understanding any behavioral differences.

cprice404 commented 9 months ago

Yes, we should have them from the previous incarnation of this ticket, #6981. I will ask my colleague @malandis to link them again.

malandis commented 9 months ago

@chkimes there are some captures on the ticket that @cprice404 mentioned.

Thanks for looking into this. Let me know if you need more data.

chkimes commented 8 months ago

So I don't think this is SNAT exhaustion. The failed capture only shows 37 SYN packets to non-Azure-internal endpoints and the successful capture is still only 196. These are well below where I'd expect to see issues from running out of ports.

I'm having a bit of trouble mapping the packet capture to a particular failure mode. Typically I would approach this by mapping the timestamps of application behaviors in the logs to network behaviors in the packet capture, however the Actions logs attached to these packet captures have long since expired so instead I'm searching for needles in the haystack. Are there any more recent captures that I can correlate with logs that haven't expired yet?

The only thing I'm able to immediately notice here (aside from the obviously shorter pcap length for the failed job) is that the DNS result from systemd-resolved on Ubuntu 22.04 has an IP packet TTL of 1, while on 20.04 the IP packet TTL is 64. While I don't fully understand this, I also don't think it's having an impact, since applications are clearly using the DNS result in later requests.

I'll dig through this capture a little more, but root causing may be difficult without the logs.
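
A sketch of the kind of pcap triage described above (counting outbound SYNs per destination), assuming scapy is installed and using a placeholder capture file name:

```python
from scapy.all import IP, TCP, rdpcap

packets = rdpcap("failed-run.pcap")  # placeholder capture file name

# Count outbound connection attempts (pure SYN, no ACK) per destination IP.
syn_counts = {}
for pkt in packets:
    if IP in pkt and TCP in pkt and pkt[TCP].flags == "S":
        syn_counts[pkt[IP].dst] = syn_counts.get(pkt[IP].dst, 0) + 1

for dst, count in sorted(syn_counts.items(), key=lambda item: -item[1]):
    print(f"{count:4d} SYNs -> {dst}")
```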

chkimes commented 8 months ago

Well I found a needle. However, I don't understand it.

I found some recently failing logs and checked the failing test, which appears to be using a timeout of 60 seconds: https://github.com/momentohq/client-sdk-python/blob/3ca90e624bb031c616ac36dbf6f85e6cd955d1c5/src/momento/internal/aio/_scs_control_client.py#L31

At 16:23:59, the test run is clearly tearing down a bunch of connections, likely aborting them because something else failed and the process is cleaning up:

image

So I decided to check what's happening 60 seconds earlier in hopes of catching something that is starting the behavior that eventually fails.

image

So there are a few things that are very odd here. One is that there is a DNS query to your infrastructure that is not immediately followed by a new connection. Every other DNS query in the capture has an immediate SYN packet (because of course why make the DNS query if you're not trying to initiate a connection?). Examples:

image

image

image

The second oddity is that 11 seconds later, at exactly 16:22:59, we get a seemingly random errant SYN/ACK packet sent to this VM. The VM rightly sends a RST packet because it has no idea what connection the packet refers to. The IP is otherwise entirely unmentioned in the whole packet capture:

image

This is very unexpected. Our VMs do not allow inbound traffic, so receiving a SYN/ACK from an IP/port that we have not previously initiated a connection to is not a normal condition. The timing is also incredibly suspect, since it's exactly 60 seconds before the test process begins cleaning up its other connections. There are a handful of mysteries on my mind at the moment.

  1. Where did this packet come from?
  2. If the application thinks that it sent a request around 16:22:59, where are the packets that map to that request?
  3. What is the difference between 20.04 and 22.04 that would cause this?

I think this behavior is a smoking gun for your issue. The trick now is identifying the trigger. Correlating logs to a packet capture would help cement the above data (since currently I'm intuiting it all from pcaps where the application data is encrypted). I'll look for any notable infrastructure differences between these VM types; however, my expectation is that outside the VM they should be configured largely the same.

cprice404 commented 8 months ago

Hi @chkimes - thanks so much for digging in to this. Your findings are fascinating.

One question I have - do you have any idea what that IP address for the inbound SYN/ACK packet could be? Does it appear elsewhere in the TCP dump? I tried to do a reverse DNS lookup on it and poke around google a bit; it appears to be an Azure IP but that doesn't necessarily clarify much.

As for application logs - we are more than happy to work with you to provide those. As of right now I don't think this test code (or the underlying library code) logs enough to be useful, but we are actively working on putting together a simpler reproducer. If we succeed at that we can easily instrument the reproducer code with good logging.

In the meantime if you would like to run the existing test code directly, I can give you an auth token that would allow you to do so. As you've already observed the code itself is all open source.

One more detail of note: as of late last week, one of my teammates added some new tests, and the failures started to repro on Ubuntu 20.04 as well. He switched the job to macOS and the tests passed.

So it is starting to appear that, while whatever is happening is much more likely to occur on 22.04, it can still occur on 20.04.

We will share more info about our attempts to reproduce as soon as we have any.