getsentry / relay

Sentry event forwarding and ingestion service.
https://docs.sentry.io/product/relay/

Upstream disconnects after 1 min. Is this expected? #877

Closed · mvaled closed this issue 1 year ago

mvaled commented 3 years ago

I've had an on-premises Sentry version 9 for several years. Two weeks ago I found the time to update to Sentry 20. I took a look at your onpremise configuration and figured out a way to deploy it on our server.

I've noticed Relay disconnects from the upstream web server every minute:

2020-12-07T21:57:07Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-07T21:57:07Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: Connection reset by peer (os error 104)
2020-12-07T21:58:07Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-07T21:58:07Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: Connection reset by peer (os error 104)
2020-12-07T21:59:07Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-07T21:59:07Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: Connection reset by peer (os error 104)

I think this is caused by a Keep-Alive header somewhere. Can I increase it? Is there a more efficient protocol to connect Relay to the upstream web server?

untitaker commented 3 years ago

Can you try adding

http:
  _client: reqwest

to your config?
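
For reference, this would sit at the top level of your .relay/config.yml, next to the relay: section. A minimal sketch (the relay: values below are placeholders taken from your logs, not necessarily your exact setup):

relay:
  upstream: http://web.sentry.svc:9190/
  host: 0.0.0.0
  port: 3000
http:
  _client: reqwest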

mvaled commented 3 years ago

@untitaker

Thanks for your response. After that change, things took a turn for the worse:

2020-12-07T22:15:14Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-07T22:15:14Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2020-12-07T22:15:31Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-07T22:15:31Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
untitaker commented 3 years ago

Cool, that's worse. Do you have the capability to inspect the response body from Sentry to Relay? I wonder why the request is invalid.

untitaker commented 3 years ago

Context: We intend to switch to reqwest as the HTTP client (the option to switch back and forth is temporary), so I think it's wasted time to investigate bugs in the old client.

mvaled commented 3 years ago

Do you have the capability to inspect the response body from Sentry to Relay?

Probably, but right now I have no clue. I would have to look at the code to see if I could trap the request/response or set DEBUG=True in the web server: would that help?

Changing Relay's logging level to DEBUG didn't help much, though.
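
For the record, this is roughly what I set in .relay/config.yml to bump the log level (assuming logging.level is the right knob for this):

logging:
  level: debug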

mvaled commented 3 years ago

I'm looking at the code of the upstream actor, and I noticed I've never seen this info log:

                log::info!("relay successfully registered with upstream");
                slf.auth_state = AuthState::Registered;
                slf.auth_backoff.reset();

But I've received events from clients, so... what's happening here? Am I missing something?

mvaled commented 3 years ago

Well, I'm not getting events from clients after all. I'm triggering events in some of the applications and I don't see them appear in Sentry :*(.

Still, the logs just show:

2020-12-08T00:17:19Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-08T00:17:19Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: Connection reset by peer (os error 104)
2020-12-08T00:17:43Z [relay_server::actors::events] ERROR: error processing event: event exceeded its configured lifetime
2020-12-08T00:17:43Z [relay_server::actors::events] ERROR: error processing event: event exceeded its configured lifetime
mvaled commented 3 years ago

Hmm... I ran docker pull for all the images: snuba, relay, sentry and symbolicator. Now I'm getting these logs:

2020-12-08T01:33:42Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-08T01:33:42Z [relay_server::actors::upstream] INFO: relay successfully registered with upstream
2020-12-08T01:33:42Z [relay_server::actors::project_upstream] ERROR: error fetching project states: could not send request to upstream
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: IO error: Connection reset by peer (os error 104)
  caused by: Connection reset by peer (os error 104)
2020-12-08T01:33:43Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
  caused by: failed to fetch project state from upstream
2020-12-08T01:33:43Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
  caused by: failed to fetch project state from upstream
2020-12-08T01:33:43Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
  caused by: failed to fetch project state from upstream
2020-12-08T01:33:43Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
  caused by: failed to fetch project state from upstream
2020-12-08T01:33:43Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
  caused by: failed to fetch project state from upstream
2020-12-08T01:33:43Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
  caused by: failed to fetch project state from upstream
mvaled commented 3 years ago

I tried once more to use reqwest, but it failed again with 400 Bad Request. I wasn't able to capture the request to see why Sentry is complaining about it. And now I'm back to the cycle of ERROR: authentication encountered error...

I'll report back tomorrow if I find anything.

Best regards.

untitaker commented 3 years ago

@mvaled I have opened #878 for the 400 bug, let me know if that fixes the issue

mvaled commented 3 years ago

@untitaker

As commented there, the issue still happens. As soon as I switch to reqwest, I get the 400 error. This is the full log for the first few seconds after restarting Relay with reqwest:

2020-12-17T12:31:43Z [relay::setup] INFO: launching relay from config folder /work/.relay
2020-12-17T12:31:43Z [relay::setup] INFO:   relay mode: managed
2020-12-17T12:31:43Z [relay::setup] INFO:   relay id: <redacted>
2020-12-17T12:31:43Z [relay::setup] INFO:   public key: <redacted>
2020-12-17T12:31:43Z [relay::setup] INFO:   log level: INFO
2020-12-17T12:31:43Z [relay_server::actors::outcome::processing] INFO: OutcomeProducer started.
2020-12-17T12:31:43Z [relay_server::actors::events] INFO: starting 8 event processing workers
2020-12-17T12:31:43Z [relay_server::actors::upstream] INFO: upstream relay started
2020-12-17T12:31:43Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-17T12:31:43Z [relay_server::actors::store] INFO: store forwarder started
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::service] INFO: spawning http server
2020-12-17T12:31:43Z [relay_server::actors::project_redis] INFO: redis project cache started
2020-12-17T12:31:43Z [relay_server::service] INFO:   listening on: http://0.0.0.0:3000/
2020-12-17T12:31:43Z [actix_net::server::server] INFO: Starting 8 workers
2020-12-17T12:31:43Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2020-12-17T12:31:43Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-17T12:31:43Z [actix_net::server::server] INFO: Starting server on 0.0.0.0:3000
2020-12-17T12:31:43Z [relay_server::actors::controller] INFO: relay server starting
2020-12-17T12:31:43Z [relay_server::actors::connector] INFO: metered connector started
2020-12-17T12:31:43Z [relay_server::actors::events] INFO: event manager started
2020-12-17T12:31:43Z [relay_server::actors::project_local] INFO: project local cache started
2020-12-17T12:31:43Z [relay_server::actors::project_upstream] INFO: project upstream cache started
2020-12-17T12:31:43Z [relay_server::actors::project_cache] INFO: project cache started
2020-12-17T12:31:43Z [relay_server::actors::relays] INFO: key cache started
2020-12-17T12:31:43Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2020-12-17T12:31:44Z [relay_server::actors::upstream] INFO: registering with upstream (http://web.sentry.svc:9190/)
2020-12-17T12:31:44Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
untitaker commented 3 years ago

Fair enough; we merged this because it fixed the issue for another customer. I'm not really sure what to try next.

Probably, but right now I have no clue. I would have to look at the code to see if I could trap the request/response or set DEBUG=True in the web server: would that help?

Assuming you're sure you're running the correct commit, let's try request capturing: I was thinking more of using Wireshark, assuming you have unencrypted traffic.

mvaled commented 3 years ago

Hi @untitaker,

I'm rebuilding the image from master:

$ git log -1 --oneline
33f0dfbb feat: Enable use of multiple value types per field (#882)

$ docker build . -t getsentry/relay:nightly

So, I'll be running relay with the right commit.
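
For completeness, I then start the rebuilt image roughly like this (simplified; the real deployment mounts more than the config folder and has its own networking setup):

$ docker run --rm -v $(pwd)/.relay:/work/.relay -p 3000:3000 getsentry/relay:nightly run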

As for using Wireshark, I'm not really sure if I'm allowed to. Let's find out.

mvaled commented 3 years ago

Unfortunately, using Wireshark is not possible. I don't have physical access to the server, and X11 forwarding via SSH is not allowed :(

mvaled commented 3 years ago

I'm going to use tcpdump (or something like that) on the server and import the capture into Wireshark locally.
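
Probably something along these lines on the server, then opening the capture in Wireshark on my machine (assuming the upstream traffic is the plain HTTP on port 9190 shown in the logs above):

$ tcpdump -i any -s 0 -w relay-upstream.pcap 'tcp port 9190'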

mvaled commented 3 years ago

@untitaker

Using tcpdump I was able to sniff the traffic going to the upstream: once with actix, and once with reqwest. I think I have spotted a difference:

[Screenshot from 2020-12-18 11-31-30: side-by-side captures of the two requests]

On the left is the actix trace, on the right the reqwest one. The JSON payload using reqwest is an array (seemingly empty) instead of an object.

mvaled commented 3 years ago

Looking again at the traces, I noticed my previous comment was about a response coming from the server, not the POST sent by the relay. So I looked again.

In the good trace, I observe the following:

Relay sends a POST to upstream with a JSON payload {relay_id: ..., public_key: ..., version: ...}. There are two custom HTTP headers: x-sentry-relay-id: bcfd9d5a-... which matches the relay_id in the payload; and x-sentry-relay-signature, which I assume is a signature of the HTTP request with headers included. Other headers are:

  content-type: application/json
  accept-encoding: gzip, deflate
  user-agent: actix-web/0.7.19
  content-length: 130
  date: Fri, 18 Dec 2020 16:11:52 GMT

The upstream returns 200, with a JSON payload {relay_id: ..., token: ...}

In the "bad" trace:

A similar POST is done. The JSON payload is the same as before. But some HTTP headers are different:

content-type: application/json
content-encoding: identity
accept: */*
accept-encoding: gzip
transfer-encoding: chunked

I hope this may provide some clues about what's happening.

jan-auer commented 3 years ago

Thank you, this in fact points out the problem: transfer-encoding: chunked. The sentry web container is not able to handle this, and we should not be creating such a request. We'll submit a bugfix for this, thank you for uncovering it. (Edit: https://github.com/getsentry/relay/pull/890).
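
To illustrate the mechanism (a standalone sketch, not Relay's actual code; the endpoint path and payload are placeholders): with reqwest, a body of known size is sent with a Content-Length header, while a streamed body has no known size and goes out with Transfer-Encoding: chunked, which is the variant your web container rejected with 400.

use reqwest::Body; // Body::wrap_stream requires reqwest's "stream" feature
use futures_util::stream;

#[tokio::main]
async fn main() -> reqwest::Result<()> {
    let client = reqwest::Client::new();
    let json = br#"{"relay_id":"...","public_key":"...","version":"..."}"#.to_vec();

    // Buffered body: the size is known up front, so reqwest sets Content-Length.
    let _ok = client
        .post("http://web.sentry.svc:9190/api/0/relays/register/challenge/")
        .header("content-type", "application/json")
        .body(json.clone())
        .send()
        .await?;

    // Streamed body: no known size, so the request goes out with
    // Transfer-Encoding: chunked, the variant the upstream rejects here.
    let chunks = stream::iter(vec![Ok::<_, std::io::Error>(json)]);
    let _bad = client
        .post("http://web.sentry.svc:9190/api/0/relays/register/challenge/")
        .header("content-type", "application/json")
        .body(Body::wrap_stream(chunks))
        .send()
        .await?;

    Ok(())
}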

The configuration option you're setting, _client: reqwest, is unstable and still being developed; I'm curious what @untitaker wanted to test with it. In the meantime, I would suggest switching back to the regular client.

As this is likely an issue with the way the containers are being run or how networking is set up, I would propose moving this issue to getsentry/onpremise. Any objections, @BYK?

jan-auer commented 1 year ago

Closing since both self-hosted and Relay have changed substantially. If a similar bug is still occurring, please re-submit to the self-hosted repository. Thanks!