Granulate / gprofiler

gProfiler is a system-wide profiler, combining multiple sampling profilers to produce unified visualization of what your CPU is spending time on.
https://profiler.granulate.io
Apache License 2.0
741 stars 54 forks source link

gProfiler partial DNS resolution error #327

Open akuzkohqs opened 2 years ago

akuzkohqs commented 2 years ago

We have installed gProfiler in Docker containers in both, local (Docker engine, run in EC2 container with unrestricted outgoing connectivity) and remote (AWS Fargate with SYS_PTRACE enabled).

We saw no single metric uploaded in gProfiler dashboard yet.

Docker container base: node:13.7.0

Additional commands:

RUN apt update && apt install -y curl git && \
    curl -SL https://github.com/Granulate/gprofiler/releases/latest/download/gprofiler --output /tmp/gprofiler && \
    chmod +x /tmp/gprofiler 

Startup made from Docker entrypoint:

/tmp/gprofiler -cu --token ${GP_PAT} --service-name SERVICE_NAME --disable-pidns-check --perf-mode disabled --no-java --no-python --no-php --no-ruby --nodejs-mode none &

We also tried simplified command without any differences, same errors, etc.:

/tmp/gprofiler -cu --token ${GP_PAT} --service-name SERVICE_NAME --disable-pidns-check --perf-mode disabled

The problem is that while service listed in gProfiler there are massive errors like this:

  | 2022-03-29T16:57:06.890+03:00 | [13:57:06] Successfully uploaded profiling data to the server
  | 2022-03-29T16:57:11.902+03:00 | [13:57:11] Failed sending logs to server
  | 2022-03-29T16:57:11.902+03:00 | Traceback (most recent call last):
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/connectionpool.py", line 449, in _make_request
  | 2022-03-29T16:57:11.902+03:00 | File "<string>", line 3, in raise_from
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/connectionpool.py", line 444, in _make_request
  | 2022-03-29T16:57:11.902+03:00 | File "http/client.py", line 1346, in getresponse
  | 2022-03-29T16:57:11.902+03:00 | File "http/client.py", line 307, in begin
  | 2022-03-29T16:57:11.902+03:00 | File "http/client.py", line 268, in _read_status
  | 2022-03-29T16:57:11.902+03:00 | File "socket.py", line 586, in readinto
  | 2022-03-29T16:57:11.902+03:00 | File "ssl.py", line 971, in recv_into
  | 2022-03-29T16:57:11.902+03:00 | File "ssl.py", line 833, in read
  | 2022-03-29T16:57:11.902+03:00 | File "ssl.py", line 590, in read
  | 2022-03-29T16:57:11.902+03:00 | socket.timeout: The read operation timed out
  | 2022-03-29T16:57:11.902+03:00 | During handling of the above exception, another exception occurred:
  | 2022-03-29T16:57:11.902+03:00 | Traceback (most recent call last):
  | 2022-03-29T16:57:11.902+03:00 | File "requests/adapters.py", line 449, in send
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/connectionpool.py", line 786, in urlopen
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/util/retry.py", line 550, in increment
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/packages/six.py", line 770, in reraise
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/connectionpool.py", line 710, in urlopen
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/connectionpool.py", line 451, in _make_request
  | 2022-03-29T16:57:11.902+03:00 | File "urllib3/connectionpool.py", line 341, in _raise_timeout
  | 2022-03-29T16:57:11.902+03:00 | urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='profiler.granulate.io', port=443): Read timed out. (read timeout=5)
  | 2022-03-29T16:57:11.902+03:00 | During handling of the above exception, another exception occurred:
  | 2022-03-29T16:57:11.902+03:00 | Traceback (most recent call last):
  | 2022-03-29T16:57:11.902+03:00 | File "gprofiler/log.py", line 191, in try_send_log_to_server
  | 2022-03-29T16:57:11.902+03:00 | File "gprofiler/client.py", line 110, in post
  | 2022-03-29T16:57:11.902+03:00 | File "gprofiler/client.py", line 95, in _send_request
  | 2022-03-29T16:57:11.902+03:00 | File "requests/sessions.py", line 542, in request
  | 2022-03-29T16:57:11.902+03:00 | File "requests/sessions.py", line 655, in send
  | 2022-03-29T16:57:11.902+03:00 | File "requests/adapters.py", line 529, in send
  | 2022-03-29T16:57:11.902+03:00 | requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='profiler.granulate.io', port=443): Read timed out. (read timeout=5)

We have tested network connectivity and we are able to open profiler.granulate.io using curl and other tools from inside the Docker container(s). Certificate is valid as well.

We may provide additional details if requested.

Thank you for your attention.

Jongy commented 2 years ago

Hi @akuzkohqs . This doesn't seem like a DNS error, which would yield an error like: Failed to establish a new connection: [Errno -2] Name or service not known'.

In this case it seems like a connection is made, but it is "too slow" so the read timeout expires. Feels like a networking issue.

What is the exact check you've ran with curl? Can you please post the output of curl -sSL -vv profiler.granulate.io.

By the way, per the log you have submitted, some profiling data was sent:

  | 2022-03-29T16:57:06.890+03:00 | [13:57:06] Successfully uploaded profiling data to the server

if it's from this /tmp/gprofiler -cu --token ${GP_PAT} --service-name SERVICE_NAME --disable-pidns-check --perf-mode disabled --no-java --no-python --no-php --no-ruby --nodejs-mode none & run then it might as well be empty - here you have disabled all profilers, so nothing collects any stacks :sweat_smile:

akuzkohqs commented 2 years ago

@Jongy thank you for your attention on this issue. Let me to answer your questions:

  1. I've used simple call curl -v https://profiler.granulate.io || true and it works all the time. Output will be at the bottom.
  2. "Successfully uploaded profiling data to the server" is confusing. It always followed with "Failed sending logs to server" with errors from above.
  3. What about disabled profilers... As said we are executing in AWS Fargate (and in standard Docker container when doing tests locally) so I followed documentation. Then I expanded it with extra --no-LANG and tried all options for --nodejs-mode but it accepts none only due to --perf-mode disabled

So, to conclude. Do we need to expect anything but CPU/Memory utilization sent and visualized in gProfiler dashboard when we have --perf-mode disabled (or --perf-mode none)? What could we do in gProfiler settings/options to configure (if needed) timeouts and/or other things?

curl output:

# curl -sSL -vv https://profiler.granulate.io
* Rebuilt URL to: https://profiler.granulate.io/
*   Trying 35.170.220.67...
* TCP_NODELAY set
* Connected to profiler.granulate.io (35.170.220.67) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=profiler.granulate.io
*  start date: Jan  5 00:00:00 2022 GMT
*  expire date: Feb  3 23:59:59 2023 GMT
*  subjectAltName: host "profiler.granulate.io" matched cert's "profiler.granulate.io"
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x563a305caea0)
> GET / HTTP/1.1
> Host: profiler.granulate.io
> User-Agent: curl/7.52.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 200 
< date: Wed, 30 Mar 2022 15:04:12 GMT
< content-type: text/html
< content-length: 2862
< server: nginx/1.18.0
< last-modified: Tue, 29 Mar 2022 16:51:17 GMT
< vary: Accept-Encoding
< etag: "62433905-b2e"
< cache-control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0
< strict-transport-security: max-age=31536000; includeSubDomains
< accept-ranges: bytes
< 
<!doctype html><html lang="en"><head><meta charset="utf-8"/><link rel="shortcut icon" href="/favicon.ico"/><meta name="viewport" content="minimum-scale=1,initial-scale=1,width=device-width,shrink-to-fit=no"/><meta name="theme-color" content="#000000"/><link rel="manifest" href="/manifest.json"/><title>gProfiler</title><script src="/env-config.js"></script><script>!function(e,t,a,n,g){e[n]=e[n]||[],e[n].push({"gtm.start":(new Date).getTime(),event:"gtm.js"});var m=t.getElementsByTagName(a)[0],r=t.createElement(a);r.async=!0,r.src="https://www.googletagmanager.com/gtm.js?id=GTM-TNXVMCR",m.parentNode.insertBefore(r,m)}(window,document,"script","dataLayer")</script><link href="/static/css/2.ef749903.chunk.css" rel="stylesheet"><link href="/static/css/main.123bc548.chunk.css" rel="stylesheet"></head><body style="background-color:#f0f2f5"><noscript>You need to enable JavaScript to run this app.</noscript><noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-TNXVMCR" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript><div id="root"></div><script src="https://www.google.com/recaptcha/api.js?onload=onloadCallback&render=explicit" async defer="defer"></script><script>!function(e){function r(r){for(var n,l,f=r[0],i=r[1],p=r[2],c=0,s=[];c<f.length;c++)l=f[c],Object.prototype.hasOwnProperty.call(o,l)&&o[l]&&s.push(o[l][0]),o[l]=0;for(n in i)Object.prototype.hasOwnProperty.call(i,n)&&(e[n]=i[n]);for(a&&a(r);s.length;)s.shift()();return u.push.apply(u,p||[]),t()}function t(){for(var e,r=0;r<u.length;r++){for(var t=u[r],n=!0,f=1;f<t.length;f++){var i=t[f];0!==o[i]&&(n=!1)}n&&(u.splice(r--,1),e=l(l.s=t[0]))}return e}var n={},o={1:0},u=[];function l(r){if(n[r])return n[r].exports;var t=n[r]={i:r,l:!1,exports:{}};return e[r].call(t.exports,t,t.exports,l),t.l=!0,t.exports}l.m=e,l.c=n,l.d=function(e,r,t){l.o(e,r)||Object.defineProperty(e,r,{enumerable:!0,get:t})},l.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Ob* Curl_http_done: called premature == 0
* Connection #0 to host profiler.granulate.io left intact
ject.defineProperty(e,"__esModule",{value:!0})},l.t=function(e,r){if(1&r&&(e=l(e)),8&r)return e;if(4&r&&"object"==typeof e&&e&&e.__esModule)return e;var t=Object.create(null);if(l.r(t),Object.defineProperty(t,"default",{enumerable:!0,value:e}),2&r&&"string"!=typeof e)for(var n in e)l.d(t,n,function(r){return e[r]}.bind(null,n));return t},l.n=function(e){var r=e&&e.__esModule?function(){return e.default}:function(){return e};return l.d(r,"a",r),r},l.o=function(e,r){return Object.prototype.hasOwnProperty.call(e,r)},l.p="/";var f=this.webpackJsonpgprofiler=this.webpackJsonpgprofiler||[],i=f.push.bind(f);f.push=r,f=f.slice();for(var p=0;p<f.length;p++)r(f[p]);var a=i;t()}([])</script><script src="/static/js/2.6c3cd4ec.chunk.js"></script><script src="/static/js/main.d26e6baa.chunk.js"></script></body></html>
Jongy commented 2 years ago

Thanks for the details.

"Successfully uploaded profiling data to the server" is confusing. It always followed with "Failed sending logs to server" with errors from above.

These are actually 2 separate steps - and 2 different endpoints: one endpoint for the profiles (which reports success) and another separate endpoint for the logs, which reports a failure. While being unexpected, I do not deem it related to the issue of no profiling data being displayed in the UI (as I said - these are 2 different endpoints)

So, to conclude. Do we need to expect anything but CPU/Memory utilization sent and visualized in gProfiler dashboard when we have --perf-mode disabled (or --perf-mode none)? What could we do in gProfiler settings/options to configure (if needed) timeouts and/or other things?

Yes, definitely - you should be seeing data even with --perf-mode=none, it just means that we don't run perf, but we still run py-spy, async-profiler etc. What I meant in my comment is that if you run --no-perf and --no-java --no-python --no-php --no-ruby, then, well, no profilers are left :sweat_smile: and in that case you will most likely get no data because no profiler actually ran. This should be an error configuration and I will fix the commandline interface to raise if all profilers are disabled.

So the issue that remains is that you're seeing "Successfully uploaded profiling data" but not seeing any data on the server. Just making sure - if running on Fargate with --perf-mode=none, then other profilers for e.g Java/Python must be active, and actually identify processes (otherwise no profiling data is collected, because perf is responsible for collecting data from the entire system).

Could you please share a full log from a profiling session (i.e, run the profiling without -c for a single session)? You can post it here (possibly redacted if you prefer not to share all log lines), or you can also send it to me privately at yonatan dot goldschmidt at granulate dot io. I want to ensure that some profilers have actually been run and correlate that with the "Successfully uploaded data" message.

akuzkohqs commented 2 years ago

@Jongy ...things are changing fast.. Our team gave up with gProfiler ;( I was able to test things before got "Server error: Invalid API key" as they disabled account.

Btw, thank you for implementing fool-proof protection when all profilers are disabled.

What I found is that Java profiler was giving us a problem like that with /dev/kmsg error (it won't work in Fargate):

[06:40:00] Initialized JavaProfiler (frequency: 11hz, duration: 60s)
[06:40:00] Failed to start kernel messages listener. Profilee error monitoring not available. (Do you have permission to read /dev/kmsg?)
Traceback (most recent call last):
  File "gprofiler/kernel_messages.py", line 30, in get_kernel_messages_provider
  File "granulate_utils/linux/devkmsg.py", line 25, in __init__
FileNotFoundError: [Errno 2] No such file or directory: '/dev/kmsg'
[06:40:00] Initialized PythonEbpfProfiler (frequency: 11hz, duration: 60s)
[06:40:00] Python eBPF profiler initialization failed
[06:40:00] Initialized PySpyProfiler (frequency: 11hz, duration: 60s)
[06:40:00] Initialized RbSpyProfiler (frequency: 11hz, duration: 60s)
[06:40:00] Could not find a Docker daemon or CRI-compatible daemon, profiling data will not include the container names. If you do have a containers runtime and it's not supported, please open a new issue here: https://github.com/Granulate/gprofiler/issues/new
[06:40:00] gProfiler initialized and ready to start profiling
[06:40:00] Failed to enable proc_events listener for exited Java processes
Traceback (most recent call last):
  File "granulate_utils/linux/proc_events.py", line 211, in start
PermissionError: [Errno 1] Operation not permitted

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "granulate_utils/linux/ns.py", line 225, in _switch_and_run
  File "granulate_utils/linux/proc_events.py", line 249, in _start_listener
  File "granulate_utils/linux/proc_events.py", line 216, in start
PermissionError: This process doesn't have permissions to bind/connect to the process events connector

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "gprofiler/profilers/java.py", line 848, in start
  File "granulate_utils/linux/proc_events.py", line 261, in wrapper
  File "granulate_utils/linux/ns.py", line 237, in run_in_ns
Exception: run_in_ns execution failed
[06:40:00] Successfully uploaded profiling data to the server
[06:40:00] Stopping ...

Then I've tested with "--disable-pidns-check --perf-mode none --no-java"

[06:40:15] Initialized PythonEbpfProfiler (frequency: 11hz, duration: 60s)
[06:40:15] Python eBPF profiler initialization failed
[06:40:15] Initialized PySpyProfiler (frequency: 11hz, duration: 60s)
[06:40:15] Initialized RbSpyProfiler (frequency: 11hz, duration: 60s)
[06:40:15] Could not find a Docker daemon or CRI-compatible daemon, profiling data will not include the container names. If you do have a containers runtime and it's not supported, please open a new issue here: https://github.com/Granulate/gprofiler/issues/new
[06:40:15] gProfiler initialized and ready to start profiling
[06:40:15] Successfully uploaded profiling data to the server

As you may see no errors. But at the same time still zero samples as well. I was waiting for hour or so.

Hope it helps you to debug it in the future. I highlight that we do have Node.js process in same Docker container and this is what we wished to profile initially.

Let me know if I may help you further while I haven't gProfiler account anymore ;(

Jongy commented 2 years ago

I'm sorry to hear that @akuzkohqs :(

About the /dev/kmsg & proc_events error - these are again false positives, sorry for the confusion (and I will make sure it is addressed) - these are two optional features and the lack of them does not prevent gProfiler from, well, doing its profiling. I will make sure to amend the log messages such that they clarify that profiling is not harmed even if they are present.

Judging by your output logs, I think that my explanation here:

So the issue that remains is that you're seeing "Successfully uploaded profiling data" but not seeing any data on the server. Just making sure - if running on Fargate with --perf-mode=none, then other profilers for e.g Java/Python must be active, and actually identify processes (otherwise no profiling data is collected, because perf is responsible for collecting data from the entire system).

is what happens. There are no profilers actually running (because perf is disabled, and the different runtime profilers do not identify target processes). So gProfiler really uploads empty samples. Which is yet another error condition that we should at least log on, noted this as well.

If it's NodeJS that you want to profile over Fargate - that is sadly not supported yet (as Node is profiled with perf, and perf is not available on Fargate)

You can actually continue helping me debug even with the account gone, if you'd like :smiley_cat: