fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Cannot configure Stackdriver output plugin #761

Closed theFroh closed 4 years ago

theFroh commented 6 years ago

Bug Report

Describe the bug: I have followed the configuration guide for Stackdriver in the manual, but have had no success in establishing a connection to Stackdriver.

To Reproduce

  1. Install fluent-bit on an Ubuntu 16.04 LTS box
  2. Create a service account following Google's instructions and copy the JSON key into /etc/google/auth/
  3. Modify /etc/td-agent-bit/td-agent-bit.conf to include:
    [OUTPUT]
        Name  stackdriver
        Match *
        google_service_credentials /etc/google/auth/________.json
  4. Restart the agent to reload the configuration via systemctl restart td-agent-bit.service
  5. Note that the authorisation phase of connecting to Stackdriver fails via systemctl status td-agent-bit.service:
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [ info] [engine] started (pid=16981)
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [error] [oauth2] could not get an upstream connection
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [error] [out_stackdriver] error retrieving oauth2 access token
    Sep 11 02:24:10 hostname td-agent-bit[16981]: [2018/09/11 02:24:10] [ warn] [out_stackdriver] token retrieval failed

Expected behavior: I expected authentication to succeed against Stackdriver.

Your Environment

Additional context: I'm trying to use fluent-bit to collect and send server stats from a VPS we have that is not part of our Google Cloud cluster.

edsiper commented 6 years ago

Hi @theFroh,

looking at the error I see the following

...[2018/09/11 02:24:10] [error] [oauth2] could not get an upstream connection

That means the plugin could not establish a network connection with Google services. Please validate on your end that your system can reach the following HTTPS endpoints:

theFroh commented 6 years ago

Hey @edsiper,

The machine definitely has outbound access, and in particular, those two end-points are definitely accessible from the machine:

$ nmap -p 443 logging.googleapis.com www.googleapis.com

Starting Nmap 7.01 ( https://nmap.org ) at 2018-09-12 05:37 UTC
Nmap scan report for logging.googleapis.com (172.217.25.170)
Host is up (0.0018s latency).
Other addresses for logging.googleapis.com (not scanned): 2404:6800:4006:803::200a 172.217.167.74 172.217.167.106 216.58.196.138 216.58.199.74 216.58.200.106 216.58.203.106 216.58.220.106 172.217.25.138
rDNS record for 172.217.25.170: sin01s16-in-f10.1e100.net
PORT    STATE SERVICE
443/tcp open  https

Nmap scan report for www.googleapis.com (216.58.203.106)
Host is up (0.0017s latency).
Other addresses for www.googleapis.com (not scanned): 2404:6800:4006:803::200a 216.58.220.138 172.217.25.138 172.217.167.74 172.217.167.106 216.58.196.138 216.58.199.42 216.58.199.74 216.58.200.106
rDNS record for 216.58.203.106: syd09s15-in-f10.1e100.net
PORT    STATE SERVICE
443/tcp open  https

Cheers for assisting!
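Note that the nmap run above scanned a single IPv4 address per host and listed the IPv6 address under "not scanned", so it does not show whether every address the resolver returns is reachable. A minimal Python sketch of a per-address check follows; the helper names `group_by_family` and `probe` are made up for illustration and are not part of Fluent Bit or nmap.

```python
import socket
from collections import defaultdict


def group_by_family(addrinfos):
    """Group socket.getaddrinfo() results into {"IPv4": [...], "IPv6": [...]}."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    grouped = defaultdict(list)
    for family, _type, _proto, _canon, sockaddr in addrinfos:
        label = names.get(family, str(family))
        if sockaddr[0] not in grouped[label]:
            grouped[label].append(sockaddr[0])
    return dict(grouped)


def probe(host, port=443, timeout=3.0):
    """Attempt a TCP connect to every resolved address; return {address: status}."""
    results = {}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except OSError as exc:
        # Name resolution itself failed.
        return {host: f"resolve failed: {exc}"}
    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results[sockaddr[0]] = "ok"
        except OSError as exc:
            # Keep the OS error text (for example "Invalid argument").
            results[sockaddr[0]] = f"failed: {exc}"
    return results
```

Running `probe("www.googleapis.com")` and `probe("logging.googleapis.com")` on the affected box would show whether only one address family fails.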

edsiper commented 6 years ago

Would you please enable trace-level messages with 'Log_Level trace' (in the [SERVICE] section) and share the output?

theFroh commented 6 years ago

No worries, that only really adds a JWT signature printout, though.

Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [ info] [engine] started (pid=1810)
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [debug] [out_stackdriver] JWT signature:
Sep 17 01:16:52 hostname td-agent-bit[1810]: xxx.xxx.xxx
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [error] [oauth2] could not get an upstream connection
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [error] [out_stackdriver] error retrieving oauth2 access token
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [ warn] [out_stackdriver] token retrieval failed
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [debug] [router] match rule cpu.0:stdout.0
Sep 17 01:16:52 hostname td-agent-bit[1810]: [2018/09/17 01:16:52] [debug] [router] match rule cpu.0:stackdriver.0

The JWT signature has a payload containing (with our correct account name removed):

{
  "iss": "<STATS SERVICE ACCOUNT>@<PROJECT NAME>.iam.gserviceaccount.com",
  "scope": "https://www.googleapis.com/auth/logging.write",
  "aud": "https://www.googleapis.com/oauth2/v4/token",
  "exp": 1537150012,
  "iat": 1537147012
}

And header:

{
  "alg": "RS256",
  "typ": "JWT"
}

I can't check if the JWT itself is valid as I've not got the secret or public key to verify with.
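For anyone wanting to inspect the header and payload the same way, here is a minimal sketch that decodes the first two segments of a JWT without verifying the signature. It only needs the Python standard library; the function names are illustrative, and nothing here validates the token.

```python
import base64
import json


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def decode_jwt_unverified(token):
    """Return (header, payload) as dicts. The signature is NOT checked."""
    header_b64, payload_b64, _signature = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload
```

Passing the `xxx.xxx.xxx` string from the `JWT signature:` log line to `decode_jwt_unverified` should give the two JSON objects shown above, which is enough to check `iss`, `scope`, `aud` and the `exp`/`iat` window.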

edsiper commented 6 years ago

I will try to replicate the problem on a 16.04 box. I tested again on my 18.04 box and it works fine.

edsiper commented 6 years ago

No issues here. If you generate a new token file, does it work?

theFroh commented 6 years ago

What is providing your 16.04 testing box? Mine is just a standard, run of the mill VPS; not provided by AWS or the like.

To generate a new token, I've followed the following steps from Google as they seem the most applicable:

  1. Checked "Authorizing an Agent" -- which indicates that I should create a Service Account.
  2. Followed here and created a new Service Account with both Logging > Logs Writer and Monitoring > Monitoring Metric Writer roles.
  3. Used the default JSON private key export option to generate a key file.
  4. Moved this key file onto the server in question, dropped it into /etc/google/auth/ and then updated /etc/td-agent-bit/td-agent-bit.conf so that google_service_credentials is set correctly.
  5. systemctl restart td-agent-bit and systemctl status td-agent-bit

This reports the same [error] [oauth2] could not get an upstream connection

Am I missing any steps here, or misinterpreting any of the documentation, whether on Fluent Bit's or Google's end?

EDIT: I have also just nabbed the JWT signature from the logs again; it is definitely referencing the correct account in there.

stevenarvar commented 6 years ago

I deployed fluentbit 0.14 in a K8S cluster.

The important config is the env variable:

QA >> kubectl exec fluent-bit-77zr7 -n kube-system env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=fluent-bit-77zr7
GOOGLE_SERVICE_CREDENTIALS=/gcp/stackdriver-service-account.json

From the fluent-bit-ds.yaml file:

    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:0.14.5
        imagePullPolicy: Always
        ports:
          - containerPort: 2020
        env:
        - name: GOOGLE_SERVICE_CREDENTIALS
          value: /gcp/stackdriver-service-account.json
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: ssa-volume
          mountPath: /gcp
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
.
.
.
.
      volumes:
      - name: ssa-volume
        secret:
          secretName: stackdriver-service-account

The above config needs a secret created like so:

kubectl create secret generic --namespace=kube-system stackdriver-service-account --from-file=./stackdriver-service-account.json

I mostly followed the instructions from here: https://docs.fluentbit.io/manual/installation/kubernetes

I swapped the elasticsearch OUTPUT with stackdriver. But I also tried the simple configmap suggested here: https://docs.fluentbit.io/manual/output/stackdriver

Got the StackDriver authentication working I believe:

QA >> kubectl logs -n kube-system fluent-bit-77zr7
Fluent-Bit v0.14.5
Copyright (C) Treasure Data

[2018/10/30 19:28:51] [ info] [engine] started (pid=1)
[2018/10/30 19:28:51] [ info] [oauth2] HTTP Status=200
[2018/10/30 19:28:51] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved

Problem is I don't see logs in my stackdriver project.

The final configmap I use is:

QA >> cat fluent-bit-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
  labels:
    k8s-app: fluent-bit
data:
  fluent-bit.conf: |
    [INPUT]
        Name  cpu
        Tag   cpu

    [OUTPUT]
        Name        stackdriver
        Match       *

I did not set the env variables such as SERVICE_ACCOUNT_EMAIL & SERVICE_ACCOUNT_SECRET because I already have GOOGLE_SERVICE_CREDENTIALS setup. I did not set resource to global thinking this is already the default.

Are there any other logs I can get to dig in more? I don't know what else to try at this point.

varun-da commented 5 years ago

@stevenarvar look under Global, the first filter. Not under the service account.

varun-da commented 5 years ago

@theFroh this is definitely an issue with not being able to reach the Google API servers from the box. Please check your connectivity from the box to those services. I was getting the same error, and once I allowed the traffic through, it worked. I do get a few errors when the pod starts, but it works afterwards. The reason for the initial connection failure in my env is that I am running istio, and those pods have to init before the traffic is routed correctly. I have tested with v0.14.9 and v1.0.1.

I had to enable traffic to the following urls:

logging.googleapis.com
www.googleapis.com

logs:

Fluent-Bit v0.14.9
Copyright (C) Treasure Data

[2019/01/07 23:24:09] [ info] [engine] started (pid=1)
[2019/01/07 23:24:09] [error] [oauth2] could not get an upstream connection
[2019/01/07 23:24:09] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/01/07 23:24:09] [ warn] [out_stackdriver] token retrieval failed
.
.
.
[2019/01/07 23:24:10] [error] [io] TCP connection failed: logging.googleapis.com:443 (Connection refused)
.
.
.
[2019/01/07 23:24:12] [ info] [oauth2] HTTP Status=200
[2019/01/07 23:24:12] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved

Fluent Bit v1.1.0
Copyright (C) Treasure Data

[2019/01/08 16:41:39] [ info] [storage] initializing...
[2019/01/08 16:41:39] [ info] [storage] in-memory
[2019/01/08 16:41:39] [ info] [storage] normal synchronization mode, checksum disabled
[2019/01/08 16:41:39] [ info] [engine] started (pid=1)
[2019/01/08 16:41:39] [error] [oauth2] could not get an upstream connection
[2019/01/08 16:41:39] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/01/08 16:41:39] [ warn] [out_stackdriver] token retrieval failed
.
.
.
[2019/01/08 16:41:40] [error] [io] TCP connection failed: logging.googleapis.com:443 (Connection refused)
.
.
.
[2019/01/08 16:41:43] [ info] [oauth2] HTTP Status=200
[2019/01/08 16:41:43] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved

edsiper commented 5 years ago

Is there any extra information that we could add to the documentation? Or is it good to close the ticket?

varun-da commented 5 years ago

@edsiper the two domains should be added to the docs. And the logging should print the full URL to which access was denied or at which the request failed, for example: made a call to https://www.googleapis.com/oauth2/token to get the token and failed, connection refused (or, in case of an HTTP error, received HTTP: 404, etc.). This way it is clear from the logs what is happening.

theFroh commented 5 years ago

@varun-da Just in response to your own reply before, definitely understand that it is a likely cause, but the first thing we checked off in this issue was connectivity from the box to those two addresses. I can confirm I still have connectivity.

I'm still hitting the issue, though:

Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [ info] [engine] started (pid=15149)
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [out_stackdriver] JWT signature:
Jan 09 09:00:15 hostname td-agent-bit[15149]: removed
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [error] [oauth2] could not get an upstream connection
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [error] [out_stackdriver] error retrieving oauth2 access token
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [ warn] [out_stackdriver] token retrieval failed
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [router] match rule cpu.0:stdout.0
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [router] match rule cpu.0:stackdriver.0
Jan 09 09:00:15 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
Jan 09 09:00:16 hostname td-agent-bit[15149]: [2019/01/09 09:00:15] [debug] [input cpu.0] [mem buf] size = 317

Cheers for the assistance!

varun-da commented 5 years ago

@theFroh the next step I would take is making a verbose curl call from that box to the googleapis.com server, using the JWT token, to get the oauth2 token. Perhaps @edsiper can point to the documentation for doing this.

I think I found it: https://developers.google.com/identity/protocols/OAuth2ServiceAccount

Example from the page. I added the -v flag, and you would have to replace the JWT token with the one generated by the fluent-bit instance on that machine:

curl -v -d 'grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion=<JWT token from fluent-bit instance>' https://www.googleapis.com/oauth2/v4/token

curl -v -d 'grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiI3NjEzMjY3OTgwNjktcjVtbGpsbG4xcmQ0bHJiaGc3NWVmZ2lncDM2bTc4ajVAZGV2ZWxvcGVyLmdzZXJ2aWNlYWNjb3VudC5jb20iLCJzY29wZSI6Imh0dHBzOi8vd3d3Lmdvb2dsZWFwaXMuY29tL2F1dGgvcHJlZGljdGlvbiIsImF1ZCI6Imh0dHBzOi8vYWNjb3VudHMuZ29vZ2xlLmNvbS9vL29hdXRoMi90b2tlbiIsImV4cCI6MTMyODU3MzM4MSwiaWF0IjoxMzI4NTY5NzgxfQ.RZVpzWygMLuL-n3GwjW1_yhQhrqDacyvaXkuf8HcJl8EtXYjGjMaW5oiM5cgAaIorrqgYlp4DPF_GuncFqg9uDZrx7pMmCZ_yHfxhSCXru3gbXrZvAIicNQZMFxrEEn4REVuq7DjkTMyCMGCY1dpMa8aWfTQFt3Eh7smLchaZsU' https://www.googleapis.com/oauth2/v4/token

This would definitely help in debugging this further.
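The same request can be sketched in Python, which makes the form encoding explicit. This is a rough illustration using only the standard library, not what Fluent Bit does internally; `build_token_request_body` and `request_token` are hypothetical helper names, and the URL and grant type are taken from the curl command above.

```python
import urllib.request
from urllib.parse import urlencode

TOKEN_URL = "https://www.googleapis.com/oauth2/v4/token"
GRANT_TYPE = "urn:ietf:params:oauth:grant-type:jwt-bearer"


def build_token_request_body(jwt_assertion):
    """Form-encode the grant type and the signed JWT, as curl -d sends them."""
    return urlencode({"grant_type": GRANT_TYPE, "assertion": jwt_assertion})


def request_token(jwt_assertion, timeout=10.0):
    """POST the assertion to the token endpoint; return (status, body text).

    urllib raises HTTPError for 4xx/5xx responses and URLError for
    connection-level failures, so the two cases are easy to tell apart.
    """
    body = build_token_request_body(jwt_assertion).encode("ascii")
    req = urllib.request.Request(TOKEN_URL, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Only `request_token` touches the network; the JWT to pass in is the one printed after `JWT signature:` in the fluent-bit log.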

theFroh commented 5 years ago

@varun-da Ah, that's definitely a great way to test here.

Running it myself with the token as reported in the logs yields a success in my books:

*   Trying 2404:6800:4006:802::200a...
* Connected to www.googleapis.com (2404:6800:4006:802::200a) port 443 (#0)
* found 148 certificates in /etc/ssl/certs/ca-certificates.crt
* found 596 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_ECDSA_AES_128_GCM_SHA256
*    server certificate verification OK
*    server certificate status verification SKIPPED
*    common name: *.googleapis.com (matched)
*    server certificate expiration date OK
*    server certificate activation date OK
*    certificate public key: EC
*    certificate version: #3
*    subject: C=US,ST=California,L=Mountain View,O=Google LLC,CN=*.googleapis.com
*    start date: Wed, 19 Dec 2018 08:17:00 GMT
*    expire date: Wed, 13 Mar 2019 08:17:00 GMT
*    issuer: C=US,O=Google Trust Services,CN=Google Internet Authority G3
*    compression: NULL
* ALPN, server accepted to use http/1.1
> POST /oauth2/v4/token HTTP/1.1
> Host: www.googleapis.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 747
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 747 out of 747 bytes
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=utf-8
< Vary: X-Origin
< Vary: Referer
< Date: Fri, 11 Jan 2019 01:56:36 GMT
< Server: ESF
< Cache-Control: private
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
< Alt-Svc: quic=":443"; ma=2592000; v="44,43,39,35"
< Accept-Ranges: none
< Vary: Origin,Accept-Encoding
< Transfer-Encoding: chunked
< 
{
  "access_token": "<access token omitted>",
  "expires_in": 3600,
  "token_type": "Bearer"
* Connection #0 to host www.googleapis.com left intact
}

Which doesn't really clear anything up unfortunately. I wonder how Fluentbit's networking differs.

sudharsh commented 5 years ago

+1, I am hit by this too. I get a 200 when I do the curl with the JWT token copied from the logs, and the same oauth error from fluentbit logs.

jakeswenson commented 5 years ago

I'm getting the exact same thing:

....
 Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* We are completely uploaded and fine
< HTTP/2 200
< content-type: application/json; charset=utf-8
< vary: X-Origin
< vary: Referer
< vary: Origin,Accept-Encoding
< date: Mon, 25 Mar 2019 19:21:31 GMT
< server: ESF
< cache-control: private
< x-xss-protection: 1; mode=block
< x-frame-options: SAMEORIGIN
< x-content-type-options: nosniff
< alt-svc: quic=":443"; ma=2592000; v="46,44,43,39"
< accept-ranges: none
<
{
  "access_token": "Removed",
  "expires_in": 3600,
  "token_type": "Bearer"
* Connection #0 to host www.googleapis.com left intact
}

jakeswenson commented 5 years ago

How did other folks resolve this?

Fluent Bit v1.0.4
Copyright (C) Treasure Data

[2019/03/26 15:55:49] [debug] [storage] [cio stream] new stream registered: syslog.0
[2019/03/26 15:55:49] [ info] [storage] initializing...
[2019/03/26 15:55:49] [ info] [storage] in-memory
[2019/03/26 15:55:49] [ info] [storage] normal synchronization mode, checksum disabled
[2019/03/26 15:55:49] [ info] [engine] started (pid=40718)
[2019/03/26 15:55:49] [debug] [engine] coroutine stack size: 65536 bytes (64.0K)
[2019/03/26 15:55:49] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2019/03/26 15:55:49] [debug] [out_stackdriver] JWT signature: <SNIP>
[2019/03/26 15:55:49] [error] [oauth2] could not get an upstream connection
[2019/03/26 15:55:49] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/03/26 15:55:49] [ warn] [out_stackdriver] token retrieval failed
[2019/03/26 15:55:49] [debug] [router] match rule syslog.0:stdout.0
[2019/03/26 15:55:49] [debug] [router] match rule syslog.0:stackdriver.0

I've tracked the error back to this line: https://github.com/fluent/fluent-bit/blob/ba0e6c5b0f44b484dfe06b9b05771ecdd78a61dd/src/flb_oauth2.c#L324. I don't know what can cause flb_upstream_conn_get to fail...

theFroh commented 5 years ago

I was never able to.

edsiper commented 5 years ago

that specific upstream connection error is a TCP connection error reaching the HTTPS end-point.

jakeswenson commented 5 years ago

Thanks for the pointer, @edsiper. In my case this is in a FreeBSD jail, but curl works fine over HTTPS reaching the Google APIs. Any pointers on how to diagnose this SSL/TLS issue? I can try getting a tcpdump to see if that shows any issues...

edsiper commented 5 years ago

@jakeswenson did you try tls.debug N ?:

https://docs.fluentbit.io/manual/configuration/tls_ssl

If you try to do the same thing on a Linux box, does it work? I am wondering if there is an issue on BSD that needs to be fixed.

jakeswenson commented 5 years ago

@edsiper I just tried with that setting and I am not seeing any new output. Does stackdriver respect this tls setting?

Fluent Bit v1.0.4
Copyright (C) Treasure Data

[2019/03/28 13:11:51] [debug] [storage] [cio stream] new stream registered: dummy.0
[2019/03/28 13:11:51] [debug] [storage] [cio stream] new stream registered: syslog.0
[2019/03/28 13:11:51] [ info] [storage] initializing...
[2019/03/28 13:11:51] [ info] [storage] in-memory
[2019/03/28 13:11:51] [ info] [storage] normal synchronization mode, checksum disabled
[2019/03/28 13:11:51] [ info] [engine] started (pid=87027)
[2019/03/28 13:11:51] [debug] [engine] coroutine stack size: 65536 bytes (64.0K)
[2019/03/28 13:11:51] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2019/03/28 13:11:51] [debug] [out_stackdriver] JWT signature: <SNIP>
[2019/03/28 13:11:51] [error] [oauth2] could not get an upstream connection
[2019/03/28 13:11:51] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/03/28 13:11:51] [ warn] [out_stackdriver] token retrieval failed
[2019/03/28 13:11:51] [debug] [router] match rule dummy.0:stdout.0
[2019/03/28 13:11:51] [debug] [router] match rule dummy.0:stackdriver.0
[0] dummy.log: [1553803912.848473852, {"message"=>"dummy"}]
[2019/03/28 13:11:56] [debug] [task] created task=0x801c40300 id=0 OK
[1] dummy.log: [1553803913.852387878, {"message"=>"dummy"}]
[2] dummy.log: [1553803914.863814322, {"message"=>"dummy"}]
[3] dummy.log: [1553803915.908904521, {"message"=>"dummy"}]
[2019/03/28 13:11:56] [debug] [retry] new retry created for task_id=0 attemps=1
[2019/03/28 13:11:56] [debug] [sched] retry=0x801c26f80 0 in 11 seconds

I ran with tls.debug 3

here is my config

[SERVICE]
    Flush 5
    Daemon off
    Log_Level trace
    Coro_Stack_Size 65536
    Parsers_File /usr/local/etc/fluent-bit/parsers.conf
[INPUT]
    Name dummy
    Tag dummy.log
[INPUT]
    Name syslog
    Path /tmp/in_syslog
    Chunk_Size 32
    Buffer_Size 64
    Tag syslog.log
[OUTPUT]
    Name stdout
    Match dummy.*
[OUTPUT]
    Name stackdriver
    Match dummy.*
    google_service_credentials /etc/gcp.creds.json
    resource global
    tls        On
    tls.verify Off
    tls.debug 3

Also, I ran a tcpdump and the only traffic I am getting is DNS requests for www.googleapis.com and logging.googleapis.com (both resolve) and no actual TCP traffic...

i can try to find a linux box to try this on, but it may take some time... until then it seems like the error is in the http library after dns but before actually sending a packet.... any thoughts @edsiper?

edsiper commented 5 years ago

we use a pretty common libc function to resolve DNS:

https://github.com/fluent/fluent-bit/blob/master/src/flb_network.c#L215

Hmm, not sure what it can be, since you should at least see a warning or error message.

jakeswenson commented 5 years ago

I've been able to patch and build my own version of fluent bit to print a bit more logging and try to find where the error is. https://github.com/fluent/fluent-bit/blob/master/src/flb_network.c#L311 this line is failing with errno 22 (EINVAL). I have no idea why or what this means... any thoughts @edsiper?

edsiper commented 5 years ago

EINVAL = invalid argument. Which function returned that? connect()?

jakeswenson commented 5 years ago

yes, connect()
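As a quick reference for reading such numbers, Python's standard `errno` module can translate an errno value into its symbolic name and message. The helper name `describe_errno` is made up for illustration; errno numbering is platform-specific in general, though 22 is EINVAL on both Linux and FreeBSD.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME (code): message' for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


print(describe_errno(22))
```

This only names the error. It does not explain why connect() rejected its arguments, which is the open question at this point in the thread.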

sebbacon commented 5 years ago

This appears to be related to ipv6. If I turn off ipv6 support as follows, things work as expected.

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1

jakeswenson commented 5 years ago

Wait, what? @sebbacon thanks for testing that disabling IPv6 fixes it. I think it's a poor experience if, instead of the plugin filtering out IPv6 addresses when it doesn't support them, I have to modify my machine to disable IPv6 just to run fluent-bit. Can anyone point me at the code at issue so I can try to look into fixing this?

jakeswenson commented 5 years ago

Also, I can verify that I have IPv6 enabled (on loopback...) and that Google (obviously) has an AAAA record:

# host www.googleapis.com                                                                   
www.googleapis.com is an alias for googleapis.l.google.com.                                             
googleapis.l.google.com has address 172.217.3.202                                                       
googleapis.l.google.com has address 172.217.14.202                                                      
googleapis.l.google.com has address 172.217.14.234                                                      
googleapis.l.google.com has IPv6 address 2607:f8b0:400a:803::200a
# ifconfig                                                                                  
lo0: flags=8048<LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384                                          
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>                                           
        inet6 ::1 prefixlen 128 tentative                                                               
        inet6 fe80::1%lo0 prefixlen 64 tentative scopeid 0x1                                            
        inet 127.0.0.1 netmask 0xff000000                                                               
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>                                                       
        groups: lo                                                                                      
epair1b: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500                           
        options=8<VLAN_MTU>                                                                 
        inet 10.0.51.50 netmask 0xffff0000 broadcast 10.0.255.255                                        
        nd6 options=1<PERFORMNUD>                                                                       
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)                                             
        status: active                                                                                  
        groups: epair

arabustams commented 5 years ago

In an environment where IPv6 is prioritized over IPv4, I found that it failed to establish upstream connections to oauth2 and Stackdriver logging, so I reported #1348.

The fixes have been merged into v1.2. If you can use Fluent Bit v1.2, run it with the following setting:

[OUTPUT]
    Name stackdriver
    Match *
    IPv6 On

As with other output plugins, you can specify IPv6 On in the out_stackdriver configuration; the out_stackdriver module then uses IPv6 mode explicitly.

However, oauth2 is a little different. With the fixes, the oauth2 module tries to connect in IPv6 mode if the upstream connection over IPv4 failed. I wonder if it might be better to make the oauth2 module configurable like the output plugins...
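The fallback order described for the oauth2 module (IPv4 first, then IPv6) can be sketched as follows. This is an illustrative Python model only, not the actual Fluent Bit C code; `connect_one` and `connect_with_fallback` are hypothetical names, and the `connect` parameter exists only so the ordering can be exercised without a network.

```python
import socket


def connect_one(host, port, family, timeout=3.0):
    """Connect to the first reachable address of a single address family."""
    last_error = OSError(f"no address of family {family!r} for {host}")
    infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    for fam, socktype, proto, _canon, sockaddr in infos:
        sock = socket.socket(fam, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            sock.close()
            last_error = exc
    raise last_error


def connect_with_fallback(host, port,
                          families=(socket.AF_INET, socket.AF_INET6),
                          connect=connect_one):
    """Try each family in order; return the first successful connection."""
    errors = []
    for family in families:
        try:
            return connect(host, port, family)
        except OSError as exc:
            errors.append((family, exc))
    raise OSError(f"all address families failed for {host}:{port}: {errors}")
```

Making `families` a parameter is one way to get the configurability wondered about above: a caller could pass `(socket.AF_INET6,)` to force IPv6 only.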

arabustams commented 5 years ago

In addition, the out_bigquery plugin probably has the same problem. I was not able to test with BigQuery, and fixing out_stackdriver was enough for me, so I did not fix out_bigquery.

edsiper commented 5 years ago

thanks everyone for the report, I've added ipv6 mode to out_bigquery on 466191c3

jakeswenson commented 5 years ago

I built and ran fluent-bit 1.2.1 on my FreeBSD machine and I'm still getting the same error:

# ./fluent-bit -c /etc/logs.conf
Fluent Bit v1.2.1
Copyright (C) Treasure Data

[2019/07/19 08:43:32] [debug] [storage] [cio stream] new stream registered: dummy.0
[2019/07/19 08:43:32] [debug] [storage] [cio stream] new stream registered: syslog.1
[2019/07/19 08:43:32] [ info] [storage] initializing...
[2019/07/19 08:43:32] [ info] [storage] in-memory
[2019/07/19 08:43:32] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2019/07/19 08:43:32] [ info] [engine] started (pid=43877)
[2019/07/19 08:43:32] [debug] [engine] coroutine stack size: 65536 bytes (64.0K)
[2019/07/19 08:43:32] [ info] [in_syslog] UDP buffer size set to 32768 bytes
[2019/07/19 08:43:32] [debug] [out_stackdriver] JWT signature:
eyJhbG<SNIP>
[2019/07/19 08:43:32] [error] [oauth2] could not get an upstream connection
[2019/07/19 08:43:32] [error] [out_stackdriver] error retrieving oauth2 access token
[2019/07/19 08:43:32] [ warn] [out_stackdriver] token retrieval failed
[2019/07/19 08:43:32] [debug] [router] match rule dummy.0:stdout.0
[2019/07/19 08:43:32] [debug] [router] match rule dummy.0:stackdriver.1
[2019/07/19 08:43:32] [ info] [sp] stream processor started
^C[engine] caught signal (SIGINT)
[2019/07/19 08:43:35] [ info] [input] pausing dummy.0
[2019/07/19 08:43:35] [ info] [input] pausing syslog.1

config:

# cat /etc/logs.conf
[SERVICE]
        Flush 5
        Daemon off
        Log_Level trace
        Coro_Stack_Size 65536
        Parsers_File /usr/local/etc/fluent-bit/parsers.conf
[INPUT]
        Name dummy
        Tag dummy.log
[INPUT]
        Name syslog
        Path /tmp/in_syslog
        Chunk_Size 32
        Buffer_Size 64
        Tag syslog.log
[OUTPUT]
        Name stdout
        Match dummy.*
[OUTPUT]
        Name stackdriver
        Match dummy.*
        google_service_credentials /etc/gcp.creds.json
        resource global
        tls        On
        tls.verify Off
        tls.debug 4
        IPv6 On

It doesn't matter whether I set IPv6 to On or Off: same error.

is there anything else i can do to help debug this?

edsiper commented 5 years ago

Looks like the output above doesn't have trace messages. Would you please re-run it? (I see trace enabled in the config, but I don't see it in the output.)

jakeswenson commented 5 years ago

@edsiper as i'm sure you know trace requires fluent-bit to be built with tracing enabled... https://docs.fluentbit.io/manual/configuration/file#config_section

I'm certain it's not built with that by default, and I need to read up on how it's enabled using the options framework.

Are there any log lines in particular you're looking for from tracing?

edsiper commented 4 years ago

FYI: the Stackdriver output plugin has been improved heavily lately (thanks to the Google team's involvement in the project), so I am closing this ticket. Please create a new one if you still face an issue.

rquinlivan commented 3 years ago

I am still seeing this in 1.7. The stackdriver plugin logs nothing even at trace.

rquinlivan commented 3 years ago

@theFroh @edsiper Can we reopen this issue? I am seeing the same issues with ipv6 reported in this thread. I installed the 1.7.4 amd64 version via the Debian package.

edsiper commented 3 years ago

For new issues, please open a new ticket.

FYI: v1.7.6 was tested extensively with Stackdriver on Google Cloud: a 10-hour run sending 150k messages per second, no issues found.