cloudfoundry-community / splunk-firehose-nozzle

Send CF component metrics, CF app logs, and CF app metrics to Splunk
Apache License 2.0
29 stars 29 forks source link

Release 1.2.3 Breaks on Oauth Token Retrieval from cfapi #302

Closed gaigaslab-operations closed 2 years ago

gaigaslab-operations commented 2 years ago

We upgraded to Tile 1.2.3 from 1.2.2 Immediately, the splunk-nozzle-1.2.3 app started crashing after reporting that it could not get a token from the api endpoint - authorization error.

I don't have the logs right now, since we reverted to 1.2.2 to recover. I will reproduce on our sandbox foundation and provide the logs.

gaigaslab-operations commented 2 years ago

Logs of splunk-firehose-nozzle-1.2.3 after upgrading from 1.2.2 to 1.2.3

2021-10-21T14:01:47.753-04:00 [CELL/0] [OUT] Downloading droplet... 2021-10-21T14:01:47.760-04:00 [CELL/1] [OUT] Cell 615ce47c-66d0-4e88-a417-6e7ea554053d successfully created container for instance ea1a13fa-a03c-4c90-6502-1696 2021-10-21T14:01:47.962-04:00 [CELL/0] [OUT] Downloaded droplet (5.9M) 2021-10-21T14:01:48.113-04:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1634839308.113173723","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Running splunk-firehose-nozzle","log_level":1,"data":{}} 2021-10-21T14:01:48.113-04:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1634839308.113698244","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Apps are not being cached. When apps are not cached, the org and space caching TTL is ineffective","log_level":1,"data":{}} 2021-10-21T14:01:48.114-04:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1634839308.114690542","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Running splunk-firehose-nozzle with following configuration variables ","log_level":1,"data":{"add-app-info":"[\"AppName\", \"OrgName\", \"SpaceName\"]","add-tags":false,"api-endpoint":"https://api.****","app-cache-ttl":0,"app-limits":0,"batch-size":100,"boltdb-path":"cache.db","branch":"master","buildos":"Darwin","commit":"fcbda2da45cb659c677174b95b7a06c0afa81a96","debug":false,"drop-warn-threshold":1000,"extra-fields":"foundation:cfsb-1","flush-interval":2000000000,"hec-workers":8,"ignore-missing-apps":true,"job-host":"","job-index":"-1","job-name":"splunk-nozzle","keep-alive":25000000000,"mem-queue-monitor-interval":0,"missing-app-cache-ttl":0,"org-space-cache-ttl":32400000000000,"queue-size":10000,"retries":3,"skip-ssl-cf":false,"skip-ssl-splunk":false,"splunk-host":"https://cloud.us.humio.com:443/api/v1/ingest/hec","splunk-index":"cfsb-1","subscription-id":"splunk-firehose","trace-logging":false,"version":"","wanted-events":"[\"LogMessage\", \"Error\"]"}} 2021-10-21T14:01:48.129-04:00 [CELL/1] [OUT] Downloading droplet... 2021-10-21T14:01:48.330-04:00 [CELL/1] [OUT] Downloaded droplet (5.9M) 2021-10-21T14:01:48.395-04:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1634839308.395050049","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Retrieving apps from remote","log_level":1,"data":{}} 2021-10-21T14:01:48.495-04:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1634839308.495010138","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Running splunk-firehose-nozzle","log_level":1,"data":{}} 2021-10-21T14:01:48.496-04:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1634839308.495557308","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Apps are not being cached. When apps are not cached, the org and space caching TTL is ineffective","log_level":1,"data":{}} 2021-10-21T14:01:48.496-04:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1634839308.496112347","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Running splunk-firehose-nozzle with following configuration variables ","log_level":1,"data":{"add-app-info":"[\"AppName\", \"OrgName\", \"SpaceName\"]","add-tags":false,"api-endpoint":"https://api.****","app-cache-ttl":0,"app-limits":0,"batch-size":100,"boltdb-path":"cache.db","branch":"master","buildos":"Darwin","commit":"fcbda2da45cb659c677174b95b7a06c0afa81a96","debug":false,"drop-warn-threshold":1000,"extra-fields":"foundation:cfsb-1","flush-interval":2000000000,"hec-workers":8,"ignore-missing-apps":true,"job-host":"","job-index":"-1","job-name":"splunk-nozzle","keep-alive":25000000000,"mem-queue-monitor-interval":0,"missing-app-cache-ttl":0,"org-space-cache-ttl":32400000000000,"queue-size":10000,"retries":3,"skip-ssl-cf":false,"skip-ssl-splunk":false,"splunk-host":"https://cloud.us.humio.com:443/api/v1/ingest/hec","splunk-index":"cfsb-1","subscription-id":"splunk-firehose","trace-logging":false,"version":"","wanted-events":"[\"LogMessage\", \"Error\"]"}} 2021-10-21T14:01:48.749-04:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1634839308.748988390","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Retrieving apps from remote","log_level":1,"data":{}} 2021-10-21T14:01:58.751-04:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1634839318.751422882","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed to open App Cache","log_level":2,"data":{}} 2021-10-21T14:01:58.817-04:00 [APP/PROC/WEB/0] [OUT] {"timestamp":"1634839318.817667723","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed to run splunk-firehose-nozzle","log_level":2,"data":{"error":"Error requesting apps: Get \"https://api.****/v2/apps?inline-relations-depth=0\": oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"unauthorized\",\"error_description\":\"Bad credentials\"}"}} 2021-10-21T14:01:59.130-04:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1634839319.130104303","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed to open App Cache","log_level":2,"data":{}} 2021-10-21T14:01:59.197-04:00 [APP/PROC/WEB/1] [OUT] {"timestamp":"1634839319.197067022","source":"splunk-nozzle-logger","message":"splunk-nozzle-logger.Failed to run splunk-firehose-nozzle","log_level":2,"data":{"error":"Error requesting apps: Get \"https://api.****/v2/apps?inline-relations-depth=0\": oauth2: cannot fetch token: 401 Unauthorized\nResponse: {\"error\":\"unauthorized\",\"error_description\":\"Bad credentials\"}"}} 2021-10-21T14:02:08.770-04:00 [APP/PROC/WEB/0] [OUT] Exit status 0 2021-10-21T14:02:09.444-04:00 [APP/PROC/WEB/1] [OUT] Exit status 0 2021-10-21T14:02:13.946-04:00 [CELL/0] [OUT] Cell f5d6edfb-179d-49fa-99ef-895ee714d7ba stopping instance 1054312b-c536-4816-5fcd-ae25 2021-10-21T14:02:13.947-04:00 [CELL/0] [OUT] Cell f5d6edfb-179d-49fa-99ef-895ee714d7ba destroying container for instance 1054312b-c536-4816-5fcd-ae25 2021-10-21T14:02:13.953-04:00 [API/0] [OUT] Process has crashed with type: "web" 2021-10-21T14:02:13.961-04:00 [API/0] [OUT] App instance exited with guid 3dbfc916-94fd-4419-8d45-2b8891ef7b88 payload: {"instance"=>"1054312b-c536-4816-5fcd-ae25", "index"=>0, "cell_id"=>"f5d6edfb-179d-49fa-99ef-895ee714d7ba", "reason"=>"CRASHED", "exit_description"=>"Codependent step exited", "crash_count"=>2, "crash_timestamp"=>1634839333935809636, "version"=>"ca55885a-3c59-447a-bdd0-58a2211bd966"} 2021-10-21T14:02:13.966-04:00 [CELL/0] [OUT] Cell 0996735d-3a84-4f5e-b0bd-310cbd0ffbc4 creating container for instance 5cbdc801-d202-49aa-57a9-c815 2021-10-21T14:02:14.193-04:00 [PROXY/0] [OUT] Exit status 137 2021-10-21T14:02:14.226-04:00 [CELL/0] [OUT] Cell f5d6edfb-179d-49fa-99ef-895ee714d7ba successfully destroyed container for instance 1054312b-c536-4816-5fcd-ae25

gaigaslab-operations commented 2 years ago

I have confirmed that the environment variables for CLIENT_ID and CLIENT_SECRET are the same on 1.2.2 (because it is working) and on 1.2.3

xyloman commented 2 years ago

I can confirm that we experienced the same behavior in our test environment after taking 1.2.3.

gaigaslab-operations commented 2 years ago

I just removed all of 1.2.3, 1.2.2 and deployed a fresh instance of 1.2.3.

The problem is identical. This indicates that it is probably not due to the migration/upgrade changes.

I couldn't see anything in the release diffs that would cause this issue.

Thank you @xyloman for confirming.

labrown commented 2 years ago

Perhaps an update in one of the libraries the app uses?

JuergenSu commented 2 years ago

HI, i can confirm we are on Nozzle release 1.2.3 and Tanzu Applicaiton Services 2.11.8 which is CC api 2.164.0 and the nozzle runs fine without this issue.

i know from an other user that they also dont experience this issue with TAS 2.11.X

kashyap-splunk commented 2 years ago

@gaigaslab-operations Thanks for reporting this. Can you please provide more details of the environment ? like TAS version, so that we can try to reproduce. (We had verified in TAS 2.11 and 2.7 LTS, and this issue was not there.)

gaigaslab-operations commented 2 years ago

TAS 2.11.8 LTS

gaigaslab-operations commented 2 years ago

Since the problem is about authorization, how is the app getting a token? Is it hitting uaa.?

gaigaslab-operations commented 2 years ago

@kashyap-splunk @JuergenSu Can you send over a redacted version of your environment for the splunk-firehose-nozzle-1.2.3? (cf env)

JuergenSu commented 2 years ago

Hi, this is our env

cf7 env splunk-firehose-nozzle-1.2.3
Getting env variables for app splunk-firehose-nozzle-1.2.3 in org Splunk-Nozzle-Org / space splunk-nozzle-space as [*****]...
System-Provided:
VCAP_SERVICES: {}

VCAP_APPLICATION: {
 "application_id": "6ac0ee8f-b8f3-4469-aa2a-9f087f667026",
 "application_name": "splunk-firehose-nozzle-1.2.3",
 "application_uris": [],
 "cf_api": "[API ENDPOINT]",
 "limits": {
  "fds": 16384
 },
 "name": "splunk-firehose-nozzle-1.2.3",
 "organization_id": "685cac63-fd0e-4a43-af77-fd7ca5939a46",
 "organization_name": "splunk-nozzle-org",
 "space_id": "1e1f6592-5bd8-46d8-b38e-2cb72c424066",
 "space_name": "splunk-nozzle-space",
 "uris": [],
 "users": null
}

User-Provided:
ADD_APP_INFO: ["AppName", "OrgName", "OrgGuid", "SpaceName", "SpaceGuid"]
ADD_TAGS: false
API_ENDPOINT: [*****]
APP_CACHE_INVALIDATE_TTL: 60s
APP_LIMITS: 5000
AUTHOR: Shubham Jain
CC_HOST: [*****]
CF_ORG: splunk-nozzle-org
CF_SKIP_SSL: false
CF_SPACE: splunk-nozzle-space
CF_TARGET: [*****]
CLIENT_ID: splunk-nozzle
CLIENT_SECRET: [*****]
CONSUMER_QUEUE_SIZE: 10000
DROP_WARN_THRESHOLD: 1000
ENABLE_EVENT_TRACING: true
EVENTS: ["HttpStartStop", "LogMessage", "Error", "ContainerMetric"]
EXTRA_FIELDS: longevity:testing
FIREHOSE_KEEP_ALIVE: 25s
FIREHOSE_SUBSCRIPTION_ID: splunk-firehose
FLUSH_INTERVAL: 5s
GOPACKAGENAME: main
HEC_BATCH_SIZE: 250
HEC_RETRIES: 5
HEC_WORKERS: 9
IGNORE_MISSING_APP: false
LOGIN_HOST: [*****]
MISSING_APP_CACHE_INVALIDATE_TTL: 300s
NOZZLE_MEMORY: 256M
ORG_SPACE_CACHE_INVALIDATE_TTL: 300s
ROOT: $HOME
SCALE_OUT_NOZZLE: 12
SCHEME: https
SECURITY_USER_NAME: [*****]
SECURITY_USER_PASSWORD: [*****]
SKIP_SSL_VALIDATION_CF: false
SKIP_SSL_VALIDATION_SPLUNK: false
SPLUNK_HOST: [*****]
SPLUNK_INDEX: [*****]
SPLUNK_TOKEN: [*****]
STATUS_MONITOR_INTERVAL: 0s
UAA_HOST: [*****]
VERIFY_SSL: false

Running Environment Variable Groups:
FOUNDATION: DEV

No staging env variables have been set

and its running fine with 2.11.8-build.13

kashyap-splunk commented 2 years ago

@gaigaslab-operations for auth, app uses the Client ID/secret provided to get token from uaa. Here, the Client ID and secret pair is generated using either 'uaac' or deployment manifest. (For ref: pivotal docs or Github docs)

And, I have attached the env details of the setup I used for testing. PCF Splunk nozzle test environment.pdf

Can you try generating new client id/secret pair using above steps and use them in your test/dev setup ?

gaigaslab-operations commented 2 years ago

@kashyap-splunk We had already created a uaa oauth client and were using it with 1.2.2. When we upgraded to 1.2.3, that's where the problem started. When we backlevel to 1.2.2, it works again.

The differences between my environment and that of @JuergenSu are as follows:

(my configuration on the left, @JuergenSu in parenthesis following)

ADD_APP_INFO: ["AppName", "OrgName", "SpaceName"] (["AppName", "OrgName", "OrgGuid", "SpaceName", "SpaceGuid"]) APP_CACHE_INVALIDATE_TTL: 0s (60s) APP_LIMITS: 0 (5000) ENABLE_EVENT_TRACING: false (true) FIREHOSE_SUBSCRIPTION_ID: (splunk-firehose) FLUSH_INTERVAL: 2s (5s) HEC_BATCH_SIZE: 100 (250) HEC_RETRIES: 3 (5) HEC_WORKERS: 8 (9) IGNORE_MISSING_APP: true (false) MISSING_APP_CACHE_INVALIDATE_TTL: 0s (300s) ORG_SPACE_CACHE_INVALIDATE_TTL: 72h (300s)

I modified my config to match, restaged, and still had the problem.

My go buildpack is: go_buildpack-cached-cflinuxfs3-v1.9.37+1634589969.zip

I validated that the CLIENT_ID and CLIENT_SECRET are correct by retrieving a token through curl:

curl -d 'grant_type=client_credentials&client_id=humio-splunk-nozzle&client_secret=*&token_format=jwt&response_type=token' https://uaa.sys.**/oauth/token {"access_token":"","token_type":"bearer","expires_in":43199,"scope":"cloud_controller.admin_read_only doppler.firehose","jti":"2914ccb3b97e4f29967b38b34e8db2e7"}

Now, I do have special characters in the secret such as *!#$&@^. Is it possible that these are problematic?

gaigaslab-operations commented 2 years ago

That was it!

I changed the CLIENT_SECRET on the uaac client and in the config (to match) and we can now start and run 1.2.3.

There must have been a change in the CLIENT_SECRET handling so that special characters are no longer handled correctly.

It would be good to get this fixed or documented. Any valid UAA client secret should be valid in the firehose-nozzle

kashyap-splunk commented 2 years ago

Yes, I am able to reproduce this now with special characters. And I verified it was not there in 1.2.2.

We have not changed anything related to special characters handling. But from initial analysis, it seems that this is due to an issue in the specific version of the Oauth2 library used by 'cfclient' library we have used for authentication. (which was automatically determined by Go mods during migration from Glide). I will dig deeper and update.

luckyj5 commented 2 years ago

Fixed and available with 1.2.4 release