Haufe-Lexware / wicked.haufe.io

An API Management system based on Mashape Kong
http://wicked.haufe.io

RC6 - Kong adapter getting restarted frequently while we have 2 instances of Kong running in different Kubernetes clusters #207

Closed karthiknaga87 closed 4 years ago

karthiknaga87 commented 5 years ago

Please find the logs below:

2019-07-01T13:28:29.952Z wicked-sdk Attempting to GET http://prd-developer-portal-wicked-api.prd.svc.cluster.local:3001/applications/sample/subscriptions
2019-07-01T13:28:29.952Z wicked-sdk apiGet(): applications/sample
2019-07-01T13:28:29.952Z wicked-sdk apiAction(GET): applications/sample
2019-07-01T13:28:29.952Z wicked-sdk apiAction: Using scope read_plans read_groups read_content health read_templates read_auth_servers read_apis read_users write_users login read_applications write_applications read_subscriptions write_subscriptions read_approvals read_verifications write_verifications webhooks read_registrations write_registrations read_namespaces write_namespaces read_grants write_grants restart_api
2019-07-01T13:28:29.952Z wicked-sdk Picking up machine user id: b6e01df116b5c3152f2878fcce7fefd9857c32c4
2019-07-01T13:28:29.952Z wicked-sdk getInternalApiUrl()
2019-07-01T13:28:29.952Z wicked-sdk GET http://prd-developer-portal-wicked-api.prd.svc.cluster.local:3001/applications/sample
2019-07-01T13:28:29.952Z wicked-sdk Using User-Agent: wicked.portal-kong-adapter/1.0.0-rc.6
2019-07-01T13:28:29.952Z wicked-sdk Attempting to GET http://prd-developer-portal-wicked-api.prd.svc.cluster.local:3001/applications/sample
2019-07-01T13:28:29.952Z wicked-sdk apiGet(): applications/test/subscriptions
2019-07-01T13:28:29.952Z wicked-sdk apiAction(GET): applications/test/subscriptions
2019-07-01T13:28:29.952Z wicked-sdk apiAction: Using scope read_plans read_groups read_content health read_templates read_auth_servers read_apis read_users write_users login read_applications write_applications read_subscriptions write_subscriptions read_approvals read_verifications write_verifications webhooks read_registrations write_registrations read_namespaces write_namespaces read_grants write_grants restart_api
2019-07-01T13:28:29.952Z wicked-sdk Picking up machine user id: b6e01df116b5c3152f2878fcce7fefd9857c32c4
2019-07-01T13:28:29.952Z wicked-sdk getInternalApiUrl()
2019-07-01T13:28:29.952Z wicked-sdk GET http://prd-developer-portal-wicked-api.prd.svc.cluster.local:3001/applications/test/subscriptions
2019-07-01T13:28:29.952Z wicked-sdk Using User-Agent: wicked.portal-kong-adapter/1.0.0-rc.6
2019-07-01T13:28:29.952Z wicked-sdk Attempting to GET http://prd-developer-portal-wicked-api.prd.svc.cluster.local:3001/applications/test/subscriptions
2019-07-01T13:28:29.952Z wicked-sdk apiGet(): applications/test
2019-07-01T13:28:29.952Z wicked-sdk apiAction(GET): applications/test
2019-07-01T13:28:29.952Z wicked-sdk apiAction: Using scope read_plans read_groups read_content health read_templates read_auth_servers read_apis read_users write_users login read_applications write_applications read_subscriptions write_subscriptions read_approvals read_verifications write_verifications webhooks read_registrations write_registrations read_namespaces write_namespaces read_grants write_grants restart_api
2019-07-01T13:28:29.952Z wicked-sdk Picking up machine user id: b6e01df116b5c3152f2878fcce7fefd9857c32c4
2019-07-01T13:28:29.952Z wicked-sdk getInternalApiUrl()
2019-07-01T13:28:29.952Z wicked-sdk GET http://prd-eih-developer-portal-wicked-api.prd.svc.cluster.local:3001/applications/test
2019-07-01T13:28:29.952Z wicked-sdk Using User-Agent: wicked.portal-kong-adapter/1.0.0-rc.6
2019-07-01T13:28:29.952Z wicked-sdk Attempting to GET http://prd-eih-developer-portal-wicked-api.prd.svc.cluster.local:3001/applications/test
error: [+ 199ms] kong-adapter:kong-adapter      Could not initialize Kong adapter.
/usr/src/app/dist/bin/kong-adapter.js:56
            throw err;
            ^

Error: kongAction GET on consumers/392ecf2f-0b9b-4149-aeca-4abbcb2949e6/hmac-auth did not return the expected status code (got: 500, expected: 200).
    at Request._callback (/usr/src/app/dist/kong/utils.js:240:29)
    at Request.self.callback (/usr/src/app/node_modules/request/request.js:185:22)
    at Request.emit (events.js:198:13)
    at Request.<anonymous> (/usr/src/app/node_modules/request/request.js:1161:10)
    at Request.emit (events.js:198:13)
    at IncomingMessage.<anonymous> (/usr/src/app/node_modules/request/request.js:1083:12)
    at Object.onceWrapper (events.js:286:20)
    at IncomingMessage.emit (events.js:203:15)
    at endReadableNT (_stream_readable.js:1129:12)
    at process._tickCallback (internal/process/next_tick.js:63:19)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! portal-kong-adapter@1.0.0-rc.6 start: `node ./dist/bin/kong-adapter.js`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the portal-kong-adapter@1.0.0-rc.6 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
npm ERR! A complete log of this run can be found in:
npm ERR!     /home/wicked/.npm/_logs/2019-07-01T13_28_29_996Z-debug.log

DonMartin76 commented 5 years ago

What does the setup look like? Apparently, Kong is returning a 500 for some reason, but it's not obvious what that reason is. Can you also check the logs of Kong, please?

fromanu commented 5 years ago

We experienced the same behaviour (the Kong adapter is restarting frequently). We have 2 instances of Kong (and wicked) running in 2 different clusters and talking to the same Postgres instance, but the return code is 400, not 500. Here are the logs:

info: [+3565ms] kong-adapter:monitor           Monitor: Kong is answering.
error: [+ 601ms] kong-adapter:kong-adapter      Could not initialize Kong adapter.
/usr/src/app/dist/bin/kong-adapter.js:56
            throw err;
            ^

Error: kongAction GET on services/search/plugins?consumer_id=7fe72717-647b-40d7-b3ef-db0539dc172b did not return the expected status code (got: 400, expected: 200).
    at Request._callback (/usr/src/app/dist/kong/utils.js:240:29)
    at Request.self.callback (/usr/src/app/node_modules/request/request.js:185:22)
    at Request.emit (events.js:198:13)
    at Request.<anonymous> (/usr/src/app/node_modules/request/request.js:1161:10)
    at Request.emit (events.js:198:13)
    at IncomingMessage.<anonymous> (/usr/src/app/node_modules/request/request.js:1083:12)
    at Object.onceWrapper (events.js:286:20)
    at IncomingMessage.emit (events.js:203:15)
    at endReadableNT (_stream_readable.js:1129:12)
    at process._tickCallback (internal/process/next_tick.js:63:19)
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! portal-kong-adapter@1.0.0-rc.6 start: `node ./dist/bin/kong-adapter.js`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the portal-kong-adapter@1.0.0-rc.6 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/wicked/.npm/_logs/2019-07-01T14_28_15_707Z-debug.log

Here is also the list of pods, where you can see that the kong-adapter pod is restarting frequently:

apim-wicked-api-5d589554f6-q6bft           1/1     Running            0          12m
apim-wicked-auth-6dd5cc4f7d-l9z7w          1/1     Running            0          12m
apim-wicked-kong-77675799bd-fhqrw          1/1     Running            0          8m58s
apim-wicked-kong-77675799bd-mjlpg          1/1     Running            0          12m
apim-wicked-kong-adapter-9b488fd88-bzd5m   0/1     CrashLoopBackOff   6          12m
apim-wicked-mailer-775c8b898d-gnk9l        1/1     Running            0          12m
apim-wicked-portal-7f7f976778-z795d        1/1     Running            0          12m
apim-wicked-redis-c546b99cc-p2jfd          1/1     Running            0          12m

DonMartin76 commented 5 years ago

This is what I also had in mind, which is why I wanted to know how the OP's setup is structured.

karthiknaga87 commented 5 years ago

Hi, thanks for the reply. We have 2 instances of the wicked portal and Kong running in 2 different Kubernetes clusters, but both referring to the same Postgres DB.

In addition, I don't find any 500 error logged in the Kong pod. Attaching the last 1k lines of the Kong pod's log. Let me know if more details are required.

Thanks, Karthik N
log-kong-po.txt

DonMartin76 commented 5 years ago

That's not the Kong pod; this is a log extract from the api pod. The interesting bits can be found in the logs of the kong and/or kong-adapter pods.

DonMartin76 commented 5 years ago

The funny thing is that this should not happen, and usually it doesn't. The main mechanism is that the api detects that the configuration has changed and then tells all other instances (pointing at the same database) to restart. If the restarted instance then calculates a different configuration hash, this will ping-pong endlessly, just the way you describe it.
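
For illustration, a minimal sketch of that ping-pong in TypeScript; all names here (ConfigStore, checkConfiguration) are hypothetical and do not reflect wicked's actual implementation:

// Hypothetical sketch of the restart ping-pong described above.
import * as crypto from 'crypto';

interface ConfigStore {
    getConfigHash(): Promise<string | null>;     // hash currently stored next to the shared DB
    setConfigHash(hash: string): Promise<void>;  // store the hash this instance calculated
}

// Hash of the locally derived configuration; any difference between the two
// clusters (env vars, static config, ...) yields a different hash.
function calculateConfigHash(config: object): string {
    return crypto.createHash('sha256').update(JSON.stringify(config)).digest('hex');
}

async function checkConfiguration(store: ConfigStore, localConfig: object): Promise<void> {
    const localHash = calculateConfigHash(localConfig);
    const storedHash = await store.getConfigHash();
    if (storedHash && storedHash !== localHash) {
        // Another instance (e.g. the one in the other cluster) wrote a different
        // hash: exit so the orchestrator restarts this pod. If both instances keep
        // calculating different hashes, the pods ping-pong into CrashLoopBackOff.
        console.error('Configuration hash mismatch, restarting.');
        process.exit(1);
    }
    await store.setConfigHash(localHash);
}

In a setup like this, if the two clusters derive even slightly different configurations, the stored hash never stabilizes, which would match the frequent restarts reported here.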

fromanu commented 5 years ago

Hi Martin,

Yesterday this problem appeared again (the clusters were recreated a couple of times and the problem seemed to be gone, until yesterday). I managed to work around it by deleting both the kong and kong-adapter pods; when Kubernetes recreated the pods, the kong-adapter pod restarted 3 times and after that everything was fine. Please find attached the logs from the kong and kong-adapter pods. kong-logs.txt kong-adapter-logs.txt

fromanu commented 5 years ago

Another interesting thing: I just checked the logs of the API pod and found the error below, which disappeared when the kong-adapter pod became healthy and stopped restarting.

{ error: remaining connection slots are reserved for non-replication superuser connections
    at Connection.parseE (/usr/src/app/node_modules/pg/lib/connection.js:553:11)
    at Connection.parseMessage (/usr/src/app/node_modules/pg/lib/connection.js:378:19)
    at TLSSocket.<anonymous> (/usr/src/app/node_modules/pg/lib/connection.js:119:22)
    at TLSSocket.emit (events.js:198:13)
    at addChunk (_stream_readable.js:288:12)
    at readableAddChunk (_stream_readable.js:269:11)
    at TLSSocket.Readable.push (_stream_readable.js:224:10)
    at TLSWrap.onStreamRead [as onread] (internal/stream_base_commons.js:94:17)
  name: 'error',
  length: 140,
  severity: 'FATAL',
  code: '53300',
  detail: undefined,
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: undefined,
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'postinit.c',
  line: '837',
  routine: 'InitPostgres' }

DonMartin76 commented 5 years ago

Aha! Now this is actually really interesting - this is probably one of the main causes in your case: your Postgres database connections are depleted.

This can occur in the following situation: the Kong Adapter starts up and subsequently hammers Kong with lots of requests to Kong's admin port (8001). Kong is somewhat wasteful with its Postgres connections and can use quite a lot of them if you make many calls to your Kong instance (which is exactly what the Kong Adapter does).

If you're running your own Postgres server, upping max_connections to e.g. 500 is a good idea. If you're running Postgres as a service, this is trickier, as the version of Kong we are using (0.14.1) does not offer any way to influence the number of connections Kong uses.

This will come as soon as we have managed to move to Kong 1.2+, where it is possible to limit the number of connections.
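
To verify that connections are indeed being depleted, one can compare the number of open Postgres connections with the server's max_connections. A minimal diagnostic sketch using the pg module (visible in the stack trace above); the function name and connection string are assumptions, not part of wicked:

// Hypothetical diagnostic helper: counts open Postgres connections and prints
// them next to the server's max_connections setting.
import { Client } from 'pg';

async function checkConnectionHeadroom(connectionString: string): Promise<void> {
    const client = new Client({ connectionString });  // connection string is an assumption
    await client.connect();
    try {
        const used = await client.query('SELECT count(*) AS n FROM pg_stat_activity');
        const max = await client.query('SHOW max_connections');
        console.log(`connections in use: ${used.rows[0].n}, max_connections: ${max.rows[0].max_connections}`);
    } finally {
        await client.end();
    }
}

// Example usage (assumed host and credentials):
// checkConnectionHeadroom('postgres://kong:secret@your-postgres-host:5432/kong');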

fromanu commented 5 years ago

We're using Azure Database for PostgreSQL, which is a database as a service, so I guess the only thing we can do is wait until you move to Kong 1.2+.

DonMartin76 commented 5 years ago

We're using Azure Database for PostgreSQL, which is a database as a service, so I guess the only thing we can do is wait until you move to Kong 1.2+.

... or go for a larger instance.

fromanu commented 5 years ago

I scaled up our Postgres-as-a-service instance to 2 vCores, and now the maximum connection limit is 100 (with 1 vCore only 50 connections were allowed). It seems that this change solved the issue.

DonMartin76 commented 4 years ago

These issues should be resolved as of rc.12 - except for the DB connection topic.