apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0

Question: Controllers standby test #5416

Open p-jonghyun opened 1 year ago

p-jonghyun commented 1 year ago

https://github.com/apache/openwhisk/blob/ba871e59f7b77f02689a13e4e24e438645d67a47/tests/src/test/scala/ha/ShootComponentsTests.scala#L142-L145

The 'use controller1 if controller0 goes down' test in ShootComponentsTests.scala performs the following steps:

  1. restart controller0 container
  2. /ping controller0 until it’s down
  3. /ping controller1 to check it’s still up
  4. Send (invoke action via POST + get action via GET) × 96 to nginx

Isn't there a case where nginx forwards the first POST request to controller0 while it is not yet ready to take requests (that is, the container is up but the backend inside the container is not)?

Such behavior would result in a "connection reset by peer" error, which is not in the Swagger spec, so the test would fail.
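
To make the concern concrete, here is a minimal simulation of the suspected race. It is not OpenWhisk code: `Backend`, `RoundRobinProxy`, and the naive round-robin behavior without health checks are all hypothetical, and may not match how nginx is actually configured in the test environment.

```scala
// Minimal simulation of the suspected race. All names are hypothetical.
object StandbyRaceSketch {
  // A backend whose container is running may still have no server listening.
  final case class Backend(name: String, containerUp: Boolean, serverReady: Boolean)

  sealed trait Outcome
  final case class HttpResponse(status: Int, servedBy: String) extends Outcome
  final case class ConnectionError(reason: String, target: String) extends Outcome

  // Assumption: the proxy alternates between backends and knows nothing
  // about whether the server inside each container is ready.
  final class RoundRobinProxy(backends: Vector[Backend]) {
    private var next = 0

    def forward(): Outcome = {
      val b = backends(next % backends.size)
      next += 1
      if (b.containerUp && b.serverReady) HttpResponse(200, b.name)
      else ConnectionError("connection reset by peer", b.name)
    }
  }

  // controller0 has just been restarted: container up, server not ready yet.
  def run(): Vector[Outcome] = {
    val proxy = new RoundRobinProxy(
      Vector(
        Backend("controller0", containerUp = true, serverReady = false),
        Backend("controller1", containerUp = true, serverReady = true)))
    Vector(proxy.forward(), proxy.forward())
  }
}
```

Under these assumptions, the first forwarded request hits controller0 and produces a connection error rather than an HTTP response, while the second is served by controller1.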

style95 commented 1 year ago

I believe the test case does not guarantee no request is sent to the failed controller. It allows some level of unsuccessful requests. https://github.com/apache/openwhisk/blob/ba871e59f7b77f02689a13e4e24e438645d67a47/tests/src/test/scala/ha/ShootComponentsTests.scala#L172

p-jonghyun commented 1 year ago

@style95 Thanks for the response!

As far as I understand, this test case allows unsuccessful invokes only if their response matches the Swagger spec. https://github.com/apache/openwhisk/blob/ba871e59f7b77f02689a13e4e24e438645d67a47/tests/src/test/scala/common/rest/WskRestOperations.scala#L1191-L1198

"Connection reset by peer" is not one of those responses, so it would make the test fail.
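
A rough sketch of the distinction I mean, in plain Scala. The status codes in `allowedStatusCodes` are placeholders chosen for illustration, not the actual list from the linked lines, and `isAcceptable` is a hypothetical helper rather than anything in WskRestOperations.

```scala
import scala.util.{Failure, Success, Try}

object ResponseCheckSketch {
  // Placeholder values only: the real allowed codes are defined in the
  // linked WskRestOperations lines and are not reproduced here.
  val allowedStatusCodes: Set[Int] = Set(200, 202, 502)

  // A request attempt either yields an HTTP status code or throws before
  // any HTTP response exists at all.
  def isAcceptable(attempt: Try[Int]): Boolean = attempt match {
    case Success(status) => allowedStatusCodes.contains(status)
    case Failure(_)      => false // e.g. an IOException for a reset connection
  }
}
```

The point is that a reset connection never becomes a status code, so no list of tolerated status codes can cover it.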

style95 commented 1 year ago

Do you see any test failures? In my recent build, it was successful. https://github.com/apache/openwhisk/actions/runs/5029161559/jobs/9020540100#step:4:8830

I am curious under which conditions it failed.

p-jonghyun commented 1 year ago

It looks like the OW system test build does not use multiple controllers.

https://github.com/apache/openwhisk/blob/ba871e59f7b77f02689a13e4e24e438645d67a47/ansible/environments/local/hosts.j2.ini#L9-L13

Therefore, the test case would be skipped in that build.

https://github.com/apache/openwhisk/blob/ba871e59f7b77f02689a13e4e24e438645d67a47/tests/src/test/scala/ha/ShootComponentsTests.scala#L143-L148
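
The kind of guard I have in mind looks roughly like the sketch below. It assumes the controller hosts are available as a comma-separated string; `controllerHosts` and `shouldRunStandbyTest` are hypothetical names, not the helpers used in the linked test.

```scala
object ControllerCountSketch {
  // Split a comma-separated host list, ignoring blanks and stray whitespace.
  def controllerHosts(raw: String): Vector[String] =
    raw.split(",").map(_.trim).filter(_.nonEmpty).toVector

  // A standby test is only meaningful when at least two controllers exist.
  def shouldRunStandbyTest(raw: String): Boolean =
    controllerHosts(raw).size > 1
}
```

In a ScalaTest suite, a condition like this would typically be wrapped in `assume(...)`, which cancels the test instead of failing it, so a single-controller build would still report success.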

style95 commented 1 year ago

Indeed. We set up the environment in HA mode. https://github.com/apache/openwhisk/blob/ba871e59f7b77f02689a13e4e24e438645d67a47/tools/travis/setupPrereq.sh#L29 However, it seems controller1 is always disabled. According to the PR, it looks like this is not intended.

I think we need to enable controller1 in the CI environment and fix the system test if required. @p-jonghyun Thank you for reporting this.

style95 commented 1 year ago

It seems the test passed after enabling the second controller. https://github.com/apache/openwhisk/actions/runs/5132045151/jobs/9233975717#step:4:8941