apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0

How to correctly modify OpenWhisk’s gateway blocking limit from source code (currently 60s) #5467

Open QWQyyy opened 6 months ago

QWQyyy commented 6 months ago

Our project runs a large number of HTTP request tests, and our functions (modern machine learning workflows) have relatively long execution times. Our team therefore believes we need to build the OpenWhisk Docker image from source and raise the 1-minute gateway blocking limit. We have made the following source code modifications for this purpose:

  1. In the controller source, we modified application.conf and set request-timeout = 36000s. [image]
  2. In the controller source, we also modified reference.conf. [image]
  3. In the common Scala module, we also modified application.conf. [image] [image]

After completing the above three modifications, we built the Docker image with ./gradlew :core:controller:distDocker and replaced the image used by our K8s OpenWhisk cluster. We also modified values.yaml.

We made sure that the contents of values.yaml meet the requirements: [image]

We also made some modifications to nginx, including the default.conf inside the nginx:1.21.1 image and the read-only nginx.conf configured through nginx-configmap: [image] [image]

However, when we tested with curl, the changes did not seem to take full effect. The request got past the blocking limit of the nginx gateway, but the following information was printed: [image]

So, what modifications do we need to make to overcome this problem? Looking forward to your guidance!
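One way to reason about this symptom: a request passes through several components in sequence, and whichever one has the smallest timeout decides when the client is cut off. The Python sketch below only illustrates that idea. The layer names and values are assumptions made up for the example, not settings read from any OpenWhisk configuration.

```python
from typing import Dict, Tuple


def effective_timeout(layer_timeouts_s: Dict[str, float]) -> Tuple[str, float]:
    """Return the layer with the smallest timeout and that timeout in seconds."""
    if not layer_timeouts_s:
        raise ValueError("at least one layer is required")
    # The request is cut off by whichever layer gives up first.
    name = min(layer_timeouts_s, key=layer_timeouts_s.get)
    return name, layer_timeouts_s[name]


# Hypothetical values: three layers raised to 36000s, one left at 60s.
layers = {
    "nginx": 36000.0,
    "controller_request_timeout": 36000.0,
    "api_gateway": 36000.0,
    "unmodified_layer": 60.0,
}

print(effective_timeout(layers))  # ('unmodified_layer', 60.0)
```

If the observed cutoff is still 60 seconds after raising several settings, this suggests at least one layer on the request path still has its original value.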

QWQyyy commented 6 months ago

@style95 Could you please give me some guidance?

style95 commented 6 months ago

It seems you are using the API gateway, could you check again without the API gateway first?

QWQyyy commented 6 months ago

[image]

style95 commented 6 months ago

You are supposed to be able to invoke the action with wsk. I think that's the starting point to look into.

QWQyyy commented 6 months ago

I'm sure that my gateway can correctly keep an end-to-end request open for longer than 60 seconds. I have also explicitly configured the controller, and I can't find any other place where a timeout needs to be configured. Can you give me some suggestions?

QWQyyy commented 6 months ago

wsk does work, but it only lets us view the information recorded in the activation. We would prefer to serve our requests directly through gateway HTTP requests.

QWQyyy commented 6 months ago

[image]

QWQyyy commented 6 months ago

It seems that I should also pay attention to apigateway: [image]

QWQyyy commented 6 months ago

[image]

At the same time, how should I correctly configure the execution time limit? I set 500 minutes in values.yaml, but when I use wsk to set an execution time limit of 500 minutes, the invoker log prints 600s.

QWQyyy commented 6 months ago

I am currently studying the OpenWhisk source code in depth, and I hope to make some solid changes.

style95 commented 6 months ago

@QWQyyy First, IIRC, the timeout of the Kubernetes client in the above log is related to pod creation, not to the execution of an activation. The action timeout controls the execution timeout against the pod (container).

https://github.com/apache/openwhisk/blob/5529cc49d31f135dfdac4f2a2072ca46bfd754de/core/invoker/src/main/scala/org/apache/openwhisk/core/containerpool/ContainerProxy.scala#L834

I think you need to ensure you can invoke your action successfully with the wsk action invoke command. If you can successfully invoke your action without the API gateway, then the culprit is the API gateway. If your action is invoked fine but is switched to asynchronous mode (you get a 202 response) at some point, it's related to the controller configuration. If you can't invoke your action even in asynchronous mode, you may not have configured the action timeout properly.
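The triage order above can be written down as a small decision function. This Python sketch is only an illustration of the reasoning. The names `BlockingResult` and `likely_culprit` are hypothetical and are not part of OpenWhisk or the wsk CLI.

```python
from enum import Enum


class BlockingResult(Enum):
    """Outcome of a blocking invoke made directly with wsk (no API gateway)."""
    COMPLETED = "completed"        # the result came back in the same request
    BECAME_ASYNC = "became_async"  # the request turned into a 202 response
    FAILED = "failed"              # the invoke did not succeed at all


def likely_culprit(blocking: BlockingResult, async_invoke_ok: bool) -> str:
    """Map the observed outcomes to the component to inspect first."""
    if blocking is BlockingResult.COMPLETED:
        # Works without the gateway, so the gateway is the suspect.
        return "API gateway"
    if blocking is BlockingResult.BECAME_ASYNC:
        # The action ran, but the blocking wait was cut short.
        return "controller configuration"
    if not async_invoke_ok:
        # Not even a non-blocking invoke works.
        return "action timeout"
    return "undetermined"


print(likely_culprit(BlockingResult.BECAME_ASYNC, async_invoke_ok=True))
```

The final "undetermined" branch covers combinations the advice above does not address, such as a failed blocking invoke alongside a working asynchronous one.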

QWQyyy commented 6 months ago

Okay let's try it!

QWQyyy commented 6 months ago

It is true that I can successfully invoke my functions using wsk, but only functions that finish within 60 seconds. For functions that take longer to execute, the wsk client also returns an Oops--504 error. However, I can see in htop that the function code is still executing, and after it finishes, wsk activation shows that the function ended correctly. This confuses me.
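The observation that the activation record appears after the client has already given up suggests a possible stopgap: poll for the record instead of holding one request open. The Python sketch below shows only the polling loop. `fetch` is a hypothetical caller-supplied function standing in for whatever retrieves the activation (for example, a wrapper around `wsk activation get`), and the demo uses a fake in its place.

```python
import time
from typing import Callable, Optional


def wait_for_result(
    fetch: Callable[[], Optional[dict]],
    interval_s: float = 1.0,
    max_wait_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it returns a record or max_wait_s has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        record = fetch()
        if record is not None:
            return record
        if time.monotonic() >= deadline:
            raise TimeoutError("no activation record within max_wait_s")
        sleep(interval_s)


# Demo with a fake fetch: the record "appears" on the third poll.
attempts = {"count": 0}


def fake_fetch() -> Optional[dict]:
    attempts["count"] += 1
    return {"status": "success"} if attempts["count"] >= 3 else None


# sleep is stubbed out so the demo does not actually wait.
print(wait_for_result(fake_fetch, sleep=lambda _seconds: None))
```

This does not remove the 60-second limit on blocking requests. It only avoids depending on it while the underlying timeout is being tracked down.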