cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai
MIT License

Using serverless functions optimally within the local network (A hybrid approach using Docker and K8s) #6714

Open ganindu7 opened 1 year ago

ganindu7 commented 1 year ago

My actions before raising this issue

Background

I have a cvat docker instance running on a local server (this server does not belong to the local k8 cluster).

I have run the serverless compose file on the cvat repo, therefore at port 8070 I can see the nuclio dashboard.

Screenshot from 2023-08-21 10-11-55

Within cvat (port 8080), in the models tab, I can see the serverless function.

Screenshot from 2023-08-21 10-07-36

This is all working fine and I can do auto annotation without a problem for the function hosted on the docker server.

However, my cvat server does not have a GPU, so I can't run GPU serverless functions on that particular server. Luckily, in the same local network there is a small Kubernetes cluster with a handful of nodes, one of which happens to be a GPU node.

K8 cluster

Master node-¬
             |- cpu node 1
             |- cpu node 2
             |- gpu node 1
             |- ...

I installed nuclio on my local k8 cluster and I was able to successfully run serverless functions.

Screenshot from 2023-08-21 10-38-40

Then I verified that my serverless functions work properly, utilising the available GPU resources, and tested them with a test web app written to run on my PC (sending an image, receiving annotations, plotting and listing the returned annotations).

Screenshot from 2023-08-21 10-38-17

Note: My PC running the function testing web app is not in the k8 cluster but in the same local network. I exposed the service (the serverless function) as a NodePort so it can be accessed from outside the k8 cluster but within the same local network that my PC (and the cvat server) is in.
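
A test client like the one described above could be sketched roughly as follows. This is a minimal sketch, not the actual web app from this issue: it assumes the function accepts a JSON body with a base64-encoded `image` field (the convention used by the example functions in the CVAT repo), and the helper name `build_invoke_request` plus any host and port passed to it are placeholders.

```python
import base64
import json
import urllib.request


def build_invoke_request(host: str, port: int, image_bytes: bytes) -> urllib.request.Request:
    """Build a POST request for a Nuclio HTTP trigger exposed as a NodePort."""
    # Assumed payload shape: {"image": "<base64>"}; adjust it to your handler.
    payload = json.dumps({"image": base64.b64encode(image_bytes).decode("ascii")})
    return urllib.request.Request(
        url=f"http://{host}:{port}/",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(request, timeout=120)` sends it; the host would be any node address reachable from the local network and the port the NodePort assigned to the function.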

My aim, problem and possible solutions

Use the serverless function I created (in the k8 cluster) with the nuclio dashboard that is on the docker container (I doubt this is possible because the GPU operator is running on the k8 cluster and the serverless functions are acting as k8 services in k8 pods),

or use the function as a URL (from my internet research this is a paid feature and can only be used with cvat.ai; this doesn't suit me because I want to use everything locally, at least for now until I flesh things out),

or get CVAT to use my k8 nuclio dashboard (I don't understand this well so this might be illogical) instead of the dashboard from the serverless docker compose file (I think this might be the most plausible if it makes sense at all).

Can you please help me with this?

Thanks, Ganindu.

stone100010 commented 1 year ago

The first computer, Dell-8GPU, contains 8 × 3090 Ti GPUs. The second computer, Dell-CPU, has a 13th-gen i7. I'm looking for a way to run serverless cvat on Dell-CPU and SAM on Dell-8GPU. If you have any idea, please contact me. Thank you very much!

stone100010 commented 1 year ago

hi, open it: https://nuclio.io/docs/latest/reference/triggers/http/#attributes (Kubernetes only) Kubernetes ServiceType, used by the Kubernetes service to expose the trigger. The default ServiceType is ClusterIP, which means that by default the trigger won't be exposed outside of the cluster unless you configure a proper ingress or manually change the ServiceType to NodePort. Is it effective?

ganindu7 commented 1 year ago

Screenshot from 2023-09-20 18-19-13

I think these issues are helpful here: https://github.com/opencv/cvat/issues/2301 https://github.com/opencv/cvat/issues/6065 @bsekachev Can you please advise us on this?

once I modify these (CVAT_NUCLIO_HOST and CVAT_NUCLIO_PORT) env variables

import os

NUCLIO = {
    'SCHEME': os.getenv('CVAT_NUCLIO_SCHEME', 'http'),
    'HOST': os.getenv('CVAT_NUCLIO_HOST', 'aisrv.gnet.lan'),
    'PORT': int(os.getenv('CVAT_NUCLIO_PORT', 30936)),
    'DEFAULT_TIMEOUT': int(os.getenv('CVAT_NUCLIO_DEFAULT_TIMEOUT', 120)),
    'FUNCTION_NAMESPACE': os.getenv('CVAT_NUCLIO_FUNCTION_NAMESPACE', 'nuclio'),
    'INVOKE_METHOD': os.getenv('CVAT_NUCLIO_INVOKE_METHOD',
        default='dashboard' if 'KUBERNETES_SERVICE_HOST' in os.environ else 'direct'),
}

do I need to have these in a separate yaml? (e.g. the-other-compose-file.yaml)

  cvat_server:
    environment:
      CVAT_SERVERLESS: 1
    extra_hosts:
      - "host.docker.internal:host-gateway"

  cvat_worker_annotation:
    extra_hosts:
      - "host.docker.internal:host-gateway"

or declare in the envs

services:
  cvat_server:
    environment:
      CVAT_SERVERLESS: 1
      CVAT_NUCLIO_SCHEME: http  # Updated value
      CVAT_NUCLIO_HOST: aisrv.gnet.lan  # Updated value
      CVAT_NUCLIO_PORT: 30936  # Updated value
      KUBERNETES_SERVICE_HOST: true
    extra_hosts:
      - "host.docker.internal:host-gateway"

  cvat_worker_annotation:
    extra_hosts:
      - "host.docker.internal:host-gateway"

and then run it as

docker compose -f docker-compose.yml -f docker-compose.override.yml -f components/serverless/the-other-compose-file.yaml up --build -d

even after deploying like that I get no models :( I think there must be something I'm doing off the specification

cvat not registering models Screenshot from 2023-09-20 18-08-39

working functions Screenshot from 2023-09-20 18-16-45

k8 services and pods Screenshot from 2023-09-20 18-19-13
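
The `INVOKE_METHOD` default in the `NUCLIO` settings excerpt earlier in this comment depends only on whether `KUBERNETES_SERVICE_HOST` exists in the environment, not on its value. The sketch below restates that one expression as a standalone function so the behaviour can be checked in isolation; the function name `resolve_invoke_method` is made up for illustration and is not part of CVAT.

```python
import os
from typing import Mapping


def resolve_invoke_method(env: Mapping[str, str] = os.environ) -> str:
    """Mirror the INVOKE_METHOD default from the settings excerpt above."""
    explicit = env.get("CVAT_NUCLIO_INVOKE_METHOD")
    if explicit is not None:
        # An explicitly set variable always wins over the default.
        return explicit
    # Only the presence of the variable is checked; its value is ignored.
    return "dashboard" if "KUBERNETES_SERVICE_HOST" in env else "direct"
```

Under this reading, a compose entry such as `KUBERNETES_SERVICE_HOST: true` switches the default to `dashboard` for that one service only; a service without the variable still falls back to `direct`.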

ganindu7 commented 1 year ago

hi, open it: https://nuclio.io/docs/latest/reference/triggers/http/#attributes (Kubernetes only) Kubernetes ServiceType, used by the Kubernetes service to expose the trigger. The default ServiceType is ClusterIP, which means that by default the trigger won't be exposed outside of the cluster unless you configure a proper ingress or manually change the ServiceType to NodePort. Is it effective?

yes it is nodeport

does that mean I have to specify individual functions rather than the nuclio dashboard? (all this time I was putting my k8 nuclio dashboard URL and port for CVAT_NUCLIO_HOST and CVAT_NUCLIO_PORT)

This is my nuclio dashboard from the k8 cluster: Screenshot from 2023-09-20 18-16-45

Here are my pods and services:

Screenshot from 2023-09-20 18-19-13

just reiterating: cvat is running on a separate pc running docker! I have exec'd into the django docker container and made sure name resolution and ping for k8 services are working with nslookup and ping.
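
The same kind of check can be scripted from inside the container with the Python standard library. This is only a sketch with made-up helper names (`can_resolve`, `can_connect`); unlike `ping`, the second helper tests the TCP port itself, which is closer to what an HTTP call to a NodePort needs.

```python
import socket


def can_resolve(host: str) -> bool:
    """Return True if the host name resolves from this machine or container."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the setup in this thread one would call, for example, `can_resolve("aisrv.gnet.lan")` and `can_connect("aisrv.gnet.lan", 30936)` from both the server container and the annotation worker container, since each container resolves names on its own.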

ganindu7 commented 1 year ago

finally I was able to make it work! I'm not sure where exactly the problem was but here is what I did

nuclio-1.8.15.zip

my values yaml file values.yaml.txt

services:
  cvat_server:

    environment:
      CVAT_SERVERLESS: 1
      CVAT_NUCLIO_SCHEME: http  # Updated value
      CVAT_NUCLIO_HOST: aisrv.gnet.lan  # Updated value
      CVAT_NUCLIO_PORT: 30936 # Updated value
      KUBERNETES_SERVICE_HOST: true
      CVAT_NUCLIO_FUNCTION_NAMESPACE: nuclio

    volumes:
      - cvat_data:/home/django/data:rw

    extra_hosts:
      - "host.docker.internal:host-gateway"

  cvat_worker_annotation:
    extra_hosts:
      - "host.docker.internal:host-gateway"

my updates to cvat server.

I understand that because I am not using the docker dashboard I may not need to use that specific version but at this point I just wanted things to work as the cvat team may have tested with the shipped version.

(also remember to use docker-buildx)

ganindu7 commented 1 year ago

This came back again in release v2.7.6, despite me updating the nuctl / nuclio images to 1.11.24 in all places (dashboard, controller and nuctl).

this issue is very similar to #6582

I commented with my temporary hack fix

I had the same problem after updating cvat.

My difference is cvat running from a docker container and nuclio running from a kubernetes cluster.

my deployment looks like this

(TAOPY) g@nvdgx:~/Workspace/sandbox/nuclio-serverless-sandbox/ganindu-tests$ ./deploy.sh nozzlenet_1/
23.10.17 10:17:33.441                     nuctl (I) Project created {"Name": "cvat", "Namespace": "nuclio"}
Deploying . function...
23.10.17 10:17:33.561                     nuctl (I) Deploying function {"name": "test-nuctl-function-nozzlenet-1"}
23.10.17 10:17:33.566                     nuctl (I) Building {"builderKind": "docker", "versionInfo": "Label: 1.11.24, Git commit: f2a3900d23b92fd3639dc9cb765044ef53a4fb2b, OS: linux, Arch: amd64, Go version: go1.19.10", "name": "test-nuctl-function-nozzlenet-1"}
23.10.17 10:17:33.650                     nuctl (I) Staging files and preparing base images
23.10.17 10:17:33.678                     nuctl (I) Building processor image {"registryURL": "172.16.3.2:5000", "taggedImageName": "nozzlenet-nuclio-v1:latest"}
23.10.17 10:17:33.678     nuctl.platform.docker (I) Pulling image {"imageName": "quay.io/nuclio/handler-builder-python-onbuild:1.11.24-amd64"}
23.10.17 10:17:35.818            nuctl.platform (I) Building docker image {"image": "nozzlenet-nuclio-v1:latest"}
23.10.17 10:17:41.437            nuctl.platform (I) Pushing docker image into registry {"image": "nozzlenet-nuclio-v1:latest", "registry": "172.16.3.2:5000"}
23.10.17 10:17:41.437     nuctl.platform.docker (I) Pushing image {"from": "nozzlenet-nuclio-v1:latest", "to": "172.16.3.2:5000/nozzlenet-nuclio-v1:latest"}
23.10.17 10:17:42.850            nuctl.platform (I) Docker image was successfully built and pushed into docker registry {"image": "nozzlenet-nuclio-v1:latest"}
23.10.17 10:17:42.850                     nuctl (I) Build complete {"image": "nozzlenet-nuclio-v1:latest"}
23.10.17 10:17:50.882                     nuctl (I) Function deploy complete {"functionName": "test-nuctl-function-nozzlenet-1", "httpPort": 30555, "internalInvocationURLs": ["nuclio-test-nuctl-function-nozzlenet-1.nuclio.svc.cluster.local:8080"], "externalInvocationURLs": [":30555"]}
23.10.17 10:17:50.888    nuctl.platform.updater (I) Updating function {"name": "test-nuctl-function-nozzlenet-1"}
23.10.17 10:17:51.166    nuctl.platform.updater (I) Function updated {"functionName": "test-nuctl-function-nozzlenet-1"}
 NAMESPACE | NAME                            | PROJECT | STATE | REPLICAS | NODE PORT 
 nuclio    | test-nuctl-function-nozzlenet-1 | cvat    | ready | 1/1      | 30555     

as you can see mine uses the nodeport 30555

my `docker-compose-override.yaml` used to look like this

services:
  cvat_server:

    environment:
      CVAT_SERVERLESS: 1
      CVAT_NUCLIO_SCHEME: "http"  # Updated value
      CVAT_NUCLIO_HOST: "aisrv.gnet.lan"  # Updated value
      CVAT_NUCLIO_PORT: 30936 # Updated value
      KUBERNETES_SERVICE_HOST: "true"
      CVAT_NUCLIO_FUNCTION_NAMESPACE: "nuclio"

    volumes:
      - cvat_data:/home/django/data:rw

    extra_hosts:
      - "host.docker.internal:host-gateway"

  cvat_worker_annotation:
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  cvat_data:
    driver_opts:
      type: none
      device: /mnt/cvat_data
      o: bind

and the error I was getting was (I will put only a part of it for brevity)

2023-10-17 10:45:54,639 DEBG 'rqworker-annotation-0' stderr output:
[2023-10-17 10:45:54,639] ERROR rq.worker: Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/opt/venv/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/opt/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 415, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/opt/venv/lib/python3.10/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/opt/venv/lib/python3.10/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/opt/venv/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f20c8f38b80>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/opt/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/opt/venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='host.docker.internal', port=30555): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20c8f38b80>: Failed to establish a new connection: [Errno -2] Name or service not known'))

the interesting bit was where it seems to think the nodeport service was hosted on the docker host at port 30555

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='host.docker.internal', port=30555): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20c8f38b80>: Failed to establish a new connection: [Errno -2] Name or service not known'))

so in a very dodgy way I modified the docker-compose.override.yaml file to (which I know is wrong)

  cvat_worker_annotation:
    extra_hosts:
      - "host.docker.internal:172.16.1.19"

172.16.1.19 is the ip address of my k8 control plane

and this partially fixed the issue: now I can automatically annotate jobs/projects, which I was not able to do previously due to the error above (but it still does not work for individual images; it times out with error 500)

I'm not a docker power user; I just think the fix worked only because of some other potential error I made somewhere. Can you please help me point out where the original problem could be?

Thanks

can you please suggest a better solution (better than me misconfiguring the docker compose)? (I think my scenario is a combination of two wrongs now acting as a sort of solution, which may only work in a trusted local network setup like mine)
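
One way to read the traceback above is through a simplified model of the two invoke methods named in the `NUCLIO` settings. The sketch below is an assumption-laden illustration, not CVAT's actual code: the function name `invocation_target` and its parameters are hypothetical, the `direct` branch is inferred from the `host.docker.internal:30555` target in the traceback, and the `dashboard` branch assumes the Nuclio dashboard's `/api/function_invocations` endpoint with the function selected by header.

```python
from typing import Dict, Tuple


def invocation_target(
    method: str,
    dashboard_host: str,
    dashboard_port: int,
    function_name: str,
    function_port: int,
    in_docker: bool = True,
) -> Tuple[str, Dict[str, str]]:
    """Return (url, extra_headers) for a simplified model of the two invoke methods."""
    if method == "dashboard":
        # Assumed: the dashboard proxies the call and picks the function by header.
        url = f"http://{dashboard_host}:{dashboard_port}/api/function_invocations"
        return url, {"x-nuclio-function-name": function_name}
    if method == "direct":
        # Assumed: the function's own port is contacted on the Docker host.
        host = "host.docker.internal" if in_docker else "localhost"
        return f"http://{host}:{function_port}", {}
    raise ValueError(f"unknown invoke method: {method!r}")
```

With `method="direct"` and port 30555 the model yields `http://host.docker.internal:30555`, the same target as in the traceback, which would explain why remapping `host.docker.internal` to 172.16.1.19 made the NodePort reachable. Note also that the compose file above sets `KUBERNETES_SERVICE_HOST` only on `cvat_server`, while the traceback comes from `rqworker-annotation-0`; if the worker still defaults to `direct`, setting the same variable (or `CVAT_NUCLIO_INVOKE_METHOD`) on `cvat_worker_annotation` may be the cleaner fix. This is a possible reading, not something confirmed in the thread.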

kevle1 commented 3 months ago

Setting KUBERNETES_SERVICE_HOST will result in the INVOKE_METHOD for Nuclio defaulting to 'dashboard' instead of 'direct', which might have been the reason why the issue was resolved.

This may interest you: https://github.com/cvat-ai/cvat/issues/6797#issuecomment-2272616912

I was encountering similar issues.