HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Syncing External Storage Leads to Gateway Timeout #5890

Open haiminh2001 opened 6 months ago

haiminh2001 commented 6 months ago

Describe the bug
I deployed Label Studio on our internal K8s cluster using the official Helm chart (version 14.0.10). When I sync a data storage (a MinIO cluster on the same K8s cluster) that contains about 1,000 JSON files representing 1,000 tasks (the data is a URI into the MinIO cluster), after a while I get a Gateway Timeout error and, occasionally, a "time in progress" error. The sync request is not moved to "Failed" status but gets stuck in "Queued" status, and the project is completely frozen: I cannot do anything while the sync is stuck.

To Reproduce
Steps to reproduce the behavior:

  1. Go to Project Settings / Cloud Storage
  2. Click on "Sync Storage"
  3. A Gateway Timeout or "time in progress" error appears
  4. The sync task ends up Failed or stuck in Queued

Expected behavior
The sync is executed in the background; there is no need to return the sync task's result immediately.


Additional context
This question may be out of scope, but I am having performance issues with Label Studio: the UI is quite laggy, and syncing data consistently gives me errors like the one in this issue. How can I scale up my Label Studio on K8s, and what should I scale? Hardware resources are not really my concern.

makseq commented 6 months ago

What do you have in these JSON files? Are they big?

Label Studio Community Edition doesn't have background workers; all background processes run on uWSGI web workers, so their execution time is limited to 90 seconds.

haiminh2001 commented 6 months ago

> What do you have in these JSON files? Are they big?
>
> Label Studio Community Edition doesn't have background workers; all background processes run on uWSGI web workers, so their execution time is limited to 90 seconds.

Hi @makseq, thank you for your fast response. The JSON files are small. They only contain the URI links of the images and the classification labels.
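
For reference, a minimal sketch of what one of these task files might look like (the bucket path and label value are illustrative, not taken from this setup; Label Studio's task format nests the payload under "data"):

{
  "data": {
    "image": "s3://my-bucket/images/0001.jpg"
  },
  "annotations": [
    {
      "result": [
        {
          "from_name": "label",
          "to_name": "image",
          "type": "choices",
          "value": { "choices": ["cat"] }
        }
      ]
    }
  ]
}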

makseq commented 6 months ago

You can try setting

UWSGI_WORKER_HARAKIRI=0

to avoid the timeout.
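
Since the deployment above uses the official Helm chart, one way to pass this variable is through the chart's values, e.g. (a sketch assuming the chart exposes a global.extraEnvironmentVars map; check the values file of your chart version for the exact key):

global:
  extraEnvironmentVars:
    UWSGI_WORKER_HARAKIRI: "0"  # assumption: forwarded into the app container's environment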

haiminh2001 commented 6 months ago

> harakiri: A feature of uWSGI that aborts workers that are serving requests for an excessively long time. Configured using the harakiri family of options. Every request that will take longer than the seconds specified in the harakiri timeout will be dropped and the corresponding worker recycled.

(from the uWSGI documentation)

Hmm, so this means it will disable the timeout not only for syncing but for every other request, won't it? If so, it is a little bit dangerous. As I mentioned, I have the resources to scale up (CPU and memory); can scaling up be the solution?

Update: after setting UWSGI_WORKER_HARAKIRI=0 in the environment variables, I still get the Gateway Timeout. Is it supposed to be set as an environment variable?

haiminh2001 commented 6 months ago

Traceback (most recent call last):
  File "/label-studio/label_studio/./io_storages/base_models.py", line 456, in sync
    import_sync_background(self.__class__, self.id)
  File "/label-studio/label_studio/./io_storages/base_models.py", line 485, in import_sync_background
    storage.scan_and_create_links()
  File "/label-studio/label_studio/./io_storages/s3/models.py", line 148, in scan_and_create_links
    return self._scan_and_create_links(S3ImportStorageLink)
  File "/label-studio/label_studio/./io_storages/base_models.py", line 364, in _scan_and_create_links
    self.info_set_in_progress()
  File "/label-studio/label_studio/./io_storages/base_models.py", line 85, in info_set_in_progress
    raise ValueError(f'Storage status ({self.status}) must be QUEUED to move it IN_PROGRESS')
ValueError: Storage status (initialized) must be QUEUED to move it IN_PROGRESS

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/django/utils/decorators.py", line 43, in _wrapper
    return bound_method(*args, **kwargs)
  File "/label-studio/label_studio/./io_storages/api.py", line 110, in post
    storage.sync()
  File "/label-studio/label_studio/./io_storages/base_models.py", line 458, in sync
    storage_background_failure(self)
  File "/label-studio/label_studio/./io_storages/base_models.py", line 515, in storage_background_failure
    storage.info_set_failed()
  File "/label-studio/label_studio/./io_storages/base_models.py", line 117, in info_set_failed
    self.meta['duration'] = (time_failure - self.time_in_progress).total_seconds()
  File "/label-studio/label_studio/./io_storages/base_models.py", line 96, in time_in_progress
    return datetime.fromisoformat(self.meta['time_in_progress'])
KeyError: 'time_in_progress'

I just ran into the "time in progress" error while syncing again; here is the stack trace, hope it helps.

makseq commented 6 months ago

> Hmm, so this means it will disable the timeout not only for syncing but for every other request, won't it? If so, it is a little bit dangerous. As I mentioned, I have the resources to scale up (CPU and memory); can scaling up be the solution?

Correct, it may eat all your resources.

> Update: after setting UWSGI_WORKER_HARAKIRI=0 in the environment variables, I still get the Gateway Timeout. Is it supposed to be set as an environment variable?

Probably you have some load balancer in front, like nginx, and it is throwing the timeouts.
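
On K8s, that balancer is often the ingress controller. If it is ingress-nginx, its per-request timeouts can be raised with annotations on the Label Studio Ingress, a sketch (these are standard ingress-nginx annotations; the values are in seconds and illustrative):

metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"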

sajarin commented 5 months ago

@haiminh2001 were you able to get the syncing to external storage working without the gateway timeouts?

haiminh2001 commented 5 months ago

> @haiminh2001 were you able to get the syncing to external storage working without the gateway timeouts?

No, I have not. Syncing storage is so terribly slow that my approach is to keep each folder in MinIO storage to no more than 1,000 tasks.

WillieMaddox commented 1 month ago

I ran into this problem about a year ago when my project grew to over a few thousand tasks. Here is how I increased the timeout.

In deploy/uwsgi.ini, replace this line:

http-timeout = 300

with this:

if-env = UWSGI_HTTP_TIMEOUT
http-timeout = $(UWSGI_HTTP_TIMEOUT)
endif =
if-not-env = UWSGI_HTTP_TIMEOUT
http-timeout = 300
endif =

In deploy/default.conf, set proxy_read_timeout to whatever you want the timeout to be (for this example I'll use 180):

        proxy_read_timeout 180;

Add the following lines to the docker-compose.yml:

services:

  nginx:
    volumes:
      - ./deploy/default.conf:/etc/nginx/nginx.conf

  app:
    environment:
      - UWSGI_HTTP_TIMEOUT=180
      - UWSGI_WORKER_HARAKIRI=181
    volumes:
      - ./deploy/uwsgi.ini:/label-studio/deploy/uwsgi.ini

Set UWSGI_HTTP_TIMEOUT equal to proxy_read_timeout in deploy/default.conf, and set UWSGI_WORKER_HARAKIRI to that number plus 1. The changes should take effect the next time you run docker compose up.
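
A quick sanity check after restarting is to confirm the app container actually picked up the variables (using the service name from the compose fragment above):

docker compose exec app printenv UWSGI_HTTP_TIMEOUT UWSGI_WORKER_HARAKIRI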

haiminh2001 commented 1 month ago

@WillieMaddox thank you so much. I'll give it a try.

ieddu commented 1 month ago

Thank you very much @WillieMaddox, it worked well for me.