HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Syncing new items from S3 storage seems to be slow #4901

Open SheldonWBM opened 11 months ago

SheldonWBM commented 11 months ago

Is your feature request related to a problem? Please describe.
Syncing seems to be a slow process.

Describe the solution you'd like
When new items are added to the "Source Cloud Storage", the user needs to press "Sync Storage". This re-syncs all items starting from zero, not only the newly added ones. There should be an option to import from cloud storage only the tasks that do not already exist. Perhaps a caching method, or a sync of only new items based on timestamp or other metadata. The user could then choose between the traditional sync and the fast sync.

Additional context
Currently have ~22,000 items in the source being synced.

hogepodge commented 11 months ago

@SheldonWBM we're looking into this internally. I believe that the open source version doesn't have a queue-based approach to syncing, so it does indeed do a complete reload of the source. I've created a product ticket to request this improvement.

noahlibby17 commented 8 months ago

Hello @hogepodge, I wanted to follow up here to see if there has been any progress on this. We sync a growing base of tasks (on Enterprise and Community versions of Label Studio) from S3, and the sync now takes over an hour and gets longer every day. Is there an ETA for adding some sort of queue-based sync? Do you have any workarounds for this in the meantime? Thank you!

SheldonWBM commented 3 months ago

Hi @noahlibby17, I decided to try the sync command today and immediately regretted it. I do have a workaround (which I should have used).

  1. Upload all your assets to S3.
  2. Create a JSON file that represents the tasks that will be created for the assets uploaded to S3. Note: This format is based on what label-studio-converter produces when converting YOLO format to JSON for importing.
    [
    {
        "data": {
            "image": "s3://my_path/filename.jpg"
        },
        "annotations": []
    },
    {
        "data": {
            "image": "s3://my_path/filename2.jpg"
        },
        "annotations": []
    }
    ]
  3. Import in the frontend as if uploading a local image.
  4. Tasks will appear for the files you uploaded, and database entries for the tasks will be created.
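
Writing the JSON file from step 2 by hand is impractical for thousands of files, so it can be generated. The following is only a minimal sketch, assuming you already have the list of `s3://` URIs for the uploaded assets; the function names `build_tasks` and `write_tasks_file` and the output file name `tasks.json` are made up for illustration and are not part of Label Studio.

```python
import json


def build_tasks(image_uris):
    """Return one task dict per image URI, in the same shape as the sample above."""
    return [{"data": {"image": uri}, "annotations": []} for uri in image_uris]


def write_tasks_file(image_uris, path):
    """Write the task list to a JSON file that can be imported in the frontend."""
    tasks = build_tasks(image_uris)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tasks, f, indent=4)
    return tasks


if __name__ == "__main__":
    # Hypothetical URIs; replace with the assets you actually uploaded.
    uris = [
        "s3://my_path/filename.jpg",
        "s3://my_path/filename2.jpg",
    ]
    write_tasks_file(uris, "tasks.json")
```

To import only newly added assets, pass just the URIs that are not already in the project, which avoids re-creating existing tasks.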

The only issue I have encountered, which might make me revert the database to its state from earlier today, is that it might create duplicate tasks when you "sync" with S3 cloud storage. If it creates duplicate tasks, you can revert the database or delete the duplicate task entries in the database.
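
If duplicates do show up, one way to locate them is to group tasks by their image URI. This is a rough sketch, assuming you have the project's tasks as a list of dicts that each carry an `id` and a `data` object with an `image` field (for example from a JSON export); the function name `find_duplicate_tasks` is hypothetical, and keeping the lowest id is an arbitrary choice, so review the result before deleting anything.

```python
from collections import defaultdict


def find_duplicate_tasks(tasks, key="image"):
    """Map each duplicated URI to the ids of its extra tasks (lowest id is kept)."""
    ids_by_uri = defaultdict(list)
    for task in tasks:
        ids_by_uri[task["data"][key]].append(task["id"])
    # Keep the lowest id per URI and report the rest as duplicates.
    return {
        uri: sorted(ids)[1:]
        for uri, ids in ids_by_uri.items()
        if len(ids) > 1
    }


if __name__ == "__main__":
    # Hypothetical exported tasks; ids 1 and 3 point at the same image.
    exported = [
        {"id": 1, "data": {"image": "s3://my_path/filename.jpg"}},
        {"id": 2, "data": {"image": "s3://my_path/filename2.jpg"}},
        {"id": 3, "data": {"image": "s3://my_path/filename.jpg"}},
    ]
    print(find_duplicate_tasks(exported))
```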

Note: The preferred database to use with label-studio, for large projects, is PostgreSQL.

noahlibby17 commented 3 months ago

Thank you so much @SheldonWBM!