cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai
MIT License

Uploading from S3, manifest.jsonl file needs to be in the same location as the image data #8077

Open astringfield opened 3 months ago

astringfield commented 3 months ago

A summary of my use-case:

  1. I'm trying to upload data from S3 to a local CVAT instance running in Docker
  2. I'm using the CVAT CLI
  3. I've created and verified a manifest.jsonl file

Question

In both of the cases shown below, I specify the S3 prefix path to where the images are stored. However, the command only works if the manifest is stored in the same S3 location as the image data. If the manifest is elsewhere in S3, the upload fails. I've included one successful and one unsuccessful upload below to illustrate the problem.

Is this behaviour expected, or is there a way to upload to CVAT from S3 with the manifest file stored separately from the images? I really appreciate any help you can provide.

manifest.jsonl in the same S3 location as images

When I run the command with the manifest.jsonl file stored in the same location in S3 as the images, the upload is successful:

# Command
cvat-cli --auth <cvat_username>:<cvat_password> \
    --server-host http://localhost \
    --server-port 8080 \
    --organization <org_name> \
    create "<task_name>" --use_cache \
    --project_id <proj_id> \
    --annotation_path "/path/to/local/annotations.json" \
    --annotation_format "COCO 1.0" \
    --cloud_storage_id <cloud_id> \
    --filename_pattern "path/to/images/on/s3/*.png" \
    share path/to/images/on/s3/manifest.jsonl

# Output (success)
[2024-06-25 15:46:33] INFO: Created task ID: 227 NAME: <task_name>
[2024-06-25 15:46:33] INFO: Awaiting for task 227 creation...
[2024-06-25 15:46:35] INFO: Task 227 creation status: Finished (message=)
Uploading data: 100%|██████████| 205M/205M [00:01<00:00, 158MB/s]
[2024-06-25 15:46:52] INFO: Annotation file '/path/to/local/annotations.json' for task #227 uploaded
Created task id 227

manifest.jsonl in a different S3 location from images

However, when I run the command with the manifest.jsonl file stored in a different location in S3 from the images, the upload results in an error:

# Command
cvat-cli --auth <cvat_username>:<cvat_password> \
    --server-host http://localhost \
    --server-port 8080 \
    --organization <org_name> \
    create "<task_name>" --use_cache \
    --project_id <proj_id> \
    --annotation_path "/path/to/local/annotations.json" \
    --annotation_format "COCO 1.0" \
    --cloud_storage_id <cloud_id> \
    --filename_pattern "path/to/images/on/s3/*.png" \
    share a/different/location/on/s3/manifest.jsonl

# Output (error)
[2024-06-25 15:44:54] INFO: Created task ID: 225 NAME: <task_name>
[2024-06-25 15:44:54] INFO: Awaiting for task 225 creation...
[2024-06-25 15:44:56] INFO: Task 225 creation status: Failed (message=Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/rq/worker.py", line 1431, in perform_job
    rv = job.perform()
  File "/opt/venv/lib/python3.10/site-packages/rq/job.py", line 1280, in perform
    self._result = self._execute()
  File "/opt/venv/lib/python3.10/site-packages/rq/job.py", line 1317, in _execute
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/django/cvat/apps/engine/task.py", line 646, in _create_thread
    media, task_mode = _validate_data(media, manifest_files)
  File "/home/django/cvat/apps/engine/task.py", line 260, in _validate_data
    raise ValueError('No media data found')
ValueError: No media data found)
[2024-06-25 15:44:56] CRITICAL: Status Code: 200
Reason: OK
HTTP response headers: HTTPHeaderDict({'Allow': 'GET, HEAD, OPTIONS', 'Content-Length': '846', 'Content-Type': 'application/vnd.cvat+json', 'Cross-Origin-Opener-Policy': 'same-origin', 'Date': 'Tue, 25 Jun 2024 05:44:56 GMT', 'Referrer-Policy': 'same-origin, strict-origin-when-cross-origin', 'Server': 'nginx', 'Vary': 'Accept, Accept-Encoding, Origin, Cookie', 'X-Content-Type-Options': 'nosniff, nosniff', 'X-Frame-Options': 'DENY, deny', 'X-Request-Id': 'c8ebf596-82bd-4bee-8f45-3583a247db8e'})
HTTP response body: b'{"state":"Failed","message":"Traceback (most recent call last):\\n  File \\"/opt/venv/lib/python3.10/site-packages/rq/worker.py\\", line 1431, in perform_job\\n    rv = job.perform()\\n  File \\"/opt/venv/lib/python3.10/site-packages/rq/job.py\\", line 1280, in perform\\n    self._result = self._execute()\\n  File \\"/opt/venv/lib/python3.10/site-packages/rq/job.py\\", line 1317, in _execute\\n    result = self.func(*self.args, **self.kwargs)\\n  File \\"/usr/lib/python3.10/contextlib.py\\", line 79, in inner\\n    return func(*args, **kwds)\\n  File \\"/home/django/cvat/apps/engine/task.py\\", line 646, in _create_thread\\n    media, task_mode = _validate_data(media, manifest_files)\\n  File \\"/home/django/cvat/apps/engine/task.py\\", line 260, in _validate_data\\n    raise ValueError(\'No media data found\')\\nValueError: No media data found","progress":0.0}'
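
A minimal sketch of one possible explanation for the two outcomes above. It assumes, without confirmation from the CVAT source, that the file names listed in the manifest are resolved relative to the manifest's own S3 prefix before --filename_pattern is applied. The entry names and the helper functions resolve_manifest_entries and match_pattern are hypothetical and exist only for illustration; they are not part of CVAT or cvat-cli.

```python
import fnmatch
import posixpath


def resolve_manifest_entries(manifest_key, entry_names):
    """Join each manifest entry name onto the manifest's own S3 prefix.

    ASSUMPTION (not verified against CVAT): entry names are relative to
    the directory that holds the manifest file.
    """
    manifest_dir = posixpath.dirname(manifest_key)
    return [posixpath.join(manifest_dir, name) for name in entry_names]


def match_pattern(keys, filename_pattern):
    """Keep only the keys that match the glob-style filename pattern."""
    return [key for key in keys if fnmatch.fnmatchcase(key, filename_pattern)]


# Hypothetical manifest entries, not taken from the issue.
entries = ["img_001.png", "img_002.png"]
pattern = "path/to/images/on/s3/*.png"

# Manifest stored next to the images: resolved keys match the pattern.
same = match_pattern(
    resolve_manifest_entries("path/to/images/on/s3/manifest.jsonl", entries),
    pattern,
)

# Manifest stored elsewhere: resolved keys point at the wrong prefix.
different = match_pattern(
    resolve_manifest_entries("a/different/location/on/s3/manifest.jsonl", entries),
    pattern,
)

print(len(same))       # 2
print(len(different))  # 0
```

Under that assumption, the second command would leave zero files after filtering, which would be consistent with the "No media data found" error. Whether CVAT actually behaves this way is exactly what the question above asks.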