Dataset Upload Error (Using the API)

cvat-ai / cvat

Dataset Upload Error (Using the API) #6525

Closed ganindu7 closed 11 months ago

ganindu7 commented 12 months ago

My actions before raising this issue

[x] Read/searched the docs
[x] Searched past issues

I am trying to upload a dataset using the Python API.

Python version: 3.10.11; CVAT version: 2.5

My directory structure looks like this


├── training_1
│   ├── image_2 (n images)
│   └── label_2 (n labels) 
├── training_1.zip (a zip archive of the training_1 directory)
├── test.ipynb (notebook running the code I have listed below)

First I read the documentation at https://opencv.github.io/cvat/docs/api_sdk/sdk/reference/apis/projects-api/ to write the following code

from pprint import pprint
from cvat_sdk.api_client import Configuration, ApiClient, models, apis, exceptions 
from cvat_sdk.api_client.models import *

configuration = Configuration(
    host = "http://cvat.gnet.local:8080",
    username = "test",       # credentials used to login to cvat web
    password = "test_pw"
)

with ApiClient(configuration) as api_client:
    id = 1
    filename = "./training_1.zip"
    format = "KITTI 1.0"
    location = "local"
    use_default_location = True
    dataset_write_request = DatasetWriteRequest(None)

    try:
        (data, response) = api_client.projects_api.create_dataset(
            id,
            filename=filename,
            format=format,
            location=location,
            use_default_location=use_default_location,
            dataset_write_request=dataset_write_request
        )

        pprint(data)
    except exceptions.ApiException as e:
        print("Exception when calling ProjectsApi.create_dataset: %s\n" % e)

I got back:

Exception when calling ProjectsApi.create_dataset: Status Code: 404
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Allow': 'GET, POST, HEAD, OPTIONS', 'Content-Length': '23', 'Content-Type': 'application/vnd.cvat+json', 'Cross-Origin-Opener-Policy': 'same-origin', 'Date': 'Wed, 19 Jul 2023 15:20:09 GMT', 'Referrer-Policy': 'same-origin', 'Server': 'nginx/1.18.0 (Ubuntu)', 'Vary': 'Accept, Origin, Cookie', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-Request-Id': 'b8de57ef-1bb4-4774-beb1-38cade4449ef'})
HTTP response body: {"detail":"Not found."}

Then I also tried plain REST API calls.

First I created a session object:

import requests
import json
session  = requests.Session() # create a session object
host_url = "http://cvat.gnet.local:8080"

Then I tried to log in:

base_url = f"{host_url}/auth/login"
login_data = {"username": "test",  "email": "test@testmail.com" ,"password": "test_pw"}
headers = {"accept": "application/vnd.cvat+json", "Content-Type": "application/json", } 
response = session.post(base_url, data=json.dumps(login_data), headers=headers)

# Print status code and response text
print(f'Status code: {response.status_code}')
print(f'Response text: {response.text}')
print(f'CSRF token: {session.cookies["csrftoken"]}')

# Keep the auth token for the Authorization header used in the upload request
token = response.json()["key"]

Status code: 200
Response text: {"key":"59175d8ceea31e384b592a77f435e4361b41704c"}
CSRF token: jpxPp8oDAzrLpte1EkeAbxBuNut9RGNW

Then I tried to upload the dataset again


import requests
import json

# Define the URL, headers and data
url = f'{host_url}/projects/1/dataset/?filename=%22test%22&format=KITTI%201.0&location=local&use_default_location=true'
headers = {
    'accept': '*/*',
    'Content-Type': 'application/json',
    'X-CSRFTOKEN': f'{session.cookies["csrftoken"]}',
    "Authorization": f"Token {token}"

}
data = {
    'dataset_file': 'training_1.zip',
}

# Make the POST request
response = session.post(url, headers=headers, data=json.dumps(data))

# Print status code and response text
print(f'Status code: {response.status_code}')
print(f'Response text: {response.text}')
# Print the CSRF token
print(f'X-CSRFTOKEN: {session.cookies["csrftoken"]}')

# Print the full request
print(f'Request method: {response.request.method}')
print(f'Request URL: {response.request.url}')
print(f'Request headers: {response.request.headers}')
print(f'Request body: {response.request.body}')

I got another error

Status code: 404
Response text: {"detail":"Not found."}
X-CSRFTOKEN: Q7e8Ec4QwZFPsTNypesO10lvQMYxi8vG
Request method: POST
Request URL: http://cvat.qnap.gnet:8080/api/projects/1/dataset/?filename=%22test%22&format=KITTI%201.0&location=local&use_default_location=true
Request headers: {'User-Agent': 'python-requests/2.28.2', 'Accept-Encoding': 'gzip, deflate', 'accept': '*/*', 'Connection': 'keep-alive', 'Content-Type': 'application/json', 'X-CSRFTOKEN': 'Q7e8Ec4QwZFPsTNypesO10lvQMYxi8vG', 'Authorization': 'Token 59175d8ceea31e384b592a77f435e4361b41704c', 'Cookie': 'csrftoken=Q7e8Ec4QwZFPsTNypesO10lvQMYxi8vG; sessionid=56nf7uz9q4qtfgdnhgl7cr4n0q5k4v1k', 'Content-Length': '34'}
Request body: {"dataset_file": "training_1.zip"}

Expected Behaviour

Dataset being uploaded

Am I doing anything wrong here? I also tried creating a dataset by using https://opencv.github.io/cvat/docs/api_sdk/sdk/reference/apis/projects-api/#example but it failed too (maybe the example is outdated)

Cheers, Ganindu.

zhiltsov-max commented 12 months ago

Hi! CVAT uses a special protocol for data uploading, which includes several requests. If you don't have special requirements to how the requests are sent, please try the high-level API instead. Here you can find an example from tests.

from cvat_sdk import make_client, models

with make_client(...) as client:
    project = client.projects.create_from_dataset(
        spec=models.ProjectWriteRequest(name="project with data"),
        dataset_path="path/to/archive.zip",
        dataset_format="COCO 1.0"
    )

If you need to control requests, please check this example.

JeKaQM commented 12 months ago

This might be an issue with the JSON file format; you could try a different file type that's supported by CVAT: https://opencv.github.io/cvat/docs/manual/advanced/formats/

ganindu7 commented 12 months ago

Thanks for getting back to me,

I will try the high level API as you suggested,

I have two questions.

1) Can I use KITTI 1.0 as the dataset_format?


├── training_1
│   ├── image_2 (n images)
│   └── label_2 (n labels) 
├── training_1.zip (a zip archive of the training_1 directory)
├── test.ipynb (notebook running the code I have listed below)

2) As I'm using the KITTI 1.0 format (and as I've shown above) dataset_path can I use the local path to the .zip archive?

Cheers, Ganindu.

zhiltsov-max commented 12 months ago

Can I use "KITTI 1.0" as the dataset_format?

Sure, please check the uploaded file uses the file layout described here.

As I'm using the KITTI 1.0 format (and as I've shown above) dataset_path can I use the local path to the .zip archive?

Yes, this is the default option in high-level SDK.

ganindu7 commented 11 months ago

Here is my adapted code!

Add a progress bar (install ipywidgets if using a notebook)

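from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook
from cvat_sdk.core.helpers import TqdmProgressReporter

def make_pbar(file, **kwargs):
    # Plain-terminal progress bar
    return TqdmProgressReporter(tqdm(file=file, mininterval=0, **kwargs))

def make_notebook_pbar(file, **kwargs):
    # The body was cut off in the original post; presumably it mirrors
    # make_pbar with the notebook flavour of tqdm
    return TqdmProgressReporter(tqdm_notebook(file=file, mininterval=0, **kwargs))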

upload dataset code


import io
from cvat_sdk import make_client, models

kitti_dataset_path = "training_3.zip"

pbar_out = io.StringIO()
pbar = make_notebook_pbar(file=pbar_out)

with make_client(
    host="http://cvat.lol", # cvat server location
    port='8080',
    credentials=("username", "password") #, f"Token {token}"),  # is there a way to do token based authentication here?
) as client:

    # projects = client.projects
    new_project = client.projects.create_from_dataset(
                spec = models.ProjectWriteRequest(name="fancy_project_name"),
                dataset_path = kitti_dataset_path,
                dataset_format = 'KITTI 1.0',
                pbar = pbar,
            )

This code (almost) works. A project gets created on the server, the progress bar goes up to 100%, and the dataset gets uploaded. However, an error is thrown at the end.

# traceback from my notebook cell 
     15     # projects = client.projects
---> 16     new_project = client.projects.create_from_dataset(
     17                 spec = models.ProjectWriteRequest(name="fancy_project_name"),
     18                 dataset_path = kitti_dataset_path,
     19                 dataset_format = 'KITTI 1.0',
     20                 pbar = pbar,
     21             )

# traceback from File {my python}/site-packages/cvat_sdk/core/proxies/projects.py:174, in ProjectsRepo.create_from_dataset(self, spec, dataset_path, dataset_format, status_check_period, pbar)

    171 self._client.logger.info("Created project ID: %s NAME: %s", project.id, project.name)
    173 if dataset_path:
...
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Am I doing something wrong?

zhiltsov-max commented 11 months ago

The code seems correct, could you please include the full traceback?

ganindu7 commented 11 months ago

Thanks a lot for getting back to me!!

This is from running a standalone Python script (please ignore the redundant imports). I also noticed that the progress bar is not visible when running the standalone script.

import requests
import json
from tabulate import tabulate

from typing import Tuple

import io
import textwrap
from pathlib import Path
from cvat_sdk import make_client, Client, models
from cvat_sdk.api_client import Configuration
from cvat_sdk.api_client import exceptions
from cvat_sdk.core.proxies.projects import Project
from cvat_sdk.core.helpers import TqdmProgressReporter

from util import make_pbar

from tqdm import tqdm

def make_pbar(file, **kwargs):
    return TqdmProgressReporter(tqdm(file=file, mininterval=0, **kwargs))

kitti_dataset_path = "training_6.zip"

pbar_out = io.StringIO()
pbar = make_pbar(file=pbar_out)

with make_client(
    host="http://cvat.app.gnet",
    port='8080',
    credentials=("username", "password") #, f"Token {token}"),
) as client:

    # projects = client.projects
    new_project = client.projects.create_from_dataset(
                spec = models.ProjectWriteRequest(name="nozzlenet-data-7"),
                dataset_path = kitti_dataset_path,
                dataset_format = 'KITTI 1.0',
                pbar = pbar,
            )

Traceback

Traceback (most recent call last):
  File "/mnt/qnap_ganindu/ubuntu_backup/master_dataset/master_dataset/data_manager.py", line 35, in <module>
    new_project = client.projects.create_from_dataset(
  File "/home/g/.pyenv/versions/TAO310/lib/python3.10/site-packages/cvat_sdk/core/proxies/projects.py", line 174, in create_from_dataset
    project.import_dataset(
  File "/home/g/.pyenv/versions/TAO310/lib/python3.10/site-packages/cvat_sdk/core/proxies/projects.py", line 57, in import_dataset
    DatasetUploader(self._client).upload_file_and_wait(
  File "/home/g/.pyenv/versions/TAO310/lib/python3.10/site-packages/cvat_sdk/core/uploading.py", line 317, in upload_file_and_wait
    rq_id = json.loads(response.data).get("rq_id")
  File "/home/g/.pyenv/versions/3.10.11/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/home/g/.pyenv/versions/3.10.11/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/g/.pyenv/versions/3.10.11/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

zhiltsov-max commented 11 months ago

Please make sure:

the project is available in UI after it's created in SDK
the server version is >=2.4.6 (changes are from this PR)
SDK and server versions match

On mismatching versions, use the SDK matching the server version.

is there a way to do token based authentication here?

Here's how to login with a token. ApiClient instance for a Client is available at client.api_client.

I also noticed that the progress-bar is not visible when running the standalone script

The output is redirected in pbar_out = io.StringIO() in the code sample.
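To illustrate why nothing shows up: text written to an io.StringIO buffer stays in memory until it is read back explicitly. A minimal stdlib-only sketch; report_progress is a hypothetical stand-in for a progress bar, not part of cvat_sdk or tqdm:

```python
import io
import sys


def report_progress(done, total, file=sys.stderr):
    """Hypothetical stand-in for a progress bar: writes one status line to `file`."""
    percent = 100 * done // total
    file.write(f"{percent}% ({done}/{total})\n")


# Writing to an in-memory buffer: nothing appears in the terminal.
buffer = io.StringIO()
for step in range(1, 4):
    report_progress(step, 3, file=buffer)

# The text is still available, but only if it is read back explicitly.
captured = buffer.getvalue()
```

In the script above, passing file=sys.stderr (tqdm's default stream) to make_pbar instead of the StringIO buffer should make the bar visible in a terminal.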

ganindu7 commented 11 months ago

Hi, thanks again for the reply!

The project is in the UI
The server (2.5) and the SDK (2.5.0) are both at version 2.5

I used the modified code below to import the dataset in chunks (is there a limitation for max size?)

# clear output (e.g. progress bar from previous run)
clear_output()

kitti_dataset_path = "training_2.zip"

pbar_out = io.StringIO()
pbar = make_notebook_pbar(file=pbar_out)

table_headers = ["Project ID", "Name"]

with make_client(**params) as client: # config will be included in the params dictionary 

    try:

        projects = client.projects.list()
        # print(tabulate([[project.id, project.name] for project in projects], headers=table_headers, tablefmt="grid"))

        # get the specific project
        my_project = client.projects.retrieve(3)

        # check if the kitti dataset is in the path specified by 'kitti_dataset_path' and it has label_2 image_2 subfolders.
        if not Path(kitti_dataset_path).exists():
            raise ValueError(f"Dataset path {kitti_dataset_path} does not exist")
        if not check_subfolders_in_zip(kitti_dataset_path, {"label_2", "image_2"}):
            raise ValueError(f"Dataset path {kitti_dataset_path} does not contain label_2 and image_2 subfolders")

        # upload dataset to the project
        my_project.import_dataset(
            format_name="KITTI 1.0",
            filename=kitti_dataset_path,
            pbar=pbar,
            status_check_period=5,
        )
        my_project.fetch()
        pbar.finish()        

    except exceptions.ApiException as e:
        print("Exception when calling ProjectsApi.create_dataset: %s\n" % e)
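The check_subfolders_in_zip helper called in the snippet above is not shown in the thread. A minimal sketch of what such a check could look like, assuming it only needs to confirm that the named folders occur somewhere in the archive (the name matches the call above, but the body is a guess, not the author's code):

```python
import io
import zipfile


def check_subfolders_in_zip(zip_path, required):
    """Return True if every name in `required` appears as a directory
    component of some entry in the archive (at any depth)."""
    found = set()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Every path part except the last is a directory; for explicit
            # directory entries ("dir/") the last part is a directory too.
            parts = name.rstrip("/").split("/")
            found.update(parts if name.endswith("/") else parts[:-1])
    return set(required) <= found


# Quick self-check with an in-memory archive shaped like the KITTI layout
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("training_1/image_2/000000.png", b"")
    archive.writestr("training_1/label_2/000000.txt", "")
has_kitti_dirs = check_subfolders_in_zip(demo, {"label_2", "image_2"})
```

This only checks folder names; it does not validate the files inside them against the KITTI format.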

I get what I want now: the dataset gets uploaded. It is just the error text that I would like to get rid of.

Here is the error I get in the notebook

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
Cell In[6], line 29
     26     raise ValueError(f"Dataset path {kitti_dataset_path} does not contain label_2 and image_2 subfolders")
     28 # upload dataset to the project
---> 29 my_project.import_dataset(
     30     format_name="KITTI 1.0",
     31     filename=kitti_dataset_path,
     32     pbar=pbar,
     33     status_check_period=5,
     34 )
     35 my_project.fetch()
     42 pass

File ~/.pyenv/versions/3.10.11/envs/TAO310/lib/python3.10/site-packages/cvat_sdk/core/proxies/projects.py:57, in Project.import_dataset(self, format_name, filename, status_check_period, pbar)
     51 """
     52 Import dataset for a project in the specified format (e.g. 'YOLO ZIP 1.0').
     53 """
     55 filename = Path(filename)
---> 57 DatasetUploader(self._client).upload_file_and_wait(
     58     self.api.create_dataset_endpoint,
     59     self.api.retrieve_dataset_endpoint,
     60     filename,
     61     format_name,
     62     url_params={"id": self.id},
     63     pbar=pbar,
     64     status_check_period=status_check_period,
     65 )
     67 self._client.logger.info(f"Annotation file '{filename}' for project #{self.id} uploaded")

File ~/.pyenv/versions/3.10.11/envs/TAO310/lib/python3.10/site-packages/cvat_sdk/core/uploading.py:317, in DatasetUploader.upload_file_and_wait(self, upload_endpoint, retrieve_endpoint, filename, format_name, url_params, pbar, status_check_period)
    313 params = {"format": format_name, "filename": filename.name}
    314 response = self.upload_file(
    315     url, filename, pbar=pbar, query_params=params, meta={"filename": params["filename"]}
    316 )
--> 317 rq_id = json.loads(response.data).get("rq_id")
    318 assert rq_id, "The rq_id was not found in the response"
    320 url = self._client.api_map.make_endpoint_url(retrieve_endpoint.path, kwsub=url_params)

File ~/.pyenv/versions/3.10.11/lib/python3.10/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    341     s = s.decode(detect_encoding(s), 'surrogatepass')
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:
    348     cls = JSONDecoder

File ~/.pyenv/versions/3.10.11/lib/python3.10/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    332 def decode(self, s, _w=WHITESPACE.match):
    333     """Return the Python representation of ``s`` (a ``str`` instance
    334     containing a JSON document).
    335 
    336     """
--> 337     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338     end = _w(s, end).end()
    339     if end != len(s):

File ~/.pyenv/versions/3.10.11/lib/python3.10/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    353     obj, end = self.scan_once(s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

It looks like the response does not contain "rq_id" (it is None), and uploading.py:317 is raising the JSONDecodeError:

response = self.upload_file(
    315     url, filename, pbar=pbar, query_params=params, meta={"filename": params["filename"]}
    316 )
--> 317 rq_id = json.loads(response.data).get("rq_id")

I added a print statement (in cvat_sdk/core/uploading.py) to check whether response.data carries any data:

    317 print(f"DEBUG: response data = {response.data}")
--> 318 rq_id = json.loads(response.data).get("rq_id")

I got

DEBUG: response data = b''

Am I packaging the request incorrectly?

Fixing this will help me upload a set of dataset segments as jobs into a single project (we do this to audit a dataset that was previously used in training).

So thanks a lot for the continued support!

zhiltsov-max commented 11 months ago

If there is no rq_id in the reply and the status is 202, then the server is probably using some older version. The rq_id response was added in https://github.com/opencv/cvat/pull/5909.
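The failure mode above (an empty body handed to json.loads) can be turned into a clearer message. A minimal sketch of a defensive parse, not the SDK's actual code; extract_rq_id is a hypothetical helper name:

```python
import json


def extract_rq_id(status, body):
    """Return the request ID from an upload response, or raise a readable error."""
    if not body:
        # An empty body is what produces "Expecting value: line 1 column 1 (char 0)"
        raise RuntimeError(
            f"Upload returned status {status} with an empty body; "
            "the server may predate the rq_id response (check server/SDK versions)"
        )
    rq_id = json.loads(body).get("rq_id")
    if not rq_id:
        raise RuntimeError("The rq_id was not found in the response")
    return rq_id


# Body taken from a response that does carry the request ID
rq_id = extract_rq_id(202, b'{"rq_id":"import:project-3-dataset-by-ganindu"}')
```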

ganindu7 commented 11 months ago

Cheers,

I updated the repo; now at 2896bec3d4f19b392d24e8119ff085793a550b34 (confirmed with: git rev-parse HEAD)
Cleaned up the Docker images: docker images -q | xargs docker rmi
Set up envs (export CVAT_HOST=my.local.cvat.hostpath) and docker-compose overrides to point to storage
Rebuilt and launched with: docker compose up -d --build

Then I set up some extra debugging:

312         url = self._client.api_map.make_endpoint_url(upload_endpoint.path, kwsub=url_params)
313         params = {"format": format_name, "filename": filename.name}
314         response = self.upload_file(
315             url, filename, pbar=pbar, query_params=params, meta={"filename": params["filename"]}
316         )
317         print(f"DEBUG: response status_code = {response.status}")
318         print(f"DEBUG: response reason = {response.reason}")
319         print(f"DEBUG: response headers = {response.headers}")
320         print(f"DEBUG: response data = {response.data}")

Confirmed with the output that the issue is fixed!

DEBUG: response status_code = 202
DEBUG: response reason = Accepted
DEBUG: response headers = HTTPHeaderDict({'Allow': 'GET, POST, HEAD, OPTIONS', 'Content-Length': '47', 'Content-Type': 'application/vnd.cvat+json', 'Cross-Origin-Opener-Policy': 'same-origin', 'Date': 'Mon, 24 Jul 2023 12:06:11 GMT', 'Referrer-Policy': 'same-origin', 'Server': 'nginx/1.18.0 (Ubuntu)', 'Vary': 'Accept, Origin', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'DENY', 'X-Request-Id': '7bf391c6-6474-4599-80fa-5f4a05264607'})
DEBUG: response data = b'{"rq_id":"import:project-3-dataset-by-ganindu"}'

Thanks a lot Maxim!!

Best, Ganindu.

liudaolunboluo commented 5 months ago

Can make_client authenticate with a token?

zhiltsov-max commented 5 months ago

@liudaolunboluo, answered in #7439

YaoJusheng commented 1 month ago

I also encountered a similar question,

So I want to know how to import a project dataset through rest api? Has anyone ever achieved this completely?

Or does it have to be implemented through SDK as mentioned above?

liudaolunboluo commented 1 month ago

I also encountered a similar question,

So I want to know how to import a project dataset through rest api? Has anyone ever achieved this completely?

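Or does it have to be implemented through SDK as mentioned above?

I'd recommend implementing this with the SDK. Doing it through the REST API is very complicated, because the upload relies on an external upload component and the whole process is asynchronous, so you have to handle waiting for the result yourself.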

zhiltsov-max commented 1 month ago

@YaoJusheng, the uploading is available, it uses the TUS file uploading protocol. You can implement it manually or use CVAT SDK for uploading, as shown in https://github.com/cvat-ai/cvat/issues/6525#issuecomment-1643731268 .

YaoJusheng commented 1 month ago

Right, from the issue I understand that a special protocol is used. However, we have built our own scheduling and management logic on top of the REST API to interact with CVAT, so switching to the SDK for just one endpoint doesn't seem appropriate. I'll look into it further.

YaoJusheng commented 1 month ago

Ok, thanks for the reply, I will refer to it.

liudaolunboluo commented 1 month ago

Your scenario is the same as ours: we also integrated CVAT into our own system through the REST API and need to do the upload and import from within that system. My solution was to write an adapter (or forwarder) in Python, since we also run automatic annotation in Python before uploading; you may find that approach useful. The official API documentation doesn't specifically describe the import upload, so I spent quite a while studying it in the browser dev tools (F12); it uses the tus component. Also, CVAT's error handling when processing uploaded datasets is very poor: for XML formats, even one extra space or newline makes the upload fail, and the real error message is not printed.

YaoJusheng commented 1 month ago

OK, thank you very much. I just looked at TUS, and combined with the CVAT requests the flow seems quite simple: create the resource -> check the upload status -> upload in chunks -> handle resuming interrupted uploads.
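The TUS flow described above (create the resource, check the upload status, upload in chunks, resume after an interruption) can be sketched without any network calls. The header names below come from the TUS 1.0.0 protocol; the function names are hypothetical, and the CVAT-specific parts (endpoint URLs, authentication, and any extra headers CVAT expects around the upload) are deliberately left out because they are not spelled out in this thread:

```python
import base64

TUS_VERSION = "1.0.0"


def creation_headers(size, metadata):
    """Headers for the TUS creation request (POST) announcing a new upload."""
    # Upload-Metadata is a comma-separated list of "key base64(value)" pairs
    encoded = ",".join(
        f"{key} {base64.b64encode(value.encode()).decode()}"
        for key, value in metadata.items()
    )
    return {
        "Tus-Resumable": TUS_VERSION,
        "Upload-Length": str(size),
        "Upload-Metadata": encoded,
    }


def patch_headers(offset):
    """Headers for one chunk upload (PATCH) starting at byte `offset`."""
    return {
        "Tus-Resumable": TUS_VERSION,
        "Upload-Offset": str(offset),
        "Content-Type": "application/offset+octet-stream",
    }


def plan_chunks(size, chunk_size, resume_from=0):
    """Yield (offset, length) pairs; `resume_from` is the offset the server
    reports (via HEAD) when an interrupted upload is resumed."""
    offset = resume_from
    while offset < size:
        length = min(chunk_size, size - offset)
        yield offset, length
        offset += length


headers = creation_headers(25, {"filename": "training_1.zip"})
chunks = list(plan_chunks(size=25, chunk_size=10))
resumed = list(plan_chunks(size=25, chunk_size=10, resume_from=20))
```

Each (offset, length) pair corresponds to one PATCH request; after an interruption, a HEAD request returns the current Upload-Offset, which is passed as resume_from.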