cvat-ai / cvat

Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
https://cvat.ai
MIT License

Attach data to a task: better MIME type detection #8346

Open deltheil opened 2 weeks ago

deltheil commented 2 weeks ago

Is your feature request related to a problem? Please describe.

Context

I am uploading image files via https://app.cvat.ai/api/docs/#tag/tasks/operation/tasks_create_data (using the client_files parameter).

In my case, the image files are stored on disk in a content-addressable manner, mimicking how git stores and names files. For example, a JPEG file would typically be stored as /var/misc/images/1f/ec4f5cee029f96c1e9eddd09821a51c0a9f80a, with no extension.

Problem

The problem is that the CVAT engine's MIME type detection is based on file extensions.

For example, _is_image builds on mimetypes.guess_type (https://docs.python.org/3/library/mimetypes.html#mimetypes.guess_type):

import mimetypes

def _is_image(path):
    # guess_type returns a (type, encoding) tuple derived from the file name only
    mime = mimetypes.guess_type(path)
    # Exclude vector graphic images because Pillow cannot work with them
    return mime[0] is not None and mime[0].startswith('image') and \
        not mime[0].startswith('image/svg')

tl;dr

In my case, every uploaded image file is ignored, because none of them has an extension.
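To make the failure concrete, here is a minimal sketch that applies the same extension-based check to the example path from the Context section. The helper name is_image_by_extension is only an illustrative copy of the quoted snippet, not a CVAT function.

```python
import mimetypes

def is_image_by_extension(path):
    # Same logic as the quoted _is_image snippet, renamed for illustration.
    mime = mimetypes.guess_type(path)
    return mime[0] is not None and mime[0].startswith('image') and \
        not mime[0].startswith('image/svg')

# A content-addressed name has no extension, so guess_type has nothing to go on.
hashed = '/var/misc/images/1f/ec4f5cee029f96c1e9eddd09821a51c0a9f80a'

print(mimetypes.guess_type(hashed))            # (None, None)
print(is_image_by_extension(hashed))           # False
print(is_image_by_extension(hashed + '.jpg'))  # True
```

The file content is never read, so a valid JPEG stored under a bare hash is indistinguishable from an unknown file.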

Describe the solution you'd like

I think it would be great if MIME type detection could be extended to support magic detection (file headers), e.g. using https://github.com/ahupp/python-magic or anything equivalent. In other words, detection should not be limited to file extensions (.jpg, etc.).

NB: I am talking about images, but the same could of course be done for other media types.
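As a rough illustration of what header-based detection could look like, the sketch below uses only the standard library rather than python-magic. The function names (sniff_image_mime, guess_image_mime) and the short signature table are assumptions made for this example, not part of CVAT. The extension lookup stays as the fast path, and the file is opened only when that lookup finds nothing.

```python
import mimetypes

# Leading bytes ("magic numbers") of a few raster formats.
# Deliberately not exhaustive; this is an illustration only.
_SIGNATURES = (
    (b'\xff\xd8\xff', 'image/jpeg'),
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'GIF87a', 'image/gif'),
    (b'GIF89a', 'image/gif'),
)

def sniff_image_mime(path):
    """Return an image MIME type guessed from the file header, or None."""
    with open(path, 'rb') as f:
        header = f.read(16)
    for magic, mime in _SIGNATURES:
        if header.startswith(magic):
            return mime
    return None

def guess_image_mime(path):
    """Try the cheap extension lookup first; read the file only if it fails."""
    mime = mimetypes.guess_type(path)[0]
    if mime is None:
        mime = sniff_image_mime(path)
    return mime
```

With this ordering, files that already carry an extension cost no extra I/O, and only extensionless files pay for a 16-byte read.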

Describe alternatives you've considered

As a workaround, I am forced to rename the files (add an extension) at upload time.
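For reference, this workaround can be sketched as a small staging step that copies the files under names with an extension before they are passed as client_files. The helper name stage_with_extension is hypothetical, and the sketch assumes the caller already knows the real format of the files (e.g. that they are all JPEGs).

```python
import shutil
import tempfile
from pathlib import Path

def stage_with_extension(paths, extension, dest_dir=None):
    """Copy files into dest_dir under names that end with `extension`."""
    dest = Path(dest_dir or tempfile.mkdtemp(prefix='cvat-upload-'))
    dest.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in map(Path, paths):
        # Prefix with the parent directory name so the git-style
        # "1f/ec4f..." layout still yields unique file names.
        target = dest / f'{src.parent.name}{src.name}{extension}'
        shutil.copyfile(src, target)
        staged.append(target)
    return staged
```

The original content-addressed store is left untouched; only the staged copies are uploaded.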

bsekachev commented 2 weeks ago

Hello,

python-magic is significantly slower. We used it in the past, but it was decided to work with extensions.

Additionally, it will not work with cloud storages as CVAT needs to download file content -> much much slower.

deltheil commented 2 weeks ago

> python-magic is significantly slower. We used it in the past, but it was decided to work with extensions.

Right, that's a drawback.

> Additionally, it will not work with cloud storages as CVAT needs to download file content -> much much slower.

True. Perhaps the Content-Type HTTP header and/or HEAD requests could be leveraged here; I am not sure how this is handled right now.
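A minimal sketch of the Content-Type idea, using only the standard library. Both helper names are hypothetical, and nothing here reflects how CVAT actually talks to cloud storage. It also assumes the storage backend reports a meaningful Content-Type, which may not hold: objects uploaded without an explicit type are often served with a generic one such as application/octet-stream.

```python
import urllib.request
from email.message import Message

def is_image_content_type(header_value):
    """True for image/* media types except SVG, mirroring the extension check."""
    if not header_value:
        return False
    msg = Message()
    msg['Content-Type'] = header_value
    mime = msg.get_content_type()  # lowercased, parameters stripped
    return mime.startswith('image/') and not mime.startswith('image/svg')

def head_content_type(url, timeout=10):
    """Issue a HEAD request and return the Content-Type header, if any."""
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get('Content-Type')
```

A HEAD request transfers headers only, so the check avoids downloading the file body.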

For context: when using the FiftyOne built-in CVAT integration, this even turns into a bug as _get_job_ids polls forever (and no job is ever returned).