HumanSignal / label-studio-converter

Tools for converting Label Studio annotations into common dataset formats
https://labelstud.io/

YOLO Export: download of imported and uploaded images does not work together with LABEL_STUDIO_HOST #105

Open ferenc-hechler opened 2 years ago

We have installed Label Studio in a Kubernetes cluster, and the LABEL_STUDIO_HOST environment variable is used to configure path-based routing of the form "https://[host]/[path-prefix]", e.g. "https://myhost.com/ls-path".

As a result, in the annotation JSON export the image URL for images synchronized from the local filesystem is "https://myhost.com/ls-path/data/local-files/?d=testimgs/image1.png":

    {
        "id": 1,
        "annotations": [
            ...
        ],
        ...
        "data": {
            "image": "https:\/\/myhost.com\/ls-path\/data\/local-files\/?d=testimgs\/image1.png"
        },
        ...
        "project": 1,
        ...
    },
    {
        "id": 2,
        ...
        "data": {
            "image": "\/ls-path\/\/data\/upload\/1\/6a9a9ea2-image2.png"
        },
    }

For uploaded images the URL looks like this: "/ls-path//data/upload/1/6a9a9ea2-image2.png". I think the double "//" between "ls-path" and "data" is a bug located somewhere in the upload code. But even without the double "//", the correct location of the uploaded image cannot be determined.

Our problem is that the prefix added by LABEL_STUDIO_HOST is not recognized. So, instead of copying the files locally, the YOLO export tries to download the synchronized images from the given URLs. For the local files this would be fine, but unfortunately the download from the public hostname fails due to authentication issues. Maybe this is specific to our setup.

The uploaded images, which should be copied from the project dir, fail completely, because the converter tries to download the image path as if it were a URL.

When exporting in YOLO format the images are "downloaded" with the method "utils.download()": https://github.com/heartexlabs/label-studio-converter/blob/598420012b5cb6e9cd5283e62887ff33af36d0bb/label_studio_converter/utils.py#L103

The code has a special handling for local and uploaded files:

    is_local_file = url.startswith('/data/') and '?d=' in url
    is_uploaded_file = url.startswith('/data/upload')

So, for our image1.png the variable "is_local_file" should be true, and for our image2.png the variable "is_uploaded_file" should be true. In both cases the variables are false, because the code assumes that the URL starts with "/data/", which no longer holds once LABEL_STUDIO_HOST adds a prefix.
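The mismatch can be reproduced in isolation. The following minimal sketch applies the two checks quoted above to the example URLs from the export; the helper name `classify` and the variable names are only for illustration and are not part of label-studio-converter.

```python
# Example URLs from the JSON export above (after JSON unescaping).
local_url = "https://myhost.com/ls-path/data/local-files/?d=testimgs/image1.png"
uploaded_url = "/ls-path//data/upload/1/6a9a9ea2-image2.png"


def classify(url):
    """Apply the two checks quoted from utils.download() to a URL."""
    is_local_file = url.startswith('/data/') and '?d=' in url
    is_uploaded_file = url.startswith('/data/upload')
    return is_local_file, is_uploaded_file


# With the LABEL_STUDIO_HOST prefix, neither check matches.
print(classify(local_url))     # (False, False)
print(classify(uploaded_url))  # (False, False)

# Without the prefix, the same checks succeed.
print(classify("/data/local-files/?d=testimgs/image1.png"))  # (True, False)
print(classify("/data/upload/1/6a9a9ea2-image2.png"))        # (False, True)
```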

For the is_local_file check, the prefix should be f"{LABEL_STUDIO_HOST}/data/" and for the is_uploaded_file check it should be f"{PATH_PREFIX}/data/", where PATH_PREFIX is only the path prefix from LABEL_STUDIO_HOST, here "/ls-path".

I have no deeper understanding of how the connection between the label-studio configuration and label-studio-converter works, but the following code shows a working example of how this problem could be worked around.

I don't think that this is the correct solution, but it should point out what is missing.
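As a rough sketch of the prefix-aware checks described here, the path prefix can be taken from LABEL_STUDIO_HOST with the standard library's `urlsplit`. The helper names `split_host` and `classify` are hypothetical, and the host value is passed in as an argument instead of being read from the environment so the example is self-contained.

```python
from urllib.parse import urlsplit


def split_host(label_studio_host):
    """Return (host_url, path_prefix) for a LABEL_STUDIO_HOST value."""
    host_url = label_studio_host.rstrip('/')
    # e.g. "/ls-path", or "" when the host has no path prefix
    path_prefix = urlsplit(host_url).path
    return host_url, path_prefix


def classify(url, label_studio_host):
    """Prefix-aware variant of the two checks (illustrative only)."""
    host_url, path_prefix = split_host(label_studio_host)
    is_local_file = url.startswith(f'{host_url}/data/') and '?d=' in url
    is_uploaded_file = url.startswith(f'{path_prefix}/data/upload')
    return is_local_file, is_uploaded_file


host = "https://myhost.com/ls-path"
print(split_host(host))  # ('https://myhost.com/ls-path', '/ls-path')
print(classify("https://myhost.com/ls-path/data/local-files/?d=testimgs/image1.png", host))  # (True, False)
print(classify("/ls-path/data/upload/1/6a9a9ea2-image2.png", host))  # (False, True)
```

Note that this sketch does not handle the double "//" seen in the uploaded-image URL; that case would still need separate treatment.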

So, as a utility function we copied the sources from the label-studio settings: https://github.com/heartexlabs/label-studio/blob/e1111d4708e06e0f5885f397d1b904f146c6aa4c/label_studio/core/settings/base.py#L28

    import logging
    import os
    import re

    logger = logging.getLogger(__name__)

    def getEnvParams():
        FORCE_SCRIPT_NAME = None
        # Hostname is used for proper path generation to the resources, pages, etc
        # HOSTNAME = get_env('HOST', '')   # get_env() adds the prefix "LABEL_STUDIO_" or "HEARTEX_"
        HOSTNAME = os.environ.get('LABEL_STUDIO_HOST', '')

        if HOSTNAME:
            if not HOSTNAME.startswith('http://') and not HOSTNAME.startswith('https://'):
                logger.info(
                    "! HOST variable found in environment, but it must start with http:// or https://, ignore it: %s", HOSTNAME
                )
                HOSTNAME = ''
            else:
                logger.info("=> Hostname correctly is set to: %s", HOSTNAME)
                if HOSTNAME.endswith('/'):
                    HOSTNAME = HOSTNAME[0:-1]

                # for django url resolver
                if HOSTNAME:
                    # http[s]://domain.com:8080/script_name => /script_name
                    pattern = re.compile(r'^http[s]?:\/\/([^:\/\s]+(:\d*)?)(.*)?')
                    match = pattern.match(HOSTNAME)
                    # guard against values without a host part, e.g. "https://"
                    FORCE_SCRIPT_NAME = match.group(3) if match else None
                    if FORCE_SCRIPT_NAME:
                        logger.info("=> Django URL prefix is set to: %s", FORCE_SCRIPT_NAME)

        LOCAL_FILES_DOCUMENT_ROOT = os.environ.get('LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT', '')

        return HOSTNAME, FORCE_SCRIPT_NAME, LOCAL_FILES_DOCUMENT_ROOT

Now we get the required information for accessing local files:

    HOSTNAME = "https://myhost.com/ls-path"
    FORCE_SCRIPT_NAME = "/ls-path"
    LOCAL_FILES_DOCUMENT_ROOT = "/path/to/local/files"
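To check how the prefix is derived, the regular expression from the settings code can be run on a few host values. This is a small sketch that reuses only the pattern and the trailing-slash handling shown above; the helper name `script_name` is hypothetical.

```python
import re

# Same pattern as in the settings code quoted above.
pattern = re.compile(r'^http[s]?:\/\/([^:\/\s]+(:\d*)?)(.*)?')


def script_name(hostname):
    """Return the path prefix of a host value, or None if the pattern does not match."""
    if hostname.endswith('/'):
        hostname = hostname[0:-1]  # strip one trailing slash, as in getEnvParams()
    match = pattern.match(hostname)
    return match.group(3) if match else None


print(script_name("https://myhost.com/ls-path"))          # /ls-path
print(script_name("http://domain.com:8080/script_name"))  # /script_name
print(repr(script_name("https://myhost.com/")))           # ''
```

A host without a path prefix yields an empty string, so f'{FORCE_SCRIPT_NAME}/data/upload' falls back to the plain "/data/upload" prefix in that case.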

The utils.download() method can now be rewritten in the following way:

    def download(url, output_dir, filename=None, project_dir=None, return_relative_path=False, upload_dir=None,
                 download_resources=True):
        HOSTNAME, FORCE_SCRIPT_NAME, LOCAL_FILES_DOCUMENT_ROOT = getEnvParams()
        # FORCE_SCRIPT_NAME is None when LABEL_STUDIO_HOST is unset; use '' so the prefix checks still work
        FORCE_SCRIPT_NAME = FORCE_SCRIPT_NAME or ''
        print("HOSTNAME=", HOSTNAME, ", FORCE_SCRIPT_NAME=", FORCE_SCRIPT_NAME, ", LOCAL_FILES_DOCUMENT_ROOT=", LOCAL_FILES_DOCUMENT_ROOT)

        is_local_file = url.startswith(f'{HOSTNAME}/data/') and '?d=' in url
        # special handling to fix duplicate "//" before "data"
        if url.startswith(f'{FORCE_SCRIPT_NAME}//data/upload'):
            FORCE_SCRIPT_NAME = f'{FORCE_SCRIPT_NAME}/'
        is_uploaded_file = url.startswith(f'{FORCE_SCRIPT_NAME}/data/upload')

        if is_uploaded_file:
            upload_dir = _get_upload_dir(project_dir, upload_dir)
            filename = url.replace(f'{FORCE_SCRIPT_NAME}/data/upload/', '')
            filepath = os.path.join(upload_dir, filename)
            logger.debug(f'Copy {filepath} to {output_dir}')
            if download_resources:
                shutil.copy(filepath, output_dir)
            if return_relative_path:
                return os.path.join(os.path.basename(output_dir), filename)
            return filepath

        if is_local_file:
            # after the split, filename is the route (e.g. 'local-files/') and dir_path is the value of '?d='
            filename, dir_path = url.split(f'{HOSTNAME}/data/', 1)[-1].split('?d=')
            dir_path = str(urllib.parse.unquote(dir_path))
            if not os.path.exists(dir_path):
                if filename == 'local-files/':
                    print(dir_path)
                    filename = os.path.basename(dir_path)
                    dir_path = os.path.dirname(dir_path)
                    dir_path = os.path.join(LOCAL_FILES_DOCUMENT_ROOT, dir_path)
                else:
                    raise FileNotFoundError(dir_path)
            filepath = os.path.join(dir_path, filename)
            if download_resources:
                shutil.copy(filepath, output_dir)
                if return_relative_path:
                    return os.path.join(os.path.basename(output_dir), filename)
            else:
                if return_relative_path:
                    raise NotImplementedError()
            return filepath

        if filename is None:
            ...
        return filepath

For is_uploaded_file == True, no further changes are necessary besides the changed startswith() comparison.

For is_local_file == True, we had to change much more and struggled somewhat with the meaning of filename and dir_path. I tested the changed code and it works for our examples, but I do not understand how the original code was intended to work.

We tested this with a fresh Docker build from the label-studio develop branch.
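To make the roles of filename and dir_path concrete, the following sketch traces the rewritten local-file branch for image1.png. It assumes the example values listed earlier, skips the os.path.exists() check, and shows the output for a POSIX system; it only illustrates the workaround, not the converter's original logic.

```python
import os
import urllib.parse

HOSTNAME = "https://myhost.com/ls-path"
LOCAL_FILES_DOCUMENT_ROOT = "/path/to/local/files"
url = "https://myhost.com/ls-path/data/local-files/?d=testimgs/image1.png"

# Everything after "<HOSTNAME>/data/" is split at "?d=".
filename, dir_path = url.split(f'{HOSTNAME}/data/', 1)[-1].split('?d=')
print(filename)  # local-files/
print(dir_path)  # testimgs/image1.png

# At this point "filename" holds the route and "dir_path" the relative file
# path, so both are reassigned before the final path is built.
dir_path = urllib.parse.unquote(dir_path)
filename = os.path.basename(dir_path)
dir_path = os.path.join(LOCAL_FILES_DOCUMENT_ROOT, os.path.dirname(dir_path))
filepath = os.path.join(dir_path, filename)
print(filepath)  # /path/to/local/files/testimgs/image1.png
```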