HumanSignal / label-studio-converter

Tools for converting Label Studio annotations into common dataset formats
https://labelstud.io/

make create_tokens_and_tags robust against missing annotation fields #61

Open tomasohara opened 2 years ago

tomasohara commented 2 years ago

Hi, I ran into a problem with the CONLL converter failing due to a missing label in the annotations.

Although this is a problem with Label Studio proper, it would be good for the converter to be more robust. For example, any dictionary access should use h.get('k', default) rather than h['k']. The default should allow the code to produce a reasonable approximation, as follows:

span['labels'][0]
=>
span.get('labels', ["_missing_"])[0]
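
A minimal sketch of the same idea as a reusable helper. The name first_label and the "_missing_" placeholder are hypothetical and not part of the converter. Unlike the one-liner above, this version also covers a 'labels' key that is present but holds an empty list, which would otherwise raise IndexError:

```python
def first_label(span, default="_missing_"):
    """Return the first label of an annotation span, or a placeholder.

    Hypothetical helper for illustration; not part of label-studio-converter.
    """
    # .get() avoids KeyError when the 'labels' field is absent.
    labels = span.get('labels')
    # A falsy value (None or an empty list) falls back to the placeholder,
    # so indexing [0] is only attempted on a non-empty sequence.
    if not labels:
        return default
    return labels[0]


span_without_labels = {'start': 3, 'end': 6, 'text': 'dog', 'type': 'Labels'}
span_with_labels = {'start': 3, 'end': 6, 'text': 'dog', 'labels': ['ANIMAL']}

print(first_label(span_without_labels))  # _missing_
print(first_label(span_with_labels))     # ANIMAL
```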

Alternatively, having better exception handling would be good, provided it can pinpoint the source of the error.
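
As a rough sketch of what such handling could look like, the lookup can be wrapped so that the error names the offending span. The function labels_or_raise and the wording of the message are assumptions made for illustration, not the converter's actual code:

```python
def labels_or_raise(spans):
    """Collect the first label of each span, failing with a descriptive error.

    Hypothetical helper for illustration; not part of label-studio-converter.
    """
    labels = []
    for index, span in enumerate(spans):
        try:
            labels.append(span['labels'][0])
        except (KeyError, IndexError) as exc:
            # Report which span is malformed and show its content,
            # keeping the original exception as the cause.
            raise ValueError(
                f"span #{index} has no usable 'labels' field: {span!r}"
            ) from exc
    return labels


spans = [{'start': 3, 'end': 6, 'text': 'dog', 'type': 'Labels'}]
try:
    labels_or_raise(spans)
except ValueError as err:
    print(err)
```

Here the message includes the span index and the span itself, which makes it easier to locate the bad record in a large export than a bare KeyError: 'labels'.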

Here's a simple illustration:

create_tokens_and_tags("My dog has ugly fleas",
                       [{'start': 3,
                         'end': 6,
                         'text': 'dog',
                         'type': 'Labels'}])
=>
KeyError: 'labels'

The attached file _bad-annotation.json.txt contains an actual export with the issue.

See extract_tokens_and_tags in the attached file (misc_converter.py.txt), which is part of an XML extension I am working on.

As mentioned in issue #15, this needs work before it is ready for a pull request.

KonstantinKorotaev commented 2 years ago

Hi @tomasohara, the issue is fixed in the latest converter package (version 0.0.37). Could you please check whether it solves your problem?