HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Can't export to CONLL using valueType="url" settings (NER Annotation) #1890

Open guilhermenoronha opened 2 years ago

guilhermenoronha commented 2 years ago

Describe the bug

I'm not sure whether this is really a bug or the expected behavior of Label Studio. I imported several text files in a row using the time series option with valueType="url" set. The task I'm performing is word-level named entity recognition (NER) with custom tags. When I export the annotations, the generated CONLL file contains the URLs of the imported files instead of the words with their annotated tags.

PS: I also tried the saveTextResult="yes" option, but it didn't produce the expected result.

To Reproduce

Steps to reproduce the behavior:

  1. Import some text files using the valueType="url" option as time series.
  2. Annotate some data.
  3. Click on export selecting the CONLL2003 option.

The behavior I got


-DOCSTART- -X- O
/data/upload/3/b5608bfa-61313b676e6be5740da077a0.txt -X- _ O

Expected behavior

-DOCSTART- -X- O
Ouro -X- _ O
Preto -X- _ O
20 -X- _ O
de -X- _ O
fevereiro -X- _ O
de -X- _ O
1866 -X- _ O
Souza -X- _ B-PER
Carvalho -X- _ I-PER
E -X- _ O
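
For reference, here is a minimal sketch of how character-level spans would map onto rows like the ones above. It is not Label Studio's exporter: the function name `spans_to_conll_rows`, the whitespace tokenization, and the `(start, end, label)` span layout are assumptions made for illustration only.

```python
import re

def spans_to_conll_rows(text, spans):
    """Turn character spans into CoNLL-style rows with BIO tags.

    spans: list of (start, end, label) character offsets, end exclusive
    (a hypothetical layout chosen for this sketch).
    """
    rows = []
    started = set()  # indices of spans that already produced a B- tag
    for match in re.finditer(r"\S+", text):  # naive whitespace tokenization
        tag = "O"
        for i, (start, end, label) in enumerate(spans):
            if start <= match.start() < end:
                prefix = "I" if i in started else "B"
                started.add(i)
                tag = f"{prefix}-{label}"
                break
        rows.append(f"{match.group()} -X- _ {tag}")
    return rows

text = "Ouro Preto 20 de fevereiro de 1866 Souza Carvalho E"
start = text.index("Souza")
rows = spans_to_conll_rows(text, [(start, start + len("Souza Carvalho"), "PER")])
print("\n".join(rows))
```

The point of the sketch is that the tagger needs the text itself: when the task data holds only a file path, there are no tokens to tag.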

KonstantinKorotaev commented 2 years ago

Hi @guilhermenoronha, this is expected behavior. You can remove the valueType="url" option to get the behavior you want.

guilhermenoronha commented 2 years ago

Hi @KonstantinKorotaev, thanks for your answer. I deleted the valueType="url" option, but the old behavior persists. I had already imported and annotated the data as time series. Does deleting this option affect anything else? Attached are my annotations after removing valueType. Everything seems messed up now.

[Attached screenshot: Captura de tela de 2022-01-03 15-32-01]

KonstantinKorotaev commented 2 years ago

Hi @guilhermenoronha, you will need to reimport all tasks with their texts. For example, you can do it this way:

  1. Put valueType="url" back into your config
  2. Export your annotations as JSON
  3. Download the texts with a script and save them to a new JSON file:

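    import json
    import requests

    # Placeholder file names: adjust them to your own export
    export_filename = 'export.json'
    new_filename = 'export_with_texts.json'

    with open(export_filename, mode='r') as f:
        data = json.load(f)

    for each in data:
        # Assumes a single data field whose value is the text file's URL
        key = list(each['data'].keys())[0]
        r = requests.get(each['data'][key])
        r.raise_for_status()  # stop on a failed download
        each['data'][key] = r.text  # replace the URL with the text itself

    with open(new_filename, mode='w') as f:
        json.dump(data, f)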
  4. Import the new file into a project without valueType="url" (you can create a copy of your project for this)

If your config is more complicated, adapt the instructions inside the loop accordingly.
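
As one possible adaptation for a config with several data fields, the loop body could replace only the values that look like uploaded file paths. This is a sketch under assumptions: the helper name `inline_uploaded_texts`, the `fetch_text` callable, and the `/data/upload/` prefix check (taken from the path shown in the CONLL output earlier in this thread) are illustrative, and the file name in the demo is made up.

```python
def inline_uploaded_texts(task, fetch_text, prefix="/data/upload/"):
    """Replace every data value that looks like an upload path with its text.

    fetch_text is any callable that maps a path to the file's contents,
    for example a wrapper around requests.get(...).text.
    """
    for key, value in task["data"].items():
        if isinstance(value, str) and value.startswith(prefix):
            task["data"][key] = fetch_text(value)
    return task

# Demo with a stub fetcher instead of a real HTTP request.
fake_files = {"/data/upload/3/example.txt": "Ouro Preto 20 de fevereiro de 1866"}
task = {"data": {"text": "/data/upload/3/example.txt", "source": "archive"}}
inline_uploaded_texts(task, fake_files.__getitem__)
print(task["data"])
```

Passing the fetcher in as an argument keeps the download details (host, token, encoding) out of the loop, so the same helper can be tried offline first.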

guilhermenoronha commented 2 years ago

Hi @KonstantinKorotaev, thanks again. It took me some time to reply because the code you provided needed some adjustments for my project. It worked smoothly, thanks! I'm posting the fully functional code here because I believe it may be helpful to someone else.

import json
import requests

with open('your_json_file.json', mode='r', encoding='UTF-8') as f:
    data = json.load(f)

for each in data:
    key = list(each['data'].keys())[0]
    # You may have to change the base URL depending on where Label Studio is hosted.
    url = f"http://localhost:8080{each['data'][key]}"
    # You can find your token under Account & Settings, via the button in the upper-right corner of Label Studio.
    r = requests.get(url, headers={'Authorization': 'Token put_your_token_here'})
    # My text is in Portuguese, so I had to set the encoding to UTF-8.
    r.encoding = 'UTF-8'
    each['data'][key] = r.text

with open('conll.json', mode='w', encoding='UTF-8') as f:
    json.dump(data, f)

Although your solution works well and fits my problem, I found it a bit clumsy to execute. For a typical user who needs to export several tasks in CONLL format, it doesn't seem like a very natural step to run in a production environment, right? That said, don't you think it would be best to file a feature request for exporting TimeSeries texts in CONLL format?

Sincerely.
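
Before importing the rewritten file, a quick check that the annotation offsets still line up with the downloaded text can catch encoding problems early. The sketch below assumes the exported JSON keeps spans under `annotations` → `result` → `value` with `start`, `end`, and `text` keys, and that the text lives under a data key named `text`; both are assumptions to verify against your own export, and `find_misaligned_spans` is a hypothetical helper name. The sample tasks and labels are made up.

```python
def find_misaligned_spans(tasks, text_key="text"):
    """Return (task id, start, end) for spans whose stored text differs
    from the slice of the task text at the same offsets."""
    problems = []
    for task in tasks:
        text = task["data"][text_key]
        for annotation in task.get("annotations", []):
            for region in annotation.get("result", []):
                value = region.get("value", {})
                # Skip regions that do not carry a text span to compare.
                if not all(k in value for k in ("start", "end", "text")):
                    continue
                if text[value["start"]:value["end"]] != value["text"]:
                    problems.append((task.get("id"), value["start"], value["end"]))
    return problems

tasks = [
    {"id": 1, "data": {"text": "Souza Carvalho"},
     "annotations": [{"result": [{"value": {
         "start": 0, "end": 5, "text": "Souza", "labels": ["PER"]}}]}]},
    {"id": 2, "data": {"text": "Ouro Preto"},
     "annotations": [{"result": [{"value": {
         "start": 0, "end": 5, "text": "Preto", "labels": ["LOC"]}}]}]},
]
problems = find_misaligned_spans(tasks)
print(problems)
```

An empty list suggests the offsets survived the round trip. Regions without a stored `text` value are skipped, so an empty list is not proof on its own.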