HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

JSON_MIN export contains bounding boxes in transcriptions field #4504

Open csanadpoda opened 1 year ago

csanadpoda commented 1 year ago

When exporting labels in the JSON_MIN format, sometimes bounding box information is added to the transcriptions instead of the transcriptions themselves.

To Reproduce

Steps to reproduce the behavior:

  1. Go to Label Studio
  2. Label a bunch of documents
  3. Go back to the project view
  4. Click Export
  5. Choose JSON_MIN
  6. The exported JSON sometimes has bounding boxes in place of transcriptions, like this:
{
    "ocr": "\/data\/upload\/1\/filename.jpg",
    "id": 22,
    "bbox": [
      {
        "x": 12.924394307721723,
        "y": 8.806096528365792,
        "width": 17.47186637895714,
        "height": 2.0321761219305667,
        "rotation": 0,
        "rectanglelabels": [
          "Field 1"
        ],
        "original_width": 2480,
        "original_height": 3505
      },
      {
        "x": 3.561753840591374,
        "y": 64.35224386113462,
        "width": 3.3507688945945233,
        "height": 1.3547840812870442,
        "rotation": 0,
        "rectanglelabels": [
          "Field 2"
        ],
        "original_width": 2480,
        "original_height": 3505
      }
    ],
    "transcription": [
      "transcribed text 1\n\f", <------THIS IS CORRECTLY TRANSCRIBED TEXT!
      { <------THIS SHOULD BE ANOTHER TEXT, NOT A BBOX!
        "x": 13.469188995051033,
        "y": 54.451746318841145,
        "width": 7.898240965829942,
        "height": 1.5241320914479173,
        "rotation": 0,
        "text": [],
        "original_width": 2480,
        "original_height": 3505
      }
    ],
    "annotator": 4,
    "annotation_id": 21,
    "created_at": "2023-07-04T12:26:00.911143Z",
    "updated_at": "2023-07-05T13:00:04.917034Z",
    "lead_time": 1663.167
  }

Expected behavior

I'd expect JSON_MIN to have the following format:

{
    "ocr": "\/data\/upload\/1\/filename.jpg",
    "id": 22,
    "bbox": [
      {
        "x": 12.924394307721723,
        "y": 8.806096528365792,
        "width": 17.47186637895714,
        "height": 2.0321761219305667,
        "rotation": 0,
        "rectanglelabels": [
          "Field 1"
        ],
        "original_width": 2480,
        "original_height": 3505
      },
      {
        "x": 3.561753840591374,
        "y": 64.35224386113462,
        "width": 3.3507688945945233,
        "height": 1.3547840812870442,
        "rotation": 0,
        "rectanglelabels": [
          "Field 2"
        ],
        "original_width": 2480,
        "original_height": 3505
      }
    ],
    "transcription": [
      "transcribed text 1\n\f",
      "transcribed text 2\n\f"
    ],
    "annotator": 4,
    "annotation_id": 21,
    "created_at": "2023-07-04T12:26:00.911143Z",
    "updated_at": "2023-07-05T13:00:04.917034Z",
    "lead_time": 1663.167
  }

Additional context

I wonder whether this is user error or what else may cause it, as it's breaking my data transformation scripts: I'm not expecting bbox information among my transcriptions. The bboxes seem to appear mostly when no actual transcription was added, just empty text. But empty text is also important information in my use case.
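To check the "empty text" hunch against a real export, something like the following could help. It is only a sketch: it assumes the export has been loaded as a list of task dicts shaped like the one above, and the helper name `find_non_string_transcriptions` and the trimmed `sample` task are made up for illustration. Treating a bare (non-list) value as a single entry is a defensive assumption, not something confirmed in this thread.

```python
def find_non_string_transcriptions(tasks, field="transcription"):
    """Return (task id, entry index, has_empty_text) for every dict entry.

    Hypothetical helper: `tasks` is assumed to be the parsed JSON_MIN export,
    i.e. a list of dicts that each carry a `transcription` field.
    """
    hits = []
    for task in tasks:
        entries = task.get(field, [])
        # Defensive assumption: wrap a bare value so it is handled like a list.
        if not isinstance(entries, list):
            entries = [entries]
        for index, entry in enumerate(entries):
            if isinstance(entry, dict):
                # True when the region dict carries an empty "text" list.
                hits.append((task.get("id"), index, entry.get("text") == []))
    return hits


# Trimmed-down version of the task from the report above.
sample = [
    {
        "id": 22,
        "transcription": [
            "transcribed text 1\n\f",
            {"x": 13.469188995051033, "rotation": 0, "text": []},
        ],
    }
]

print(find_non_string_transcriptions(sample))  # [(22, 1, True)]
```

If every hit comes back with `True` in the last position, that would support the idea that only transcriptions with empty text are affected.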

hogepodge commented 1 year ago

@csanadpoda Can you share your labeling interface with us? That will help us to determine if this is a Label Studio issue, or a configuration issue.

csanadpoda commented 1 year ago

So I've done some digging, and it only happens in one specific case. I have Tesseract OCR as my machine learning backend, wired in via label-studio-ml-backend, and when you add a new label it tries to read the text inside the rectangle. That all works fine, EXCEPT if you delete a transcription after you've labeled the region; that's when the issue arises. By default, if you label an empty space, it puts in a Form Feed character (which shows up as \f in the transcription field of the JSON_MIN export), BUT if you delete a recognized label's text value, you also delete the Form Feed character, and the field falls back to the "Recognized Text" placeholder.

So basically these fields are fine:

MicrosoftTeams-image (2)

because even though they look empty, they contain the Form Feed character. However, if you go in and delete the content, the field reverts to this:

MicrosoftTeams-image (3)

and any field that looks like this will then have a bounding box in the transcription.

So for example the line for the first image in transcriptions would be:

...
"transcription": [
      "\f"
]
...

and the one for the second image would be:

"transcription": [
      {
        "x": 39.68099960967038,
        "y": 34.65627214741319,
        "width": 8.313567362428842,
        "height": 1.2048192771084274,
        "rotation": 0,
        "text": [],
        "original_width": 2480,
        "original_height": 3505
      }
]

But since it's empty on the labeling interface, I'd expect it to be an empty string, not a bounding box dictionary.
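As a stopgap on the consuming side, a data transformation script could coerce these entries itself. The sketch below is a workaround under stated assumptions, not a fix for the export: `normalize_transcriptions` is a hypothetical helper, it assumes every dict entry in `transcription` is a region whose `text` key holds a list of strings (empty in the cases shown here), and joining a multi-item list into one string is my own choice rather than documented behavior.

```python
def normalize_transcriptions(task, field="transcription"):
    """Return a copy of `task` whose transcription entries are all strings.

    Hypothetical helper: dict entries (region dicts with a "text" list) are
    replaced by their joined text, so an empty list becomes "".
    """
    entries = task.get(field, [])
    # Defensive assumption: wrap a bare value so it is handled like a list.
    if not isinstance(entries, list):
        entries = [entries]
    cleaned = []
    for entry in entries:
        if isinstance(entry, dict):
            # `or []` also covers a missing or null "text" key.
            cleaned.append("".join(entry.get("text") or []))
        else:
            cleaned.append(entry)
    # Shallow copy with the field replaced; the input task is not mutated.
    return {**task, field: cleaned}


# The second case from above: a region whose text was deleted.
task = {
    "id": 22,
    "transcription": [
        "\f",
        {"x": 39.68099960967038, "rotation": 0, "text": []},
    ],
}

print(normalize_transcriptions(task)["transcription"])  # ['\x0c', '']
```

This keeps one string per region, so positions in `transcription` still line up with the region order, at the cost of discarding the coordinates that the dict entries carried.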