NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License
8.92k stars 1.38k forks

json annotations translation (IOB to form) #219

Open iwaqas opened 1 year ago

iwaqas commented 1 year ago

I want to use my own dataset, which is already annotated in IOB format in a JSON file. Is it possible to use it as is, or do I need to translate it to a JSON format similar to that of the FUNSD dataset? If so, is there a (free) converter/translator available to transform the annotation format?

[Screenshot: annotation diff]

mellahysf commented 9 months ago

@iwaqas did you find any solution/way/hack for that, please?

NielsRogge commented 9 months ago

That's definitely possible, you just need a list of words and corresponding coordinates + labels for each document.
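As a rough illustration of that structure (a sketch, not code from this repository), one document page can be held as three parallel lists. The helper name `normalize_box`, the page size, and the sample values are all hypothetical. The 0-1000 scaling reflects the coordinate range that LayoutLM-style models generally expect, so check it against the processor or tokenizer you actually use.

```python
from typing import List


def normalize_box(box: List[float], width: float, height: float) -> List[int]:
    """Scale a pixel box [x0, y0, x1, y1] to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Hypothetical single-page document: three parallel lists of equal length.
page_width, page_height = 800, 1000
words = ["Invoice", "Number:", "12345"]
pixel_boxes = [[80, 50, 200, 70], [210, 50, 320, 70], [330, 50, 400, 70]]
labels = ["B-QUESTION", "I-QUESTION", "B-ANSWER"]

boxes = [normalize_box(b, page_width, page_height) for b in pixel_boxes]
example = {"words": words, "boxes": boxes, "labels": labels}
```

Any storage format that lets you rebuild these three lists per page should be enough for token classification.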

mellahysf commented 9 months ago

Thanks @NielsRogge for your reply.

Let's clarify the task. The FUNSD JSON schema is as follows:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Generated schema for Root",
  "type": "object",
  "properties": {
    "form": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "box": { "type": "array", "items": { "type": "number" } },
          "text": { "type": "string" },
          "label": { "type": "string" },
          "words": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "box": { "type": "array", "items": { "type": "number" } },
                "text": { "type": "string" }
              },
              "required": [ "box", "text" ]
            }
          },
          "linking": {
            "type": "array",
            "items": { "type": "array", "items": { "type": "number" } }
          },
          "id": { "type": "number" }
        },
        "required": [ "box", "text", "label", "words", "linking", "id" ]
      }
    }
  },
  "required": [ "form" ]
}

Our JSON schema (a single JSON file containing all the annotation details) is:

{ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Generated schema for Root", "type": "array", "items": { "type": "object", "properties": { "completions": { "type": "array", "items": { "type": "object", "properties": { "created_username": { "type": "string" }, "created_ago": { "type": "string" }, "result": { "type": "array", "items": { "type": "object", "properties": { "original_width": { "type": "number" }, "original_height": { "type": "number" }, "image_rotation": { "type": "number" }, "value": { "type": "object", "properties": { "x": { "type": "number" }, "y": { "type": "number" }, "width": { "type": "number" }, "height": { "type": "number" }, "rotation": { "type": "number" }, "x_px": { "type": "number" }, "y_px": { "type": "number" }, "width_px": { "type": "number" }, "height_px": { "type": "number" }, "rectanglelabels": { "type": "array", "items": { "type": "string" } }, "text": { "type": "array", "items": { "type": "string" } }, "confidence": { "type": "number" } }, "required": [ "x", "y", "width", "height", "rotation", "x_px", "y_px", "width_px", "height_px", "rectanglelabels", "text", "confidence" ] }, "tokens": { "type": "array", "items": { "type": "object", "properties": { "x": { "type": "number" }, "y": { "type": "number" }, "width": { "type": "number" }, "height": { "type": "number" }, "rotation": { "type": "number" }, "text": { "type": "string" } }, "required": [ "x", "y", "width", "height", "rotation", "text" ] } }, "id": { "type": "string" }, "from_name": { "type": "string" }, "to_name": { "type": "string" }, "type": { "type": "string" }, "sections": { "type": "array", "items": {} }, "pageNumber": { "type": "number" } }, "required": [ "original_width", "original_height", "image_rotation", "value", "tokens", "id", "from_name", "to_name", "type", "sections", "pageNumber" ] } }, "honeypot": { "type": "boolean" }, "lead_time": { "type": "number" }, "id": { "type": "number" }, "confidence_range": { "type": "array", "items": { "type": 
"number" } }, "updated_at": { "type": "string" }, "updated_by": { "type": "string" }, "submitted_at": { "type": "string" }, "edit_time": { "type": "number" }, "total_edit_time": { "type": "number" }, "copied_from": { "type": "string" }, "edits": { "type": "number" }, "cid": { "type": "string" }, "data_type": { "type": "string" } }, "required": [ "created_username", "created_ago", "result", "honeypot", "lead_time", "id", "confidence_range", "updated_at", "updated_by", "submitted_at" ] } }, "predictions": { "type": "array", "items": {} }, "created_at": { "type": "string" }, "created_by": { "type": "string" }, "data": { "type": "object", "properties": { "image": { "type": "array", "items": { "type": "string" } }, "ocr_text": { "type": "array", "items": { "type": "array", "items": { "type": "object", "properties": { "x": { "type": "number" }, "y": { "type": "number" }, "width": { "type": "number" }, "height": { "type": "number" }, "rotation": { "type": "number" }, "text": { "type": "string" } }, "required": [ "x", "y", "width", "height", "rotation", "text" ] } } }, "ocr_plain_text": { "type": "string" }, "title": { "type": "string" } }, "required": [ "image", "ocr_text", "ocr_plain_text", "title" ] }, "id": { "type": "number" }, "sections": { "type": "array", "items": {} } }, "required": [ "completions", "predictions", "created_at", "created_by", "data", "id", "sections" ] } }

  1. How do I go from one JSON file with the described schema to multiple JSON files with the FUNSD schema format?
  2. What are the required fields for the FUNSD format (to fine-tune a LiLT-based model)?
  3. We don't have linking (we annotated only the data to be detected or extracted, i.e. the values), but it can still work, right? If so, does it affect the LiLT model's performance after training?

Thank you so much, @NielsRogge, for your effort and help.

iwaqas commented 9 months ago

> @iwaqas did you find any solution/way/hack for that, please?

Ahhh! Have not been lucky enough :(

NielsRogge commented 9 months ago

As said above, the only thing you need for each document page is a list of words with their corresponding coordinates (bounding boxes) and labels. For FUNSD, you can see this in the screenshot you shared above: each document provides its words, their bounding boxes, and their labels.

This is all you need to fine-tune a LayoutLM or LiLT model (i.e. xxxForTokenClassification) to extract entities from a given document. Whether they are stored in a JSON or a txt file does not really matter.
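For the single-file export described earlier in the thread, a conversion to one FUNSD-style file per task could look roughly like the sketch below. It rests on several assumptions that are not confirmed anywhere in this thread: the first entry in `completions` is the annotation to keep, `x`/`y`/`width`/`height` are percentages (0-100) of the page, each region carries a single label in `rectanglelabels`, and FUNSD-style boxes are pixel `[x0, y0, x1, y1]`. The function names are hypothetical, and `pageNumber` is ignored, so multi-page tasks would need to be grouped by page first.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def percent_to_pixel_box(region: Dict[str, Any], page_w: float, page_h: float) -> List[int]:
    """Turn x/y/width/height into [x0, y0, x1, y1] in pixels.

    Assumption: the four values are percentages (0-100) of the page size.
    """
    x0 = region["x"] * page_w / 100
    y0 = region["y"] * page_h / 100
    x1 = (region["x"] + region["width"]) * page_w / 100
    y1 = (region["y"] + region["height"]) * page_h / 100
    return [round(x0), round(y0), round(x1), round(y1)]


def task_to_funsd(task: Dict[str, Any]) -> Dict[str, Any]:
    """Build a FUNSD-style {"form": [...]} dict from one exported task."""
    # Assumption: the first completion is the one to keep.
    results = task["completions"][0]["result"]
    form = []
    for entity_id, result in enumerate(results):
        value = result["value"]
        page_w, page_h = result["original_width"], result["original_height"]
        form.append({
            "id": entity_id,
            "label": value["rectanglelabels"][0],  # assumption: one label per region
            "text": " ".join(value["text"]),
            "box": percent_to_pixel_box(value, page_w, page_h),
            "words": [
                {"text": tok["text"], "box": percent_to_pixel_box(tok, page_w, page_h)}
                for tok in result["tokens"]
            ],
            "linking": [],  # nothing annotated; left empty
        })
    return {"form": form}


def write_funsd_files(tasks: List[Dict[str, Any]], out_dir: str) -> None:
    """Write one FUNSD-style JSON file per task, named after the task id."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for task in tasks:
        path = out / f"{task['id']}.json"
        path.write_text(json.dumps(task_to_funsd(task), indent=2), encoding="utf-8")


# Hypothetical task with one labelled region made of two tokens.
sample_task = {
    "id": 1,
    "completions": [{
        "result": [{
            "original_width": 1000,
            "original_height": 2000,
            "value": {"x": 10.0, "y": 5.0, "width": 20.0, "height": 2.0,
                      "rectanglelabels": ["NAME"], "text": ["John Doe"]},
            "tokens": [
                {"x": 10.0, "y": 5.0, "width": 8.0, "height": 2.0, "rotation": 0, "text": "John"},
                {"x": 20.0, "y": 5.0, "width": 10.0, "height": 2.0, "rotation": 0, "text": "Doe"},
            ],
        }],
    }],
}
doc = task_to_funsd(sample_task)
```

Calling `write_funsd_files(json.load(open("export.json")), "funsd_out")` (file and folder names made up here) would then split the single export into one file per task.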

Entity linking (also called entity relation extraction) is a separate task. It requires a separate model such as LayoutLMForRelationExtraction, which is not currently available in the Transformers library.

mellahysf commented 9 months ago

@NielsRogge clear! thank you.

ameni-ayedi commented 1 month ago

@NielsRogge When using these labels to train a LiLT model, I get a warning saying that the labels aren't NE tags, since they're not in IOB format:

usr/local/lib/python3.10/dist-packages/seqeval/metrics/sequence_labeling.py:171: UserWarning: Contact Info Section seems not to be NE tag.
warnings.warn('{} seems not to be NE tag.'.format(chunk))

Should this warning be ignored, then?
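For context on this warning: seqeval expects tags in a chunking scheme such as IOB2 (`B-XXX`, `I-XXX`, `O`), so plain labels like `Contact Info Section` trigger it. One way to avoid it is to convert the labels before training and evaluation. Below is a minimal sketch, assuming the annotations are available as per-entity word lists (as in the FUNSD `form` entries). The function name `to_iob2` and the sample data are hypothetical, and replacing spaces with underscores is only a stylistic choice.

```python
from typing import List, Tuple


def to_iob2(
    entities: List[Tuple[List[str], str]], outside: str = "O"
) -> Tuple[List[str], List[str]]:
    """Flatten (words, label) entities into word and IOB2 tag lists."""
    words: List[str] = []
    tags: List[str] = []
    for entity_words, label in entities:
        # Normalise the label, e.g. "Contact Info Section" -> "CONTACT_INFO_SECTION".
        tag_label = label.upper().replace(" ", "_")
        for i, word in enumerate(entity_words):
            words.append(word)
            if label == outside:
                tags.append(outside)
            else:
                # First word of an entity gets B-, the rest get I-.
                tags.append(("B-" if i == 0 else "I-") + tag_label)
    return words, tags


# Hypothetical page with one labelled entity and two unlabelled words.
entities = [
    (["John", "Doe"], "Contact Info Section"),
    (["Page", "1"], "O"),
]
words, tags = to_iob2(entities)
```

The resulting `tags` list lines up one-to-one with `words`, so it can be used directly as the per-word label list.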