applicaai / lambert

Publicly released code for the LAMBERT model

Input data preprocessing / task formulation #2

Closed. agademic closed this issue 3 years ago.

agademic commented 3 years ago

First of all: thank you for open-sourcing your model, and congrats on the impressive results across the various datasets!

I am attempting to reproduce your results on the SROIE (and Kleister) datasets, and I am wondering how you pre-processed the source/target data. Did you formulate the learning task as a vanilla NER task (i.e. BIO-annotating the data), as a question-answering task, or as something else?
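To clarify what I mean by BIO annotation: I would imagine token-level tags roughly like the following (the entity names are just borrowed from the SROIE task for illustration, this is not necessarily your format):

```
STARBUCKS   B-company
COFFEE      I-company
KUALA       B-address
LUMPUR      I-address
TOTAL       O
RM5.20      B-total
```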

In your kleister-charity dataset, you provide an expected.tsv file with entries such as `income_annually_in_british_pounds=10348000.00 report_date=2016-03-31`. I am not sure, though, whether that is also your final target format.

Any help is appreciated!

lukgarn commented 3 years ago

Thanks for your interest in our work!

We followed the procedure described in Section 4.1 of *Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts*. Basically, we solved a vanilla NER task, followed by normalization and aggregation of the multiple occurrences of a given entity in the document. For instance, dates were normalized to the `yyyy-mm-dd` format, as in the expected.tsv file.
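In case it helps, here is a rough Python sketch of that post-processing step (a simplified illustration, not our actual code): the NER model yields raw entity mentions, each mention is normalized (e.g. dates to `yyyy-mm-dd`), and the occurrences are then aggregated, here by majority vote over the normalized values, which is one simple aggregation strategy and not necessarily the exact one we used.

```python
from collections import Counter
from datetime import datetime
from typing import List, Optional

# Rough sketch (not the exact code we used): normalize each raw mention
# produced by the NER model, then aggregate mentions across the document.

def normalize_date(text: str) -> Optional[str]:
    """Normalize a raw date mention to yyyy-mm-dd, trying a few common formats."""
    for fmt in ("%d %B %Y", "%d/%m/%Y", "%d.%m.%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparsable mention; drop it

def aggregate(values: List[Optional[str]]) -> Optional[str]:
    """Aggregate multiple normalized mentions of one entity by majority vote."""
    kept = [v for v in values if v is not None]
    return Counter(kept).most_common(1)[0][0] if kept else None

# Example: three raw mentions of report_date returned by the NER model
mentions = ["31 March 2016", "31/03/2016", "2016-03-31"]
print(f"report_date={aggregate([normalize_date(m) for m in mentions])}")
# -> report_date=2016-03-31
```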

agademic commented 3 years ago

Great! Thanks a lot for the hints!