Document Classification

Besides identifying document boundaries, each segmented document should also be classified. We use the TANAP categories. Annotated documents are available in three collections:

https://github.com/globalise-huygens/Inventorization-and-Metadata/tree/main/Analysis%20TANAP%20categorisation
https://github.com/globalise-huygens/Inventorization-and-Metadata/tree/main/Analysis%20inventory%20numbers%20(document%20segmentation)
Generale Missiven

For each document in these collections, we can look up the TANAP category with the following steps:

Get the TANAP ID of the document
Look up the document in the TANAP sheet in Renate's annotation file
Extract the RUBRIEK from the TYPE column
Match the RUBRIEK to a top-level category annotated in the Categoriecodes tab in the same file
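
The lookup steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the TANAP sheet and the Categoriecodes tab are assumed to be already loaded into plain dictionaries, the function name `lookup_category` and the `extract_rubriek` parameter are hypothetical, and all IDs, RUBRIEK values and category names in the example data are made-up placeholders. How the RUBRIEK is actually encoded inside the TYPE column is not specified here, so the extraction is left as a pluggable function.

```python
from typing import Callable, Mapping, Optional


def lookup_category(
    tanap_id: str,
    tanap_sheet: Mapping[str, Mapping[str, str]],
    category_codes: Mapping[str, str],
    extract_rubriek: Callable[[str], str] = str.strip,
) -> Optional[str]:
    """Return the top-level category for a TANAP ID, or None if it cannot be resolved.

    tanap_sheet:     TANAP ID -> row of the TANAP sheet (must contain a "TYPE" value)
    category_codes:  RUBRIEK -> top-level category (from the Categoriecodes tab)
    extract_rubriek: assumption: turns a TYPE value into a RUBRIEK; by default
                     it only strips surrounding whitespace
    """
    row = tanap_sheet.get(tanap_id)
    if row is None:
        return None  # the document is not listed in the TANAP sheet
    rubriek = extract_rubriek(row["TYPE"])
    return category_codes.get(rubriek)  # None if the RUBRIEK has no category


# Made-up placeholder data, not taken from the real annotation file.
tanap_sheet = {
    "doc-001": {"TYPE": " rubriek-a "},
    "doc-002": {"TYPE": "rubriek-b"},
}
category_codes = {"rubriek-a": "category-1"}

print(lookup_category("doc-001", tanap_sheet, category_codes))  # category-1
print(lookup_category("doc-002", tanap_sheet, category_codes))  # None (unmapped RUBRIEK)
print(lookup_category("doc-999", tanap_sheet, category_codes))  # None (unknown TANAP ID)
```

Returning None instead of raising keeps unresolved documents visible, so they can be counted or inspected before training.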

There are 13 top-level categories. They can be combined with the document boundaries in one of three ways:

a) merge the category and the segmentation type into a single label, e.g. BEGIN-1, IN-1, END-1, etc.
b) train a model that learns jointly from document boundaries and categories and outputs BEGIN and 1 separately
c) train a separate model that categorizes documents in a dedicated step after the individual documents have been extracted
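
To make option a) concrete, here is a minimal sketch of merging and splitting such labels. It assumes, purely for illustration, that BEGIN, IN and END are the only segmentation types and that the 13 categories are numbered 1 to 13; the helper names `merge_label` and `split_label` are hypothetical.

```python
from itertools import product
from typing import List, Tuple

BOUNDARIES = ("BEGIN", "IN", "END")  # assumption: only these three segmentation types
CATEGORIES = tuple(range(1, 14))     # assumption: the 13 categories are numbered 1..13


def merge_label(boundary: str, category: int) -> str:
    """Option a): combine segmentation type and category into one label."""
    return f"{boundary}-{category}"


def split_label(label: str) -> Tuple[str, int]:
    """Recover the segmentation type and the category from a merged label."""
    boundary, category = label.rsplit("-", 1)
    return boundary, int(category)


# Every label a model trained with option a) would have to distinguish.
MERGED_LABELS: List[str] = [merge_label(b, c) for b, c in product(BOUNDARIES, CATEGORIES)]

print(merge_label("BEGIN", 1))  # BEGIN-1
print(split_label("END-13"))    # ('END', 13)
print(len(MERGED_LABELS))       # 39
```

Under these assumptions the merged label space has 3 × 13 = 39 classes, whereas option b) predicts 3 boundary labels and 13 category labels as two separate outputs. This difference may matter when some categories have few annotated examples.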
