LAHTeR / document_segmentation

Tool for segmenting and classifying document boundaries.

Document Classification #100

Closed carschno closed 3 months ago

carschno commented 4 months ago

Apart from identifying document boundaries, the segmented documents should also be classified. We use the TANAP categories. We have three collections of annotated documents:

  1. https://github.com/globalise-huygens/Inventorization-and-Metadata/tree/main/Analysis%20TANAP%20categorisation
  2. https://github.com/globalise-huygens/Inventorization-and-Metadata/tree/main/Analysis%20inventory%20numbers%20(document%20segmentation)
  3. Generale Missiven

For each collection, we can look up the TANAP category of a document with the following steps:

  1. Get the TANAP ID of the document
  2. Look up the document in the TANAP sheet in Renate's annotation file
  3. Extract the RUBRIEK from the TYPE column
  4. Match the RUBRIEK to a top-level category annotated in the Categoriecodes tab in the same file
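
The lookup steps above could be sketched roughly as follows. This is only an illustration: the column name `ID`, the in-memory structures `tanap_sheet` and `category_codes`, and all values in them are hypothetical stand-ins for the two tabs of the annotation file, and the sketch assumes the TYPE cell holds the RUBRIEK directly (possibly with surrounding whitespace).

```python
from typing import Optional

# Hypothetical stand-ins for the TANAP sheet and the Categoriecodes tab.
# Column names and values are invented for illustration only.
tanap_sheet = [
    {"ID": "1001", "TYPE": "rubriek-a"},
    {"ID": "1002", "TYPE": " rubriek-b "},
]
category_codes = {"rubriek-a": 1, "rubriek-b": 2}


def lookup_category(tanap_id: str) -> Optional[int]:
    """Return the top-level category for a TANAP ID, or None if unknown."""
    for row in tanap_sheet:
        if row["ID"] == tanap_id:
            # Assumption: the RUBRIEK is the TYPE cell with whitespace stripped.
            rubriek = row["TYPE"].strip()
            # None if the RUBRIEK has no top-level category assigned.
            return category_codes.get(rubriek)
    return None
```

Returning `None` for unknown IDs or unmapped RUBRIEK values makes it easy to count how many annotated documents cannot be assigned a category.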

There are 13 top-level categories that can be combined with the document boundaries in one of three ways:

  a) Merge the category and the segmentation type into a single label, e.g. BEGIN-1, IN-1, END-1 etc.
  b) Train a model that jointly learns from document boundaries and categories, and outputs the boundary (e.g. BEGIN) and the category (e.g. 1) separately.
  c) Train a separate model that categorizes documents in a dedicated step after the individual documents have been extracted.
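
Option a) could look roughly like the sketch below. The helper names `merge_label` and `split_label` are hypothetical, and the sketch assumes that the boundary types are exactly BEGIN, IN and END and that the categories are numbered 1 to 13; the actual label set may differ.

```python
from typing import List, Tuple

# Assumption: only these three boundary types, categories numbered 1..13.
BOUNDARIES = ["BEGIN", "IN", "END"]
N_CATEGORIES = 13


def merge_label(boundary: str, category: int) -> str:
    """Combine a boundary type and a category into one label, e.g. BEGIN-1."""
    return f"{boundary}-{category}"


def split_label(label: str) -> Tuple[str, int]:
    """Recover the boundary type and the category from a merged label."""
    boundary, category = label.rsplit("-", 1)
    return boundary, int(category)


# Full label set a single classifier would have to predict.
LABELS: List[str] = [
    merge_label(b, c) for b in BOUNDARIES for c in range(1, N_CATEGORIES + 1)
]
```

Under these assumptions the merged label set has 3 × 13 = 39 classes, so each class sees fewer training examples than in options b) and c), where boundaries and categories are predicted separately.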