Apart from identifying document boundaries, each segmented document should be classified. We use the TANAP categories. We have three collections of annotated documents. For each, we can look up the TANAP category with the following steps:

1) Open the `TANAP` sheet in Renate's annotation file.
2) Take the `RUBRIEK` from the `TYPE` column.
3) Match the `RUBRIEK` to a top-level category annotated in the `Categoriecodes` tab in the same file.

There are 13 top-level categories. They can be combined with the document boundaries in one of three ways:

a) merge the category and the segmentation type, e.g. `BEGIN-1`, `IN-1`, `END-1` etc.
b) train a model that jointly learns from document boundaries and categories, and outputs `BEGIN` and `1` as separate labels
c) train a separate model for categorizing documents in a dedicated step after extracting the individual documents
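
To make option a) concrete, here is a minimal Python sketch of how merged labels could be built and split back into their two parts. It assumes the only boundary tags are `BEGIN`, `IN` and `END`, and that the 13 top-level categories are numbered 1 to 13. The function names (`merge_label`, `split_label`, `label_space`) are hypothetical and not part of any existing code.

```python
from typing import List, Tuple

# Assumption: these are the only boundary tags in use.
BOUNDARY_TAGS = ("BEGIN", "IN", "END")
# Assumption: the 13 top-level categories are numbered 1..13.
CATEGORIES = tuple(range(1, 14))


def merge_label(boundary: str, category: int) -> str:
    """Combine a boundary tag and a category number into one label, e.g. BEGIN-1."""
    if boundary not in BOUNDARY_TAGS:
        raise ValueError(f"unknown boundary tag: {boundary!r}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return f"{boundary}-{category}"


def split_label(label: str) -> Tuple[str, int]:
    """Split a merged label back into (boundary tag, category number)."""
    boundary, _, category = label.partition("-")
    return boundary, int(category)


def label_space() -> List[str]:
    """Enumerate every merged label a single classifier would have to predict."""
    return [merge_label(b, c) for b in BOUNDARY_TAGS for c in CATEGORIES]
```

Under these assumptions the merged scheme has 3 × 13 = 39 labels, compared with 3 boundary tags plus 13 categories when the two are predicted separately as in options b) and c). `split_label` recovers the separate boundary and category from a merged prediction, so the output of option a) can still be compared against the other two setups.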