LAHTeR / document_segmentation

Tool for segmenting and classifying document boundaries.

Add heuristics for fixing BEGIN/END #95

Closed carschno closed 4 months ago

carschno commented 4 months ago

Sometimes, the model outputs sequences such as OUT,IN,IN. Logically, there should always be a BEGIN to mark the transition from OUT to IN. To fix this, add a heuristic that converts the first IN into BEGIN, giving OUT,BEGIN,IN. Likewise, IN,IN,OUT should become IN,END,OUT, so that the last IN before an OUT is marked as END.
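A minimal sketch of this heuristic, assuming the labels are available as plain strings; the function name `fix_transitions` is hypothetical and not taken from the repository:

```python
from typing import List, Sequence


def fix_transitions(labels: Sequence[str]) -> List[str]:
    """Mark unlabelled OUT->IN and IN->OUT transitions (hypothetical helper)."""
    fixed = list(labels)
    for i, label in enumerate(labels):
        if label != "IN":
            continue
        # An IN directly after OUT starts a document without a BEGIN.
        starts = i > 0 and labels[i - 1] == "OUT"
        # An IN directly before OUT ends a document without an END.
        ends = i + 1 < len(labels) and labels[i + 1] == "OUT"
        if starts and not ends:
            fixed[i] = "BEGIN"
        elif ends and not starts:
            fixed[i] = "END"
        # An isolated IN between two OUT pages is left unchanged here.
    return fixed
```

For example, `fix_transitions(["OUT", "IN", "IN"])` gives `["OUT", "BEGIN", "IN"]`. The sketch compares against the original labels only, so one correction never triggers another.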

Proposed solution

Implement a function that takes the model output scores for a page sequence and generates the labels for that sequence.

  1. basic case: use the argmax label
  2. for sequence OUT, X, IN, convert X to BEGIN
  3. for sequence IN, X, OUT, convert X to END
  4. for sequence OUT, X, OUT where X is BEGIN or END: convert X to BEGIN_END ...
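The four rules listed above could be sketched as follows. This is not the project's implementation. The label order in `LABELS`, the function name `labels_from_scores`, and the choice to apply the rules left to right on already-corrected labels are all assumptions. Any rules hidden behind the trailing "..." are not covered.

```python
from typing import List, Sequence

# Assumed label order for the score columns; the real model may differ.
LABELS = ("OUT", "BEGIN", "IN", "END", "BEGIN_END")


def labels_from_scores(
    scores: Sequence[Sequence[float]],
    label_names: Sequence[str] = LABELS,
) -> List[str]:
    """Turn per-page scores into labels (hypothetical helper)."""
    # Rule 1: start from the argmax label of every page.
    labels = [
        label_names[max(range(len(row)), key=row.__getitem__)]
        for row in scores
    ]
    # Rules 2-4: inspect each page X together with both neighbours.
    # The previous neighbour is read after its own correction (assumption).
    for i in range(1, len(labels) - 1):
        prev, cur, nxt = labels[i - 1], labels[i], labels[i + 1]
        if prev == "OUT" and nxt == "IN":
            labels[i] = "BEGIN"  # rule 2
        elif prev == "IN" and nxt == "OUT":
            labels[i] = "END"  # rule 3
        elif prev == "OUT" and nxt == "OUT" and cur in ("BEGIN", "END"):
            labels[i] = "BEGIN_END"  # rule 4
    return labels
```

As in the list, X is treated as any label in rules 2 and 3. The first and last pages have only one neighbour, so this sketch leaves their argmax labels untouched.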