DataCatalogue / grobid-datacat

A GROBID module for extracting and automatically structuring sale catalogues into TEI-XML format.

Simpler high level segmentation - line parsing - mixed #5

Open HugoSchtr opened 2 years ago

HugoSchtr commented 2 years ago

Since the model's performance on the high-level segmentation does not go above ~45% precision/recall/F1, we are trying a new, much simpler segmentation.

As in the previous experiments, we are aiming for very good performance on body, which contains all the sale catalogue entries.

HugoSchtr commented 2 years ago

Let's test training on a sample of re-annotated files:

> Task :train_datacat-segmenter
16:03:34.793 [main] DEBUG org.grobid.core.utilities.GrobidProperties - synchronized getNewInstance
16:03:34.800 [main] WARN org.grobid.core.main.GrobidHomeFinder - No Grobid property was provided. Attempting to find Grobid home in the current directory...
16:03:34.800 [main] WARN org.grobid.core.main.GrobidHomeFinder - ***************************************************************
16:03:34.800 [main] WARN org.grobid.core.main.GrobidHomeFinder - *** USING GROBID HOME: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home
16:03:34.800 [main] WARN org.grobid.core.main.GrobidHomeFinder - ***************************************************************
16:03:34.800 [main] DEBUG org.grobid.core.utilities.GrobidProperties - loading grobid config yaml
16:03:34.800 [main] WARN org.grobid.core.main.GrobidHomeFinder - Grobid config file location was not explicitly set via 'org.grobid.config' system variable, defaulting to: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/config/grobid.yaml
16:03:34.957 [main] DEBUG org.grobid.core.utilities.GrobidProperties - loading pdfalto command path
16:03:34.959 [main] DEBUG org.grobid.core.utilities.GrobidProperties - pdfalto executable home directory set to /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/pdfalto/lin-64
16:03:34.965 [main] INFO org.grobid.core.main.LibraryLoader - Loading external native sequence labelling library
16:03:34.965 [main] DEBUG org.grobid.core.main.LibraryLoader - /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/lib/lin-64
16:03:34.969 [main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
16:03:34.970 [main] INFO org.grobid.core.main.LibraryLoader - Native library for sequence labelling loaded
16:03:34.971 [main] DEBUG org.grobid.core.lexicon.Lexicon - Get new instance of Lexicon
16:03:34.971 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating dictionary
16:03:34.971 [main] INFO org.grobid.core.lexicon.Lexicon - End of Initialization of dictionary
16:03:34.971 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating names
16:03:34.971 [main] INFO org.grobid.core.lexicon.Lexicon - End of initialization of names
16:03:35.253 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating country codes
16:03:35.253 [main] INFO org.grobid.core.lexicon.Lexicon - End of initialization of country codes
sourceTEIPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/tei
sourceRawPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/raw
trainingOutputPath: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/tmp/datacat-segmenter5585300605279619429.train
evalOutputPath: null
82 tei files
16:03:35.346 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k97781698.training.monograph.tei.xml
Total data found between CRF and TEI files 1322 from total 1785 examples.
16:03:35.692 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k97775611.training.monograph.tei.xml
Total data found between CRF and TEI files 1342 from total 1628 examples.
16:03:35.851 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k97780940.training.monograph.tei.xml
Total data found between CRF and TEI files 67 from total 97 examples.
[...]
16:17:33.431 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777635w.training.monograph.tei.xml
Total data found between CRF and TEI files 677 from total 733 examples.
16:17:33.440 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777373j.training.monograph.tei.xml
Total data found between CRF and TEI files 779 from total 915 examples.
16:17:33.465 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777815t.training.monograph.tei.xml
Total data found between CRF and TEI files 424 from total 593 examples.
16:17:33.488 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777743s.training.monograph.tei.xml
Total data found between CRF and TEI files 1183 from total 2130 examples.
16:17:33.740 [main] DEBUG org.grobid.core.utilities.GrobidProperties - No configuration parameter defined for DeLFT engine for model datacat-segmenter
16:17:33.741 [main] INFO org.grobid.core.jni.WapitiModel - Loading model: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti (size: 3424408)
Labeling took: 310 ms

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<annex>              84.5         21.43        25           23.08        12     
<back>               92.25        0            0            0            7      
<body>               79.84        31.58        31.58        31.58        19     
<front>              65.12        28           20.59        23.73        34     

all (micro avg.)     80.43        26.23        22.22        24.06        72     
all (macro avg.)     80.43        20.25        19.29        19.6         72     

===== Instance-level results =====

Total expected instances:   18
Correct instances:          2
Instance-level recall:      11.11

The results are not satisfactory with this amount of data.
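
To make the two "all" rows easier to read, here is a minimal sketch that recomputes them from the per-label values in the table above. It is not GROBID's evaluation code; the function names are made up for illustration, and the true-positive counts are recovered by rounding recall × support, which is an assumption that happens to work for these numbers.

```python
# Per-label scores copied from the field-level table above:
# label -> (precision, recall, f1, support)
scores = {
    "<annex>": (21.43, 25.0, 23.08, 12),
    "<back>": (0.0, 0.0, 0.0, 7),
    "<body>": (31.58, 31.58, 31.58, 19),
    "<front>": (28.0, 20.59, 23.73, 34),
}


def macro_average(column):
    """Unweighted mean of one column (0=precision, 1=recall, 2=f1)."""
    values = [row[column] for row in scores.values()]
    return round(sum(values) / len(values), 2)


def micro_recall():
    """Recall pooled over all fields, so large labels weigh more."""
    total_support = sum(row[3] for row in scores.values())
    # recall * support recovers the number of correctly labelled fields.
    true_positives = sum(round(row[1] / 100 * row[3]) for row in scores.values())
    return round(100 * true_positives / total_support, 2)


print(macro_average(0), macro_average(1), macro_average(2))  # 20.25 19.29 19.6
print(micro_recall())  # 22.22
```

The macro average gives every label the same weight, so the zero scores on back pull it down even though back has only 7 expected fields.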

HugoSchtr commented 2 years ago

Let's try another sample, with more and different documents:

> Task :train_datacat-segmenter
16:23:38.410 [main] DEBUG org.grobid.core.utilities.GrobidProperties - synchronized getNewInstance
16:23:38.414 [main] WARN org.grobid.core.main.GrobidHomeFinder - No Grobid property was provided. Attempting to find Grobid home in the current directory...
16:23:38.414 [main] WARN org.grobid.core.main.GrobidHomeFinder - ***************************************************************
16:23:38.414 [main] WARN org.grobid.core.main.GrobidHomeFinder - *** USING GROBID HOME: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home
16:23:38.414 [main] WARN org.grobid.core.main.GrobidHomeFinder - ***************************************************************
16:23:38.414 [main] DEBUG org.grobid.core.utilities.GrobidProperties - loading grobid config yaml
16:23:38.414 [main] WARN org.grobid.core.main.GrobidHomeFinder - Grobid config file location was not explicitly set via 'org.grobid.config' system variable, defaulting to: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/config/grobid.yaml
16:23:38.545 [main] DEBUG org.grobid.core.utilities.GrobidProperties - loading pdfalto command path
16:23:38.546 [main] DEBUG org.grobid.core.utilities.GrobidProperties - pdfalto executable home directory set to /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/pdfalto/lin-64
16:23:38.550 [main] INFO org.grobid.core.main.LibraryLoader - Loading external native sequence labelling library
16:23:38.550 [main] DEBUG org.grobid.core.main.LibraryLoader - /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/lib/lin-64
16:23:38.553 [main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
16:23:38.553 [main] INFO org.grobid.core.main.LibraryLoader - Native library for sequence labelling loaded
16:23:38.554 [main] DEBUG org.grobid.core.lexicon.Lexicon - Get new instance of Lexicon
16:23:38.554 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating dictionary
16:23:38.554 [main] INFO org.grobid.core.lexicon.Lexicon - End of Initialization of dictionary
16:23:38.554 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating names
16:23:38.554 [main] INFO org.grobid.core.lexicon.Lexicon - End of initialization of names
16:23:38.791 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating country codes
16:23:38.791 [main] INFO org.grobid.core.lexicon.Lexicon - End of initialization of country codes
sourceTEIPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/tei
sourceRawPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/raw
trainingOutputPath: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/tmp/datacat-segmenter9332586470912852627.train
evalOutputPath: null
130 tei files
16:23:38.867 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9780625m.training.datacat.tei.xml
Total data found between CRF and TEI files 666 from total 875 examples.
16:23:39.043 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k97781557.training.datacat.tei.xml
Total data found between CRF and TEI files 136 from total 227 examples.
16:23:39.060 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-cb40908886q.training.datacat.tei.xml
Total data found between CRF and TEI files 421 from total 653 examples.
16:23:39.133 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k12438933.training.datacat.tei.xml
Total data found between CRF and TEI files 341 from total 432 examples.
16:23:39.149 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9779220p.training.datacat.tei.xml
Total data found between CRF and TEI files 2157 from total 3548 examples.
[...]
16:49:07.156 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777796x.training.datacat.tei.xml
Total data found between CRF and TEI files 988 from total 1403 examples.
16:49:07.232 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777643f.training.datacat.tei.xml
Total data found between CRF and TEI files 703 from total 1302 examples.
16:49:07.337 [main] DEBUG org.grobid.core.utilities.GrobidProperties - No configuration parameter defined for DeLFT engine for model datacat-segmenter
16:49:07.337 [main] INFO org.grobid.core.jni.WapitiModel - Loading model: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti (size: 9273151)
Labeling took: 1232 ms

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<annex>              98.25        0            0            0            1      
<back>               92.98        0            0            0            10     
<body>               78.36        48.48        44.44        46.38        36     
<front>              67.84        42.42        28           33.73        50     

all (micro avg.)     84.36        42.86        30.93        35.93        97     
all (macro avg.)     84.36        22.73        18.11        20.03        97     

===== Instance-level results =====

Total expected instances:   33
Correct instances:          11
Instance-level recall:      33.33

Results are better than in the previous training run, and better than those of earlier models trained with more complex labels and more documents. This is encouraging.
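
For a quick side-by-side of the two runs, the sketch below recomputes the micro-averaged F1 (harmonic mean of precision and recall) and the instance-level recall (correct instances over expected instances) from the figures reported in the two result blocks. The helper names are illustrative only and are not part of GROBID.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return round(2 * precision * recall / (precision + recall), 2)


def instance_recall(correct, expected):
    """Share of expected instances counted as correct, in percent."""
    return round(100 * correct / expected, 2)


# (micro precision, micro recall, correct instances, expected instances)
first_run = (26.23, 22.22, 2, 18)
second_run = (42.86, 30.93, 11, 33)

for name, (p, r, correct, expected) in [("first", first_run), ("second", second_run)]:
    print(name, f1(p, r), instance_recall(correct, expected))
# first 24.06 11.11
# second 35.93 33.33
```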

HugoSchtr commented 2 years ago

With the high-level segmentation model now debugged, training works as intended. Scores are much better, and although they are not perfect yet, extraction is already satisfactory.

11:31:42.396 [main] DEBUG org.grobid.core.utilities.GrobidProperties - synchronized getNewInstance
11:31:42.401 [main] WARN org.grobid.core.main.GrobidHomeFinder - No Grobid property was provided. Attempting to find Grobid home in the current directory...
11:31:42.401 [main] WARN org.grobid.core.main.GrobidHomeFinder - ***************************************************************
11:31:42.401 [main] WARN org.grobid.core.main.GrobidHomeFinder - *** USING GROBID HOME: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home
11:31:42.401 [main] WARN org.grobid.core.main.GrobidHomeFinder - ***************************************************************
11:31:42.401 [main] DEBUG org.grobid.core.utilities.GrobidProperties - loading grobid config yaml
11:31:42.401 [main] WARN org.grobid.core.main.GrobidHomeFinder - Grobid config file location was not explicitly set via 'org.grobid.config' system variable, defaulting to: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/config/grobid.yaml
11:31:42.544 [main] DEBUG org.grobid.core.utilities.GrobidProperties - loading pdfalto command path
11:31:42.545 [main] DEBUG org.grobid.core.utilities.GrobidProperties - pdfalto executable home directory set to /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/pdfalto/lin-64
11:31:42.550 [main] INFO org.grobid.core.main.LibraryLoader - Loading external native sequence labelling library
11:31:42.550 [main] DEBUG org.grobid.core.main.LibraryLoader - /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/lib/lin-64
11:31:42.553 [main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
11:31:42.553 [main] INFO org.grobid.core.main.LibraryLoader - Native library for sequence labelling loaded
11:31:42.554 [main] DEBUG org.grobid.core.lexicon.Lexicon - Get new instance of Lexicon
11:31:42.554 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating dictionary
11:31:42.554 [main] INFO org.grobid.core.lexicon.Lexicon - End of Initialization of dictionary
11:31:42.554 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating names
11:31:42.554 [main] INFO org.grobid.core.lexicon.Lexicon - End of initialization of names
11:31:42.805 [main] INFO org.grobid.core.lexicon.Lexicon - Initiating country codes
11:31:42.805 [main] INFO org.grobid.core.lexicon.Lexicon - End of initialization of country codes
sourceTEIPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/tei
sourceRawPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/raw
trainingOutputPath: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/tmp/datacat-segmenter11504674491793600357.train
evalOutputPath: null
363 tei files
11:31:42.940 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9779206d.training.datacat.tei.xml
11:31:43.000 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9778458h.training.datacat.tei.xml
11:31:43.007 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9781821z.training.datacat.tei.xml
11:31:43.018 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9779365m.training.datacat.tei.xml
11:31:43.020 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9780625m.training.datacat.tei.xml
[...]
11:31:46.320 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9780628v.training.datacat.tei.xml
    epsilon: 1.0E-7
    window: 50
    nb max iterations: 2000
    nb threads: 16
Model for datacat-segmenter created in 6530131 ms
sourceTEIPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/evaluation/tei
sourceRawPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/evaluation/raw
trainingOutputPath: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/tmp/datacat-segmenter6461545812222573652.test
evalOutputPath: null
73 tei files
13:20:33.071 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777971s.training.datacat.tei.xml
13:20:33.082 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777565p.training.datacat.tei.xml
13:20:33.086 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777416v.training.datacat.tei.xml
13:20:33.114 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777569b.training.datacat.tei.xml
13:20:33.119 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777635w.training.datacat.tei.xml
13:20:33.125 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777738g.training.datacat.tei.xml
13:20:33.136 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777980r.training.datacat.tei.xml
13:20:33.139 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777764z.training.datacat.tei.xml
13:20:33.151 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777648h.training.datacat.tei.xml
13:20:33.154 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 2148-bpt6k97784009.training.datacat.tei.xml
13:20:33.155 [main] ERROR org.grobid.trainer.AbstractTrainer - The raw file does not exist: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/evaluation/raw/2148-bpt6k97784009.training.datacat
13:20:33.155 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777351z.training.datacat.tei.xml
[...]
13:20:33.730 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777387k.training.datacat.tei.xml
13:20:33.737 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9777385r.training.datacat.tei.xml
13:20:33.750 [main] DEBUG org.grobid.core.utilities.GrobidProperties - No configuration parameter defined for DeLFT engine for model datacat-segmenter
13:20:33.750 [main] INFO org.grobid.core.jni.WapitiModel - Loading model: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti (size: 37965878)
Labeling took: 5812 ms

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<annex>              96.52        58.33        63.64        60.87        22     
<back>               93.04        55.17        41.03        47.06        39     
<body>               87.81        58.11        57.33        57.72        75     
<front>              68.86        44.44        32.21        37.35        149    

all (micro avg.)     86.56        51.49        42.46        46.54        285    
all (macro avg.)     86.56        54.01        48.55        50.75        285    

===== Instance-level results =====

Total expected instances:   72
Correct instances:          15
Instance-level recall:      20.83

HugoSchtr commented 2 years ago

163 files were used, all from the same collection (bienaimé-feuardent), with a random split of the corpus to evaluate the model.
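
As an illustration of what a random split implies, here is a minimal sketch of a seeded hold-out split. It is not how GROBID performs the split; the file names, the `eval_ratio` value and the `split_corpus` helper are all hypothetical.

```python
import random


def split_corpus(files, eval_ratio, seed):
    """Shuffle file names with a fixed seed and hold out a share for evaluation."""
    rng = random.Random(seed)
    shuffled = sorted(files)  # sort first so the result depends only on the seed
    rng.shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical file names standing in for the 163 TEI files.
files = [f"doc-{i:03d}.training.datacat.tei.xml" for i in range(163)]
train, evaluation = split_corpus(files, eval_ratio=0.1, seed=1)
print(len(train), len(evaluation))  # 147 16
```

Fixing the seed makes a run reproducible. Changing it draws a different evaluation set, which matters when the evaluation set is this small.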


[...]
14:02:54.391 [main] INFO org.grobid.core.lexicon.Lexicon - End of initialization of country codes
sourceTEIPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/tei
sourceRawPathLabel: /home/hscheith/dev/grobid/grobid-datacat/resources/dataset/datacat-segmenter/corpus/raw
trainingOutputPath: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/tmp/datacat-segmenter6091050958366937291.train
evalOutputPath: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/tmp/datacat-segmenter8384666151559798433.test
163 tei files
14:02:54.474 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9780625m.training.datacat.tei.xml
[...]
14:02:56.380 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9780628v.training.datacat.tei.xml
        epsilon: 1.0E-7
        window: 50
        nb max iterations: 2000
        nb threads: 16
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    152
    nb labels:   9
    nb blocks:   1534981
    nb features: 13814901
* Train the model with l-bfgs
  [   1] obj=280716.19  act=3363480  err= 7.65%/100.00% time=7.37s/7.37s
  [...]
  [ 247] obj=310.02     act=3214     err= 0.00%/ 0.66% time=4.97s/1525.94s
* Save the model
* Done
14:28:33.621 [main] DEBUG org.grobid.core.utilities.GrobidProperties - No configuration parameter defined for DeLFT engine for model datacat-segmenter
14:28:33.621 [main] INFO org.grobid.core.jni.WapitiModel - Loading model: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti (size: 24164096)
[Wapiti] Loading model: "/home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti"
Model path: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti
Labeling took: 587 ms

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<back>               88.89        0            0            0            1      
<body>               92.59        87.5         87.5         87.5         8      
<front>              70.37        55.56        55.56        55.56        9      

all (micro avg.)     83.95        63.16        66.67        64.86        18     
all (macro avg.)     83.95        47.69        47.69        47.69        18     

===== Instance-level results =====

Total expected instances:   8
Correct instances:          5
Instance-level recall:      62.5

Split, training and evaluation for datacat-segmenter model is realized in 1541156 ms

Observation: I believe there is not enough data for each label.
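
One way to see the effect of the small supports: with n expected fields for a label, recall can only move in steps of 100/n. The sketch below computes that step for the supports in the table above; the `recall_step` helper is only illustrative.

```python
from fractions import Fraction

# Support per label, copied from the evaluation table above.
support = {"<back>": 1, "<body>": 8, "<front>": 9}


def recall_step(n):
    """Smallest possible change in recall (in percent) with n expected fields."""
    return round(float(Fraction(100, n)), 2)


for label, n in support.items():
    print(label, recall_step(n))
# <back> 100.0
# <body> 12.5
# <front> 11.11
```

With a support of 1, recall on back can only be 0 or 100, and the 87.5 on body corresponds to 7 correct fields out of 8, so a single field changes the score by 12.5 points.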

Results vary depending on the evaluation set. For example, here is another model trained on the same dataset, but with a different split:

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<annex>              97.03        0            0            0            1      
<back>               89.11        33.33        22.22        26.67        9      
<body>               92.08        66.67        76.92        71.43        13     
<front>              65.35        31.25        17.24        22.22        29     

all (micro avg.)     85.89        43.59        32.69        37.36        52     
all (macro avg.)     85.89        32.81        29.1         30.08        52

HugoSchtr commented 2 years ago

A new training run with the Bourgey corpus gives the following scores (436 TEI files, with 95% used for training and 5% for evaluation).

[Figure: répartition_bienaime_bourgey (distribution, Bienaimé and Bourgey)]

436 tei files
13:54:43.836 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9779206d.training.datacat.tei.xml
[...]
13:54:47.547 [main] INFO org.grobid.trainer.AbstractTrainer - Processing: 12148-bpt6k9780628v.training.datacat.tei.xml
        epsilon: 1.0E-7
        window: 50
        nb max iterations: 1000
        nb threads: 16
* Load patterns
* Load training data
* Initialize the model
* Summary
    nb train:    405
    nb labels:   9
    nb blocks:   2501866
    nb features: 22516866
* Train the model with l-bfgs
  [   1] obj=1456315.19 act=5464835  err=22.25%/100.00% time=11.04s/11.04s
  [...]
  [ 715] obj=804.87     act=4649     err= 0.00%/ 0.99% time=8.62s/6631.52s
* Save the model
* Done
15:45:43.330 [main] DEBUG org.grobid.core.utilities.GrobidProperties - No configuration parameter defined for DeLFT engine for model datacat-segmenter
15:45:43.330 [main] INFO org.grobid.core.jni.WapitiModel - Loading model: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti (size: 41682587)
[Wapiti] Loading model: "/home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti"
Model path: /home/hscheith/dev/grobid/grobid-datacat/../grobid-home/models/datacat-segmenter/model.wapiti
Labeling took: 2058 ms

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<annex>              98.33        87.5         87.5         87.5         8      
<back>               97.5         88.89        80           84.21        10     
<body>               85.83        67.86        70.37        69.09        27     
<front>              76.67        60.53        63.89        62.16        36     

all (micro avg.)     89.58        68.67        70.37        69.51        81     
all (macro avg.)     89.58        76.19        75.44        75.74        81     

===== Instance-level results =====

Total expected instances:   27
Correct instances:          15
Instance-level recall:      55.56

Let's try to balance the training corpus and observe the results.