DataCatalogue / grobid-datacat

A GROBID module for extracting and automatically structuring sale catalogues into TEI-XML format.

High level segmentation model - Line parsing - INHA #2

Closed by HugoSchtr 2 years ago

HugoSchtr commented 2 years ago

This model is trained only on catalogues held by the INHA. The documents are therefore more alike, and less heterogeneous, than those used for the mixed model.

HugoSchtr commented 2 years ago

1st training

121 tei files
Total data found between CRF and TEI files 223 from total 235 examples.
Total data found between CRF and TEI files 364 from total 429 examples.
Total data found between CRF and TEI files 297 from total 341 examples.
Total data found between CRF and TEI files 362 from total 376 examples.
Total data found between CRF and TEI files 80 from total 85 examples.
Total data found between CRF and TEI files 654 from total 773 examples.
Total data found between CRF and TEI files 309 from total 350 examples.
Total data found between CRF and TEI files 280 from total 310 examples.
Total data found between CRF and TEI files 198 from total 213 examples.
Total data found between CRF and TEI files 1637 from total 1801 examples.
Total data found between CRF and TEI files 297 from total 393 examples.
Total data found between CRF and TEI files 184 from total 195 examples.
Total data found between CRF and TEI files 346 from total 387 examples.
Total data found between CRF and TEI files 142 from total 156 examples.
Total data found between CRF and TEI files 93 from total 108 examples.
Total data found between CRF and TEI files 190 from total 240 examples.
Total data found between CRF and TEI files 89 from total 106 examples.
Total data found between CRF and TEI files 218 from total 233 examples.
Total data found between CRF and TEI files 267 from total 329 examples.
Total data found between CRF and TEI files 149 from total 161 examples.
Total data found between CRF and TEI files 212 from total 243 examples.
Total data found between CRF and TEI files 241 from total 273 examples.
Total data found between CRF and TEI files 239 from total 287 examples.
Total data found between CRF and TEI files 121 from total 129 examples.
Total data found between CRF and TEI files 300 from total 330 examples.
Total data found between CRF and TEI files 154 from total 175 examples.
Total data found between CRF and TEI files 149 from total 162 examples.
Total data found between CRF and TEI files 230 from total 243 examples.
Total data found between CRF and TEI files 201 from total 217 examples.
Total data found between CRF and TEI files 223 from total 258 examples.
Total data found between CRF and TEI files 1038 from total 1182 examples.
Total data found between CRF and TEI files 411 from total 462 examples.
Total data found between CRF and TEI files 198 from total 220 examples.
Total data found between CRF and TEI files 320 from total 387 examples.
Total data found between CRF and TEI files 532 from total 692 examples.
Total data found between CRF and TEI files 45 from total 146 examples.
Total data found between CRF and TEI files 375 from total 442 examples.
Total data found between CRF and TEI files 253 from total 280 examples.
Total data found between CRF and TEI files 405 from total 431 examples.
Total data found between CRF and TEI files 230 from total 245 examples.
Total data found between CRF and TEI files 157 from total 178 examples.
Total data found between CRF and TEI files 238 from total 317 examples.
Total data found between CRF and TEI files 387 from total 416 examples.
Total data found between CRF and TEI files 1675 from total 1801 examples.
Total data found between CRF and TEI files 305 from total 319 examples.
Total data found between CRF and TEI files 1187 from total 1701 examples.
Total data found between CRF and TEI files 222 from total 249 examples.
Total data found between CRF and TEI files 126 from total 145 examples.
Total data found between CRF and TEI files 980 from total 1054 examples.
Total data found between CRF and TEI files 312 from total 477 examples.
Total data found between CRF and TEI files 146 from total 171 examples.
Total data found between CRF and TEI files 1859 from total 2376 examples.
Total data found between CRF and TEI files 317 from total 387 examples.
Total data found between CRF and TEI files 241 from total 262 examples.
Total data found between CRF and TEI files 228 from total 234 examples.
Total data found between CRF and TEI files 21 from total 27 examples.
Total data found between CRF and TEI files 737 from total 900 examples.
Total data found between CRF and TEI files 720 from total 909 examples.
Total data found between CRF and TEI files 305 from total 323 examples.
Total data found between CRF and TEI files 209 from total 238 examples.
Total data found between CRF and TEI files 189 from total 205 examples.
Total data found between CRF and TEI files 55 from total 57 examples.
Total data found between CRF and TEI files 140 from total 153 examples.
Total data found between CRF and TEI files 301 from total 337 examples.
Total data found between CRF and TEI files 168 from total 177 examples.
Total data found between CRF and TEI files 475 from total 510 examples.
Total data found between CRF and TEI files 430 from total 511 examples.
Total data found between CRF and TEI files 386 from total 468 examples.
Total data found between CRF and TEI files 204 from total 220 examples.
Total data found between CRF and TEI files 204 from total 238 examples.
Total data found between CRF and TEI files 1314 from total 1525 examples.
Total data found between CRF and TEI files 646 from total 897 examples.
Total data found between CRF and TEI files 86 from total 89 examples.
Total data found between CRF and TEI files 1432 from total 1726 examples.
Total data found between CRF and TEI files 214 from total 247 examples.
Total data found between CRF and TEI files 19 from total 22 examples.
Total data found between CRF and TEI files 341 from total 394 examples.
Total data found between CRF and TEI files 207 from total 227 examples.
Total data found between CRF and TEI files 330 from total 359 examples.
Total data found between CRF and TEI files 283 from total 303 examples.
Total data found between CRF and TEI files 698 from total 802 examples.
Total data found between CRF and TEI files 432 from total 542 examples.
Total data found between CRF and TEI files 47 from total 52 examples.
Total data found between CRF and TEI files 202 from total 218 examples.
Total data found between CRF and TEI files 448 from total 598 examples.
Total data found between CRF and TEI files 113 from total 144 examples.
Total data found between CRF and TEI files 4673 from total 5075 examples.
Total data found between CRF and TEI files 450 from total 562 examples.
Total data found between CRF and TEI files 131 from total 152 examples.
Total data found between CRF and TEI files 228 from total 247 examples.
Total data found between CRF and TEI files 540 from total 568 examples.
Total data found between CRF and TEI files 519 from total 540 examples.
Total data found between CRF and TEI files 388 from total 442 examples.
Total data found between CRF and TEI files 295 from total 306 examples.
Total data found between CRF and TEI files 215 from total 224 examples.
Total data found between CRF and TEI files 182 from total 205 examples.
Total data found between CRF and TEI files 359 from total 393 examples.
Total data found between CRF and TEI files 569 from total 746 examples.
Total data found between CRF and TEI files 87 from total 95 examples.
Total data found between CRF and TEI files 40 from total 45 examples.
Total data found between CRF and TEI files 124 from total 141 examples.
Total data found between CRF and TEI files 335 from total 403 examples.
Total data found between CRF and TEI files 690 from total 805 examples.
Total data found between CRF and TEI files 1486 from total 1658 examples.
Total data found between CRF and TEI files 29 from total 35 examples.
Total data found between CRF and TEI files 289 from total 370 examples.
Total data found between CRF and TEI files 71 from total 79 examples.
Total data found between CRF and TEI files 4432 from total 4975 examples.
Total data found between CRF and TEI files 453 from total 535 examples.
Total data found between CRF and TEI files 118 from total 165 examples.
Total data found between CRF and TEI files 191 from total 202 examples.
Total data found between CRF and TEI files 164 from total 195 examples.
Total data found between CRF and TEI files 155 from total 175 examples.
Total data found between CRF and TEI files 1330 from total 1638 examples.
Total data found between CRF and TEI files 87 from total 106 examples.
Total data found between CRF and TEI files 238 from total 252 examples.
Total data found between CRF and TEI files 1196 from total 1325 examples.
Total data found between CRF and TEI files 595 from total 751 examples.
Total data found between CRF and TEI files 163 from total 191 examples.
Total data found between CRF and TEI files 65 from total 71 examples.
    epsilon: 1.0E-7
    window: 50
    nb max iterations: 1500
    nb threads: 16
Model for monograph created in 548705 ms
sourceTEIPathLabel: /home/hscheith/dev/grobid/grobid-trainer/../grobid-home/../grobid-trainer/resources/dataset/monograph/evaluation/tei
sourceRawPathLabel: /home/hscheith/dev/grobid/grobid-trainer/../grobid-home/../grobid-trainer/resources/dataset/monograph/evaluation/raw
trainingOutputPath: /home/hscheith/dev/grobid/grobid-trainer/../grobid-home/tmp/monograph2848827582332905784.test
evalOutputPath: null
23 tei files
Total data found between CRF and TEI files 661 from total 828 examples.
Total data found between CRF and TEI files 568 from total 597 examples.
Total data found between CRF and TEI files 214 from total 235 examples.
Total data found between CRF and TEI files 176 from total 186 examples.
Total data found between CRF and TEI files 367 from total 423 examples.
Total data found between CRF and TEI files 383 from total 504 examples.
Total data found between CRF and TEI files 262 from total 338 examples.
Total data found between CRF and TEI files 304 from total 365 examples.
Total data found between CRF and TEI files 168 from total 196 examples.
Total data found between CRF and TEI files 167 from total 192 examples.
Total data found between CRF and TEI files 81 from total 94 examples.
Total data found between CRF and TEI files 104 from total 116 examples.
Total data found between CRF and TEI files 302 from total 335 examples.
Total data found between CRF and TEI files 25 from total 26 examples.
Total data found between CRF and TEI files 187 from total 338 examples.
Total data found between CRF and TEI files 0 from total 17 examples.
Total data found between CRF and TEI files 114 from total 122 examples.
Total data found between CRF and TEI files 156 from total 175 examples.
Total data found between CRF and TEI files 180 from total 193 examples.
Total data found between CRF and TEI files 287 from total 330 examples.
Total data found between CRF and TEI files 255 from total 269 examples.
Total data found between CRF and TEI files 68 from total 83 examples.
Labeling took: 168 ms

===== Field-level results =====

label                accuracy     precision    recall       f1           support

<back>               96.03        0            0            0            4      
<cover>              86.09        50           33.33        40           21     
<preface>            87.42        55           52.38        53.66        21     
<title>              84.11        47.62        43.48        45.45        23     
<unit>               81.46        38.1         34.78        36.36        23     

all (micro avg.)     87.02        46.15        39.13        42.35        92     
all (macro avg.)     87.02        38.14        32.8         35.1         92     

===== Instance-level results =====

Total expected instances:   21
Correct instances:          1
Instance-level recall:      4.76
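
As a sanity check on the figures above, the sketch below recomputes the per-label F1 scores, the macro averages, and the instance-level recall from the reported precision and recall values. It assumes the standard definitions (F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over labels); the evaluation code itself is not shown in this thread, so small rounding differences against the log are expected.

```python
# Precision and recall (in percent) copied from the field-level table above.
rows = {
    "<back>": (0.0, 0.0),
    "<cover>": (50.0, 33.33),
    "<preface>": (55.0, 52.38),
    "<title>": (47.62, 43.48),
    "<unit>": (38.1, 34.78),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-label F1, to compare with the f1 column of the table.
per_label_f1 = {label: round(f1(p, r), 2) for label, (p, r) in rows.items()}

# Macro averages: unweighted means over the five labels.
macro_precision = sum(p for p, _ in rows.values()) / len(rows)
macro_recall = sum(r for _, r in rows.values()) / len(rows)
macro_f1 = sum(f1(p, r) for p, r in rows.values()) / len(rows)

# Instance-level recall: correct instances over expected instances.
instance_recall = 100 * 1 / 21

for label, score in per_label_f1.items():
    print(f"{label:<10} f1 = {score}")
print(f"macro precision = {macro_precision:.2f}")
print(f"macro recall    = {macro_recall:.2f}")
print(f"macro f1        = {macro_f1:.2f}")
print(f"instance recall = {instance_recall:.2f}")
```

Under these assumptions the recomputed values agree with the table to within rounding. That suggests the macro F1 here is the mean of the per-label F1 scores rather than the F1 of the macro precision and recall.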

See: inha_byLines_1st_training.txt
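
The "Total data found between CRF and TEI files" lines in the log give, per file, how many examples were matched out of the total. A minimal sketch for scanning such a log and flagging files with a low match ratio follows. The function names and the 0.5 threshold are illustrative assumptions, not part of GROBID, and because the log does not name the files, entries are identified only by their position in the log.

```python
import re

# Pattern for the per-file alignment lines as they appear in the log above.
LINE_RE = re.compile(
    r"Total data found between CRF and TEI files (\d+) from total (\d+) examples\."
)


def alignment_ratios(log_text):
    """Return (found, total, ratio) for each alignment line in the log."""
    results = []
    for match in LINE_RE.finditer(log_text):
        found, total = int(match.group(1)), int(match.group(2))
        ratio = found / total if total else 0.0
        results.append((found, total, ratio))
    return results


def flag_low_alignment(results, threshold=0.5):
    """Return (index, found, total) for entries whose ratio is below threshold.

    The threshold is an arbitrary choice for illustration.
    """
    return [
        (index, found, total)
        for index, (found, total, ratio) in enumerate(results)
        if ratio < threshold
    ]


# Three lines taken from the log above.
sample = """\
Total data found between CRF and TEI files 223 from total 235 examples.
Total data found between CRF and TEI files 45 from total 146 examples.
Total data found between CRF and TEI files 0 from total 17 examples.
"""

ratios = alignment_ratios(sample)
low = flag_low_alignment(ratios)
for index, found, total in low:
    print(f"entry {index}: {found}/{total} examples matched")
```

Entries such as "45 from total 146" or "0 from total 17" may point to a mismatch between a raw feature file and its TEI annotation, and could be worth inspecting before retraining.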

HugoSchtr commented 2 years ago

Since we want the model to generalize, we are abandoning the idea of training a separate model for each collection.