DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Evaluate Galaxy tool for HTR/OCR #95

Open llivermore opened 2 years ago

llivermore commented 2 years ago

The Teklia handwritten text recognition (HTR) / optical character recognition (OCR) worker tool (https://github.com/DiSSCo/SDR/issues/39) requires further testing at scale (e.g., the full range of preparation types and larger test datasets) and, potentially, improvements to the model.

llivermore commented 2 years ago

@La0 I have run the tool on a 50-specimen dataset of images the model has not seen. The text line detection is reasonable: there are a few merged lines, missed lines and false positives (usually barcodes). The issue is the HTR output, which returns barely any text for most of the identified lines.

Could you take a look at the HTR output issue? Is this format reasonable for you to get into Arkindex/Label Studio for retraining the DLA worker? Would you recommend retraining on any of the ones with issues, or a more focused approach?

sdr_output -50-pinned.zip

La0 commented 2 years ago

Thanks for generating the test dataset, we'll look into it and report back here.

La0 commented 2 years ago

We just published a new release for the HTR tool, version 0.1.2.

This version was tested on your images & DLA results, and provides better transcriptions.

Here is the diff between your results and ours on the file split_file_000000:

53c53
<                 "confidence": 0.7633
---
>                 "confidence": 0.3321
77c77,80
<             "transcription": null
---
>             "transcription": {
>                 "text": "No .",
>                 "confidence": 0.4849
>             }
100c103,106
<             "transcription": null
---
>             "transcription": {
>                 "text": "B . M .  TYPE",
>                 "confidence": 0.5027
>             }
124,125c130,131
<                 "text": "2",
<                 "confidence": 0.6471
---
>                 "text": "17  .  B .  165",
>                 "confidence": 0.2085
172c178,181
<             "transcription": null
---
>             "transcription": {
>                 "text": "Topan Train",
>                 "confidence": 0.1865
>             }
195c204,207
<             "transcription": null
---
>             "transcription": {
>                 "text": "Megacteria",
>                 "confidence": 0.3122
>             }
218c230,233
<             "transcription": null
---
>             "transcription": {
>                 "text": "SANDWICH ISL .",
>                 "confidence": 0.3399
>             }
241c256,259
<             "transcription": null
---
>             "transcription": {
>                 "text": "Exchanged",
>                 "confidence": 0.4168
>             }
264c282,285
<             "transcription": null
---
>             "transcription": {
>                 "text": "E . W . H .  Holdwoodi",
>                 "confidence": 0.3635
>             }
287c308,311
<             "transcription": null
---
>             "transcription": {
>                 "text": "Lygia ery ,  1949",
>                 "confidence": 0.2967
>             }
310c334,337
<             "transcription": null
---
>             "transcription": {
>                 "text": "Sandwfich )  Is .   ( Fabr .  )",
>                 "confidence": 0.3169
>             }
333c360,363
<             "transcription": null
---
>             "transcription": {
>                 "text": "The species in identifical",
>                 "confidence": 0.4534
>             }
357,358c387,388
<                 "text": "Gratidia",
<                 "confidence": 0.5603
---
>                 "text": "with the introduced  .",
>                 "confidence": 0.4199
383,384c413,414
<                 "text": "and",
<                 "confidence": 0.6711
---
>                 "text": "( Taeneipennis degeer )",
>                 "confidence": 0.4459
408c438,441
<             "transcription": null
---
>             "transcription": {
>                 "text": "NHMUK",
>                 "confidence": 0.6046
>             }
431c464,467
<             "transcription": null
---
>             "transcription": {
>                 "text": "_",
>                 "confidence": 0.2482
>             }
455,456c491,492
<                 "text": "Acts which",
<                 "confidence": 0.6332
---
>                 "text": "M . A . Cieftinch SS",
>                 "confidence": 0.2674
481,482c517,518
<                 "text": "01254",
<                 "confidence": 0.4407
---
>                 "text": "010265364",
>                 "confidence": 0.3713
507,508c543,544
<                 "text": "=",
<                 "confidence": 0.7559
---
>                 "text": "- TE",
>                 "confidence": 0.2042

As you can see, many lines that previously had a null transcription now have text.
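To put a number on this across a whole output set, something like the following could work. It is only a minimal sketch: it assumes each output file is JSON in which every detected line carries a "transcription" key that is either null or an object with "text" and "confidence", as in the diff above. The nesting around those keys is not shown in this thread, so the code walks the whole tree instead of assuming a layout. The function names and the "lines" key in the example are hypothetical.

```python
import json
from typing import Any, Iterator


def iter_transcriptions(node: Any) -> Iterator[Any]:
    """Yield the value of every "transcription" key anywhere in a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "transcription":
                yield value
            else:
                yield from iter_transcriptions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_transcriptions(item)


def summarize(doc: Any) -> dict:
    """Count null vs. non-empty transcriptions and average their confidence."""
    values = list(iter_transcriptions(doc))
    # A line counts as transcribed only if it has a non-empty "text" field.
    filled = [v for v in values if isinstance(v, dict) and v.get("text")]
    confidences = [v["confidence"] for v in filled if "confidence" in v]
    mean = round(sum(confidences) / len(confidences), 4) if confidences else None
    return {
        "lines": len(values),
        "null": sum(v is None for v in values),
        "transcribed": len(filled),
        "mean_confidence": mean,
    }


if __name__ == "__main__":
    # Hypothetical document shape; only the "transcription" entries
    # mirror what appears in the diff above.
    sample = json.loads(
        '{"lines": [{"transcription": null},'
        ' {"transcription": {"text": "NHMUK", "confidence": 0.6046}}]}'
    )
    print(summarize(sample))
```

Running `summarize` on the same file from both versions would show how many null transcriptions were filled in, and whether the mean confidence of the transcribed lines moved.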

@llivermore Could you try this new version on your side? We are also looking at the DLA results: they may not be optimal at lower image resolutions (the higher the resolution, the better for DLA).