DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Validate/correct text line transcriptions in herbarium sheet dataset #68

Open martinteklia opened 2 years ago

martinteklia commented 2 years ago

The Google Vision output has been copied to the text lines you annotated. Sometimes the polygon matching failed, or Google Vision returned no transcription, but most of the text lines should have a transcription.

Now the goal is to validate (and correct if necessary) the text line transcriptions.

Here are some transcription guidelines:

  1. Transcribe verbatim - don't expand abbreviations
  2. Don't use markup in the transcriptions to mark uncertainty, as in `[London?] in January 1873`. If you get one letter wrong, it's probably not a big deal, because there will be human errors anyway. If, however, there's a bigger part of the line that you are unable to decipher, then add the classification unclear_transcription to the text_line element (step 9 in the following screenshot list).
  3. Crossed out words should be ignored
  4. When correcting transcriptions - no need to correct spaces around punctuation or other symbols
  5. If the Google Vision transcription is already correct, validating it just means clicking on the copy button (step 5 in the following screenshot list)

Guide for validating/correcting text line transcriptions on Arkindex

  1. Filter to have only text_line elements
  2. Display the elements on the image
  3. Start clicking on the text line elements in the list on the left
  4. Click on the A+ button to validate/correct the transcriptions, or on Add or edit a manual transcription in the right pane
  5. In the transcription modal, click on the copy button to copy it into a manual transcription that can be modified
  6. If the transcription was already correct then that's it - it has been validated.
  7. If it wasn't correct then click on the crayon button to edit the manual transcription
  8. If the line doesn't have any transcription from google vision then you can add it yourself
  9. If you're unable to transcribe the line then add a class unclear_transcription to the text_line

martinteklia commented 2 years ago

@llivermore could you take a look if the guide is clear enough?

llivermore commented 2 years ago

@martinteklia it looks clear to me - I/we can update the project team in the meeting today and work on the lines during the next sprint.

martinteklia commented 2 years ago

A classification from_gold_standard_250 has been added to the sheets that come from the gold standard. Don't correct these yet, because I will try to match the transcriptions from the gold standard to the new polygons, and hopefully fewer corrections will then be needed.

llivermore commented 2 years ago

Hi @matdillen, @Cubey0, @emhaston, @droepert, and @kris-loh are you able to validate and correct text line transcriptions in the herbarium sheet dataset? I am hoping to get these done over Christmas.

llivermore commented 2 years ago

@kris-loh can you prioritise MNHN, Tartu and Naturalis? I will prioritise NHM, Kew and Luomus.

matdillen commented 2 years ago

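@martinteklia I have a few questions, after doing one of the tasks:

* I can't seem to add unclear_transcription. I can find it in the dropdown, but clicking Add doesn't do anything. I wanted to add this for the two lines which are partially covered by the plant, and which the algorithm gets more or less right at least for the text it could see.
* The barcode interprets a poorly printed `0` as a `)`. I assume it's fine to leave this transcription unedited?
* There is a weird symbol in the text, an `ε` with a vertical dash around it. This seems to be a written variant of `&`, interpreted by the algorithm as an `e`. Is it fine to leave edge cases like this unedited?
* Tasks [1](https://arkindex.teklia.com/element/f5238cd2-a89b-4fed-a44f-634f3bb7736d) and [2](https://arkindex.teklia.com/element/90b87db8-253e-4c91-aaf5-d7c8fa1f414c) for MeiseBG seem to have 2 text boxes for every line. Only one of the two has the GCV annotations.

Aside from this, I wonder if it is feasible to bulk copy all the manual transcriptions to all specimens not covered by from_gold_standard_250, but still make them editable by the validators (i.e. steps 4-5)? Then the validators would only have to edit this one if incorrect, rather than make all the clicks. I believe it may speed up the process and make it significantly less repetitive, as the majority of lines are correct.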

martinteklia commented 2 years ago

> @martinteklia I have a few questions, after doing one of the tasks:
>
> * I can't seem to add unclear_transcription. I can find it in the dropdown, but clicking Add doesn't do anything. I wanted to add this for the two lines which are partially covered by the plant, and which the algorithm gets more or less right at least for the text it could see.

Yes, there's a bug with adding a class. Thanks for pointing it out! We're working on fixing it.

> * The barcode interprets a poorly printed `0` as a `)`. I assume it's fine to leave this transcription unedited?

Yes.

> * There is a weird symbol in the text, an `ε` with a vertical dash around it. This seems to be a written variant of `&`, interpreted by the algorithm as an `e`. Is it fine to leave edge cases like this unedited?

If there are a lot of them, it would be better to use `&`, but if there are only a few then they can be kept unedited.

> * Tasks [1](https://arkindex.teklia.com/element/f5238cd2-a89b-4fed-a44f-634f3bb7736d) and [2](https://arkindex.teklia.com/element/90b87db8-253e-4c91-aaf5-d7c8fa1f414c) for MeiseBG seem to have 2 text boxes for every line. Only one of the two has the GCV annotations.

All the text boxes there have been made by humans. Only the best-matching text box is chosen to receive the GCV annotation - there is no point in transcribing everything twice.

> Aside from this, I wonder if it is feasible to bulk copy all the manual transcriptions to all specimens not covered by from_gold_standard_250, but still make them editable by the validators (i.e. steps 4-5)? Then the validators would only have to edit this one if incorrect, rather than make all the clicks. I believe it may speed up the process and make it significantly less repetitive, as the majority of lines are correct.

So you mean instead of validating the transcriptions the editors would be moving the polygons around the image?

martinteklia commented 2 years ago

Some statistics of the transcription correction/validation process:

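| folder_name | manual_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
|---|---|---|---|---|---|
| BGBM | 320 | 320 | 14 | 14 | 14 |
| Kew | 33 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 534 | 1 | 1 | 20 |
| MNHN | 0 | 322 | 0 | 0 | 20 |
| MeiseBG | 43 | 4268 | 1 | 3 | 200 |
| NHM London | 449 | 461 | 14 | 20 | 20 |
| Naturalis | 120 | 539 | 6 | 7 | 20 |
| RBGE | 331 | 4627 | 10 | 17 | 200 |
| Tartu | 1 | 300 | 0 | 1 | 19 |
| Total | 1326 | 11919 | 47 | 64 | 533 |

martinteklia commented 2 years ago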

Also the gold standard transcriptions have been copied to the annotated text_line elements where possible, for example (s8+ GCV Gold Standard) https://arkindex.teklia.com/element/300fdafc-c75e-4161-9a79-ec9f28a4ed9b?highlight=878a171e-5fa4-43f7-95fb-1ec92622ab2f

matdillen commented 2 years ago

> All the text boxes there have been made by humans. Only the best-matching text box is chosen to receive the GCV annotation - there is no point in transcribing everything twice.

Okay, so I'll just delete the duplicated text lines and keep the ones with GCV transcriptions.

> So you mean instead of validating the transcriptions the editors would be moving the polygons around the image?

No, for every GCV transcription, there would already be a 'manual' annotation. The validators don't have to add it each time, they only need to edit it if needed and add manual annotations to text missed by GCV.

So, instead of clicking A+ and copy each time, these steps would have been done already as a batch operation. Validators can see both transcriptions in the side pane and only need to edit them if they're incorrect.

If this is feasible in ArkIndex, I think the only downside is that it complicates keeping track of validator work.

martinteklia commented 2 years ago

> No, for every GCV transcription, there would already be a 'manual' annotation. The validators don't have to add it each time, they only need to edit it if needed and add manual annotations to text missed by GCV.
>
> So, instead of clicking A+ and copy each time, these steps would have been done already as a batch operation. Validators can see both transcriptions in the side pane and only need to edit them if they're incorrect.
>
> If this is feasible in ArkIndex, I think the only downside is that it complicates keeping track of validator work.

Yes, the main reason why we're doing it like this is that otherwise it's hard to tell whether a transcription was already correct and didn't need any changes, or the annotator just forgot to correct it.

Another way would be to add a class, validated_transcription, to lines that are already correct.

Or the class could be on page level and we would trust that the annotator has validated (and corrected if necessary) all the transcriptions on the page if they add that class to the page.

martinteklia commented 2 years ago

> * I can't seem to add unclear_transcription. I can find it in the dropdown, but clicking Add doesn't do anything. I wanted to add this for the two lines which are partially covered by the plant, and which the algorithm gets more or less right at least for the text it could see.

It is fixed now. Do a hard refresh on the page to update it. (The version at the bottom of the page should be 1.1.3-p1).

emhaston commented 2 years ago

We now have some specimens where the s8+ GCV Output and the Gold Standard are both incorrect - picking up text from outside the polygon for example.

In these examples, we are copying and editing the original s8+ GCV Output and leaving the Gold Standard Output as is.


martinteklia commented 2 years ago

Yes, you're supposed to only create one manual transcription. The gold standard was added in case it's already correct, so you would need to only click on the copy button and save time by not correcting the transcription.

Cubey0 commented 2 years ago

If gold (or any existing) is correct do we still need to create manual?

emhaston commented 2 years ago

We've now completed the first page of the RBGE dataset - 20 specimen records. The OCR has all been checked, a manual copy created, corrected where necessary.

martinteklia commented 2 years ago

> If gold (or any existing) is correct do we still need to create manual?

Yes, because otherwise it's hard to tell whether the transcription is correct or was simply never validated. (This is also stated in step 6 of the guide at the beginning of the issue.)

martinteklia commented 2 years ago

> We've now completed the first page of the RBGE dataset - 20 specimen records. The OCR has all been checked, a manual copy created, corrected where necessary.

There are 10 text_line elements without a manual or gold standard (s8+ GCV Gold Standard) transcription in the 20 pages of RBGE:

Progress stats

There's an extra column gold_trans_line_count - the number of lines with gold standard transcription, but no manual transcription.

Also a page is considered completed if all the lines either have a manual transcription or gold standard transcription (s8+ GCV Gold Standard).

So if the gold standard transcription is correct you don't need to create a manual copy.

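| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
|---|---|---|---|---|---|---|
| BGBM | 320 | 0 | 320 | 14 | 14 | 14 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 92 | 534 | 1 | 1 | 20 |
| MNHN | 0 | 0 | 322 | 0 | 0 | 20 |
| MeiseBG | 43 | 541 | 4268 | 1 | 3 | 200 |
| NHM London | 449 | 9 | 461 | 17 | 20 | 20 |
| Naturalis | 120 | 177 | 539 | 6 | 7 | 20 |
| RBGE | 449 | 566 | 4627 | 16 | 20 | 200 |
| Tartu | 1 | 0 | 300 | 0 | 1 | 19 |

kris-loh commented 2 years ago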

> @kris-loh can you prioritise MNHN, Tartu and Naturalis? I will prioritise NHM, Kew and Luomus.

@llivermore I finished MNHN, Naturalis and have 10 left for Tartu. Would you like me to do some more?

@martinteklia I accidentally deleted two "s8+ GCV" transcriptions. The Copy sign is right next to the delete and I accidentally clicked on the wrong one.

martinteklia commented 2 years ago

> @martinteklia I accidentally deleted two "s8+ GCV" transcriptions. The Copy sign is right next to the delete and I accidentally clicked on the wrong one.

As long as there's a manual transcription, it's not a problem.

Cubey0 commented 2 years ago

These 10 are now sorted.

> There are 10 text_line elements without a manual or gold standard (s8+ GCV Gold Standard) transcription in the 20 pages of RBGE:

martinteklia commented 2 years ago

Progress stats:

| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
|---|---|---|---|---|---|---|
| BGBM | 320 | 0 | 320 | 14 | 14 | 14 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 92 | 534 | 1 | 1 | 20 |
| MNHN | 319 | 0 | 322 | 17 | 20 | 20 |
| MeiseBG | 43 | 541 | 4268 | 1 | 3 | 200 |
| NHM London | 449 | 9 | 461 | 17 | 20 | 20 |
| Naturalis | 493 | 0 | 499 | 15 | 20 | 20 |
| RBGE | 460 | 566 | 4627 | 20 | 20 | 200 |
| Tartu | 299 | 0 | 301 | 18 | 19 | 19 |
| Total | 2445 | 1363 | 11880 | 104 | 118 | 533 |

martinteklia commented 2 years ago

Progress stats:

| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
|---|---|---|---|---|---|---|
| BGBM | 320 | 0 | 320 | 14 | 14 | 14 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 92 | 534 | 1 | 1 | 20 |
| MNHN | 319 | 0 | 322 | 17 | 20 | 20 |
| MeiseBG | 547 | 539 | 4235 | 22 | 30 | 200 |
| NHM London | 449 | 9 | 461 | 17 | 20 | 20 |
| Naturalis | 493 | 0 | 499 | 15 | 20 | 20 |
| RBGE | 460 | 566 | 4627 | 20 | 20 | 200 |
| Tartu | 299 | 0 | 301 | 18 | 19 | 19 |
| Total | 2949 | 1361 | 11847 | 125 | 145 | 533 |

llivermore commented 2 years ago

We need more pages and transcribed lines (double the current amount). The focus should be on lines of handwritten text.

emhaston commented 2 years ago

Bit of a stupid question. If we pick and choose pages for handwritten text and they are scattered through the batch, I'm assuming that is not a problem for you knowing what has been done?

martinteklia commented 2 years ago

> Bit of a stupid question. If we pick and choose pages for handwritten text and they are scattered through the batch, I'm assuming that is not a problem for you knowing what has been done?

In the training data to detect the text line polygons, all the text lines must be annotated, because the full page is used as input to the machine learning model. If some lines are not annotated it will be confusing to the model.

However, for the text recognition model the training inputs are line images with the corresponding transcriptions. So it won't be a problem, because we'll just ignore the lines without a manual or aligned gold standard transcription.
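
As a rough illustration of that filtering step, here is a minimal Python sketch. The `TextLine` record and its field names are hypothetical (they are not the Arkindex export schema), and two choices are assumptions rather than anything confirmed in this thread: a manual transcription is preferred over the gold standard one, and lines flagged unclear_transcription are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextLine:
    """Hypothetical record for one text_line element (illustrative only)."""
    line_id: str
    manual: Optional[str] = None  # manual transcription, if any
    gold: Optional[str] = None    # aligned "s8+ GCV Gold Standard" transcription, if any
    unclear: bool = False         # True if the line carries unclear_transcription


def training_pairs(lines: List[TextLine]) -> List[Tuple[str, str]]:
    """Return (line_id, text) pairs that could feed a text recognition model."""
    pairs = []
    for line in lines:
        if line.unclear:
            # Assumption: lines marked as unclear are not used for training.
            continue
        # Assumption: a manual transcription takes precedence over the gold one.
        text = line.manual if line.manual is not None else line.gold
        if text:
            pairs.append((line.line_id, text))
        # Lines with neither transcription are simply ignored.
    return pairs
```

Because the selection is per line, it does not matter which pages the validated lines come from or whether they are scattered through the batch.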

emhaston commented 2 years ago

For RBGE: focussing on handwriting. Each to pick out and transcribe the handwritten labels from the following pages:

* Rob: 2-4
* Robyn: 5-7 (done)
* Elspeth:
  * page 8 (141-160) (141-143 done)
  * page 9 (161-180) (done)
  * page 10 (181-200) (done)

jblettery commented 2 years ago

I am available to do verification/transcription. Which pages should be prioritised?

Cubey0 commented 2 years ago

My understanding was that pages with handwritten text should be prioritised, as OCR of text is a "known" model.

martinteklia commented 2 years ago

Yes, recognizing handwritten text is harder than printed text.

jblettery commented 2 years ago

RBGE pages 2 and 3: the handwritten text has been transcribed.

Overall, "s8+ GCV Gold Standard" has better intuition than I do - when I can't transcribe I put unclear_transcription but in reality the Gold Standard probably comes very close to a good transcription.

martinteklia commented 2 years ago

The "s8+ GCV Gold Standard" transcriptions have been corrected by humans, so the quality is supposed to be quite high. However, the polygons from google vision were not corrected and they aren't very good (covering multiple lines).

We tried to match the google vision output to the text line polygons made by humans, but the matching is not 100% correct - that's why we need them to be validated (and corrected if necessary).
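
The thread does not say how this matching was implemented. As a stand-in, the sketch below uses a common technique, intersection-over-union (IoU) on axis-aligned bounding boxes, to give each OCR result to the single best-overlapping human-drawn line. The function names, the data layout, and the `min_iou` threshold are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = min(ax1, bx1) - max(ax0, bx0)
    inter_h = min(ay1, by1) - max(ay0, by0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # no overlap (also covers zero-area boxes)
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union


def match_ocr_to_lines(ocr_results, line_boxes, min_iou=0.3):
    """Assign each OCR text to the best-overlapping, still unassigned line.

    ocr_results: dict of ocr_id -> (box, text)
    line_boxes:  dict of line_id -> box
    Returns a dict of line_id -> text. OCR results whose best overlap is not
    above min_iou stay unmatched, and each line receives at most one text.
    """
    assigned = {}
    for _ocr_id, (box, text) in ocr_results.items():
        best_id, best_score = None, min_iou
        for line_id, line_box in line_boxes.items():
            if line_id in assigned:
                continue
            score = iou(box, line_box)
            if score > best_score:
                best_id, best_score = line_id, score
        if best_id is not None:
            assigned[best_id] = text
    return assigned
```

A heuristic like this fails in exactly the ways described here: an OCR box that covers several lines overlaps each of them only weakly, and a duplicated line box never receives a transcription. That is why the result still needs human validation.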

The missing polygon for geographical coordinates is just a human error - the polygons are made by humans. You can draw them yourself, if they are missing.

kris-loh commented 2 years ago

@martinteklia I am currently doing more, but looking at the numbers I have some questions. What makes a page completed? From what we were doing before Christmas, we should have 20 completed for all the institutes other than BGBM, but I can see there are fewer. We have 20 started pages, but fewer than 20 completed. Can you see which pages are incomplete? (It would help to finish those)

martinteklia commented 2 years ago

> @martinteklia I am currently doing more, but looking at the numbers I have some questions. What makes a page completed?

A page is considered completed if all the lines either have a manual transcription or a gold standard transcription (s8+ GCV Gold Standard).
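
This rule, together with the column definitions given earlier in the thread, can be sketched in a few lines of Python. It is only an illustration: the per-line dictionaries with `manual` and `gold` keys are a hypothetical data layout, and treating a page as started once it has at least one manual transcription is an assumption, since the thread does not define started_pages.

```python
def has_text(line, key):
    """True if the line dict has a non-empty transcription under key."""
    return bool(line.get(key))


def is_completed(lines):
    """Every line on the page has a manual or a gold standard transcription."""
    # Assumption: a page without any text_line elements is not counted as completed.
    return bool(lines) and all(
        has_text(line, "manual") or has_text(line, "gold") for line in lines
    )


def is_started(lines):
    """Assumption: a page is started once any line has a manual transcription."""
    return any(has_text(line, "manual") for line in lines)


def summarise(pages):
    """pages: dict of page_id -> list of line dicts. Returns one stats row."""
    all_lines = [line for lines in pages.values() for line in lines]
    return {
        "manual_trans_line_count": sum(has_text(l, "manual") for l in all_lines),
        # Lines with a gold standard transcription but no manual transcription.
        "gold_trans_line_count": sum(
            has_text(l, "gold") and not has_text(l, "manual") for l in all_lines
        ),
        "total_line_count": len(all_lines),
        "completed_pages": sum(is_completed(ls) for ls in pages.values()),
        "started_pages": sum(is_started(ls) for ls in pages.values()),
        "total_pages": len(pages),
    }
```

Under this rule a single untranscribed line keeps a page in the started but not completed state, which would explain a folder showing 20 started pages but fewer than 20 completed ones.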

> From what we were doing before Christmas, we should have 20 completed for all the institutes other than BGBM, but I can see there are fewer. We have 20 started pages, but fewer than 20 completed. Can you see which pages are incomplete? (It would help to finish those)

Updated progress table:

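| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
|---|---|---|---|---|---|---|
| BGBM | 1000 | 0 | 1052 | 40 | 49 | 49 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 136 | 30 | 534 | 3 | 9 | 20 |
| MNHN | 319 | 0 | 322 | 17 | 20 | 20 |
| MeiseBG | 769 | 528 | 4235 | 38 | 47 | 200 |
| NHM London | 958 | 9 | 982 | 30 | 40 | 40 |
| Naturalis | 493 | 0 | 499 | 15 | 20 | 20 |
| RBGE | 1449 | 421 | 4627 | 51 | 84 | 200 |
| Tartu | 299 | 0 | 301 | 18 | 19 | 19 |
| Total | 5456 | 1143 | 13100 | 213 | 289 | 588 |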

Links to the BGBM lines without transcription are in the attachment.

BGBM.txt
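
To see where the remaining work is, the updated progress table above can be turned into a per-folder coverage figure, meaning the share of lines that already have a manual or gold standard transcription. The numbers below are copied from that table; the `coverage` helper and the variable names are just illustrative.

```python
# (manual_trans_line_count, gold_trans_line_count, total_line_count) per folder,
# copied from the updated progress table.
PROGRESS = {
    "BGBM": (1000, 0, 1052),
    "Kew": (33, 155, 548),
    "Luomus": (136, 30, 534),
    "MNHN": (319, 0, 322),
    "MeiseBG": (769, 528, 4235),
    "NHM London": (958, 9, 982),
    "Naturalis": (493, 0, 499),
    "RBGE": (1449, 421, 4627),
    "Tartu": (299, 0, 301),
}


def coverage(manual, gold, total):
    """Fraction of lines with a manual or gold standard transcription."""
    return (manual + gold) / total


# Column sums, which should reproduce the Total row of the table.
totals = tuple(map(sum, zip(*PROGRESS.values())))
overall = coverage(*totals)

# Folders ordered from least to most covered.
by_coverage = sorted(PROGRESS, key=lambda name: coverage(*PROGRESS[name]))

for name in by_coverage:
    print(f"{name:12s} {coverage(*PROGRESS[name]):6.1%}")
print(f"{'Total':12s} {overall:6.1%}")
```

Sorting by coverage puts the folders with the most untranscribed lines first, which may help when deciding which pages to prioritise.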

martinteklia commented 2 years ago

An HTR tool was run on the pages where there were text_line elements without any manual transcription. If you could validate/correct the transcriptions on handwritten lines it would be great.

Most pages from this point forward don't have manual transcriptions:

kris-loh commented 2 years ago

Hi @martinteklia I can see the changes on the Teklia website. There is one I don't know how to make it work. So before, when you wanted to transcribe a text box, if you clicked on it, it popped up in an extra window. It doesn't do that anymore, that means when the text is upside down for example, even if I put it "Rotation 180°", nothing happens, I still have to transcribe from looking at the text upside down. Am I doing something wrong?

martinteklia commented 2 years ago

Hi @kris-loh

Yes, now it's possible to transcribe directly in the details pane on the right, so the user would need to make fewer clicks.

However, as you said, it doesn't support rotated lines.

To transcribe rotated lines, you can click on the Manage button. It will open an extra window as before and it should display correctly. https://arkindex.teklia.com/element/ef75c607-9522-4025-88f0-1d3c527f577d?highlight=1692e74b-63c7-4d39-9343-b8d8a91151bc

kris-loh commented 2 years ago

@martinteklia Thank you for the help!

llivermore commented 1 year ago

We need to publish and document the updates to this dataset (ideally) before the completion of the project.

kermorvant commented 1 year ago

We have a new interface for data validation that we would use if we keep creating ground truth: https://teklia.com/blog/202209-callico/