Validate/correct text line transcriptions in herbarium sheet dataset

martinteklia commented 2 years ago

The google vision output has been copied to the text lines you had annotated. Sometimes the matching of the polygons failed, or there were no transcriptions from google vision, but most of the text lines should have a transcription.

Now the goal is to validate (and correct if necessary) the text line transcriptions.

Here are some transcription guidelines:

* Transcribe verbatim - don't expand abbreviations.
* Don't use markup in the transcriptions to mark uncertainty, as in `[London?] in January 1873`. If you get one letter wrong it's probably not a big deal, because there will be human errors anyway. If, however, there's a bigger part of the line that you are unable to decipher, then add a classification unclear_transcription to the text_line element (step 9 in the following screenshot list).
* Crossed out words should be ignored.
* When correcting transcriptions, there is no need to correct spaces around punctuation or other symbols.
* If the google vision transcription is already correct, then to validate means to click on the copy button (step 5 in the following screenshot list).

Guide for validating/correcting text line transcriptions on Arkindex

1. Filter to have only text_line elements.
2. Display the elements on the image.
3. Start clicking on the text line elements in the list on the left.
4. Click on the A+ button to validate/correct the transcriptions, or on Add or edit a manual transcription in the right pane.
5. In the transcription modal, click on the copy button to copy the transcription into a manual transcription that can be modified.
6. If the transcription was already correct then that's it - it has been validated.
7. If it wasn't correct, click on the crayon button to edit the manual transcription.
8. If the line doesn't have any transcription from google vision, you can add it yourself.
9. If you're unable to transcribe the line, add a class unclear_transcription to the text_line.

martinteklia commented 2 years ago

@llivermore could you take a look if the guide is clear enough?

llivermore commented 2 years ago

@martinteklia it looks clear to me - I/we can update the project team in the meeting today and work on the lines during the next sprint.

martinteklia commented 2 years ago

A classification from_gold_standard_250 has been added to the sheets that come from the gold standard. Don't correct these yet, because I will try to match the transcriptions from the gold standard to the new polygons, and hopefully fewer corrections will be needed.

llivermore commented 2 years ago

Hi @matdillen, @Cubey0, @emhaston, @droepert, and @kris-loh are you able to validate and correct text line transcriptions in the herbarium sheet dataset? I am hoping to get these done over Christmas.

llivermore commented 2 years ago

@kris-loh can you prioritise MNHN, Tartu and Naturalis? I will prioritise NHM, Kew and Luomus.

matdillen commented 2 years ago

@martinteklia I have a few questions, after doing one of the tasks:

* I can't seem to add unclear_transcription. I can find it in the dropdown, but clicking Add doesn't do anything. I wanted to add this for the two lines which are partially covered by the plant, and which the algorithm gets more or less right, at least for the text it could see.
* The barcode interprets a poorly printed `0` as a `)`. I assume it's fine to leave this transcription unedited?
* There is a weird symbol in the text, an `ε` with a vertical dash around it. This seems to be a written variant of `&`, interpreted by the algorithm as an `e`. Is it fine to leave edge cases like this unedited?
* Tasks 1 and 2 for MeiseBG seem to have 2 text boxes for every line. Only one of the two has the GCV annotations.

Aside from this, I wonder if it is feasible to bulk copy all the manual transcriptions to all specimens not covered by from_gold_standard_250, but still make them editable by the validators (i.e. steps 4-5)? Then the validators would only have to edit this one if incorrect, rather than make all the clicks. I believe it may speed up the process and make it significantly less repetitive, as the majority of lines are correct.

martinteklia commented 2 years ago

> @martinteklia I have a few questions, after doing one of the tasks:
>
> * I can't seem to add unclear_transcription. I can find it in the dropdown, but clicking Add doesn't do anything. I wanted to add this for the two lines which are partially covered by the plant, and which the algorithm gets more or less right, at least for the text it could see.

Yes, there's a bug with adding a class. Thanks for pointing it out! We're working on fixing it.

> * The barcode interprets a poorly printed `0` as a `)`. I assume it's fine to leave this transcription unedited?

Yes.

> * There is a weird symbol in the text, an `ε` with a vertical dash around it. This seems to be a written variant of `&`, interpreted by the algorithm as an `e`. Is it fine to leave edge cases like this unedited?

If there's a lot of them, it would be better to use &, but if there's only a few then it can be kept unedited.

> * Tasks [1](https://arkindex.teklia.com/element/f5238cd2-a89b-4fed-a44f-634f3bb7736d) and [2](https://arkindex.teklia.com/element/90b87db8-253e-4c91-aaf5-d7c8fa1f414c) for MeiseBG seem to have 2 text boxes for every line. Only one of the two has the GCV annotations.

All the text boxes there have been made by humans. Only the best-matching text box is chosen to receive the GCV annotation - there is no point in transcribing everything twice.

> Aside from this, I wonder if it is feasible to bulk copy all the manual transcriptions to all specimens not covered by from_gold_standard_250, but still make them editable by the validators (i.e. steps 4-5)? Then the validators would only have to edit this one if incorrect, rather than make all the clicks. I believe it may speed up the process and make it significantly less repetitive, as the majority of lines are correct.

So you mean instead of validating the transcriptions the editors would be moving the polygons around the image?

martinteklia commented 2 years ago

Some statistics of the transcription correction/validation process:

* manual_trans_line_count is the number of manual text_line transcriptions
* total_line_count is the number of text_line polygons drawn by you in the previous annotation task
* a page is considered completed if every text_line has a manual transcription
* a page is considered started if at least one text_line has a manual transcription
* total_pages is the number of pages on which there are text_line polygons drawn by the annotators

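| folder_name | manual_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
| --- | --- | --- | --- | --- | --- |
| BGBM | 320 | 320 | 14 | 14 | 14 |
| Kew | 33 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 534 | 1 | 1 | 20 |
| MNHN | 0 | 322 | 0 | 0 | 20 |
| MeiseBG | 43 | 4268 | 1 | 3 | 200 |
| NHM London | 449 | 461 | 14 | 20 | 20 |
| Naturalis | 120 | 539 | 6 | 7 | 20 |
| RBGE | 331 | 4627 | 10 | 17 | 200 |
| Tartu | 1 | 300 | 0 | 1 | 19 |
| Total | 1326 | 11919 | 47 | 64 | 533 |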
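
The definitions above can be sketched in a few lines of Python. This is only an illustration of the counting rules, not the script that produced the table: the `TextLine` record, its field names, and the sample data are assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextLine:
    folder: str       # institution folder, e.g. "Kew"
    page_id: str      # hypothetical page identifier
    has_manual: bool  # True if the line has a manual transcription


def progress_stats(lines):
    """Aggregate per-folder counts following the definitions above."""
    # folder -> page -> list of has_manual flags, one per text_line
    pages = defaultdict(lambda: defaultdict(list))
    for line in lines:
        pages[line.folder][line.page_id].append(line.has_manual)

    stats = {}
    for folder, by_page in pages.items():
        flags = [flag for page in by_page.values() for flag in page]
        stats[folder] = {
            "manual_trans_line_count": sum(flags),
            "total_line_count": len(flags),
            # completed: every text_line on the page has a manual transcription
            "completed_pages": sum(all(page) for page in by_page.values()),
            # started: at least one text_line on the page has one
            "started_pages": sum(any(page) for page in by_page.values()),
            "total_pages": len(by_page),
        }
    return stats


# Made-up sample: three pages with two, two and one text lines.
example = [
    TextLine("Kew", "p1", True),
    TextLine("Kew", "p1", True),
    TextLine("Kew", "p2", True),
    TextLine("Kew", "p2", False),
    TextLine("Kew", "p3", False),
]
kew = progress_stats(example)["Kew"]
print(kew)
```

Because pages are only discovered through their text lines, a page without any text_line polygon never appears, which matches the definition of total_pages.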

martinteklia commented 2 years ago

Also the gold standard transcriptions have been copied to the annotated text_line elements where possible, for example (s8+ GCV Gold Standard) https://arkindex.teklia.com/element/300fdafc-c75e-4161-9a79-ec9f28a4ed9b?highlight=878a171e-5fa4-43f7-95fb-1ec92622ab2f

matdillen commented 2 years ago

> All the text boxes there have been made by humans. Only best matching text box is chosen to put the GCV annotation - there is no point to transcribe everything twice.

Okay, so I'll just delete the duplicated text lines and keep the ones with GCV transcriptions.

> So you mean instead of validating the transcriptions the editors would be moving the polygons around the image?

No, for every GCV transcription, there would already be a 'manual' annotation. The validators don't have to add it each time, they only need to edit it if needed and add manual annotations to text missed by GCV.

So, instead of clicking A+ and copy each time, these steps would have been done already as a batch operation. Validators can see both transcriptions in the side pane and only need to edit them if they're incorrect.

If this is feasible in ArkIndex, I think the only downside is that it complicates keeping track of validator work.

martinteklia commented 2 years ago

> No, for every GCV transcription, there would already be a 'manual' annotation. The validators don't have to add it each time, they only need to edit it if needed and add manual annotations to text missed by GCV.
>
> So, instead of clicking A+ and copy each time, these steps would have been done already as a batch operation. Validators can see both transcriptions in the side pane and only need to edit them if they're incorrect.
>
> If this is feasible in ArkIndex, I think the only downside is that it complicates keeping track of validator work.

Yes, the main reason we're doing it like this is that otherwise it's hard to tell whether a transcription was already correct and didn't need any changes, or the annotator just forgot to correct it.

Another way would be to add a class, validated_transcription, to lines that are already correct.

Or the class could be on page level and we would trust that the annotator has validated (and corrected if necessary) all the transcriptions on the page if they add that class to the page.

martinteklia commented 2 years ago

> * I can't seem to add unclear_transcription. I can find it in the dropdown, but clicking Add doesn't do anything. I wanted to add this for the two lines which are partially covered by the plant, and which the algorithm gets more or less right, at least for the text it could see.

It is fixed now. Do a hard refresh on the page to update it. (The version at the bottom of the page should be 1.1.3-p1).

emhaston commented 2 years ago

We now have some specimens where the s8+ GCV Output and the Gold Standard are both incorrect - picking up text from outside the polygon for example.

In these examples, we are copying and editing the original s8+ GCV Output and leaving the Gold Standard Output as is.

martinteklia commented 2 years ago

Yes, you're supposed to only create one manual transcription. The gold standard was added in case it's already correct, so you would need to only click on the copy button and save time by not correcting the transcription.

Cubey0 commented 2 years ago

If gold (or any existing) is correct do we still need to create manual?

emhaston commented 2 years ago

We've now completed the first page of the RBGE dataset - 20 specimen records. The OCR has all been checked, a manual copy created, corrected where necessary.

martinteklia commented 2 years ago

> If gold (or any existing) is correct do we still need to create manual?

Yes, because otherwise it's hard to tell if the transcription is correct or just forgotten to be validated. (It's also said in step 6 of the guide at the beginning of the issue)

martinteklia commented 2 years ago

> We've now completed the first page of the RBGE dataset - 20 specimen records. The OCR has all been checked, a manual copy created, corrected where necessary.

There are 10 text_line elements without a manual or gold standard (s8+ GCV Gold Standard) transcription in the 20 pages of RBGE:

Progress stats

There's an extra column gold_trans_line_count - the number of lines with gold standard transcription, but no manual transcription.

Also a page is considered completed if all the lines either have a manual transcription or gold standard transcription (s8+ GCV Gold Standard).

So if the gold standard transcription is correct you don't need to create a manual copy.

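| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
| --- | --- | --- | --- | --- | --- | --- |
| BGBM | 320 | 0 | 320 | 14 | 14 | 14 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 92 | 534 | 1 | 1 | 20 |
| MNHN | 0 | 0 | 322 | 0 | 0 | 20 |
| MeiseBG | 43 | 541 | 4268 | 1 | 3 | 200 |
| NHM London | 449 | 9 | 461 | 17 | 20 | 20 |
| Naturalis | 120 | 177 | 539 | 6 | 7 | 20 |
| RBGE | 449 | 566 | 4627 | 16 | 20 | 200 |
| Tartu | 1 | 0 | 300 | 0 | 1 | 19 |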
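
As a rough sketch of the revised rule (a line counts as done if it has either a manual or a gold standard transcription), the snippet below computes gold_trans_line_count and lists the pages that still block completion. The `Line` record, its field names, and the page identifiers are hypothetical and only serve the example; this is not the script used for the table.

```python
from dataclasses import dataclass


@dataclass
class Line:
    page_id: str      # hypothetical page identifier
    has_manual: bool  # manual transcription present
    has_gold: bool    # aligned "s8+ GCV Gold Standard" transcription present


def gold_only_count(lines):
    """gold_trans_line_count: gold standard present, but no manual transcription."""
    return sum(1 for line in lines if line.has_gold and not line.has_manual)


def incomplete_pages(lines):
    """Map each incomplete page to its number of lines with neither transcription."""
    missing = {}
    for line in lines:
        if not (line.has_manual or line.has_gold):
            missing[line.page_id] = missing.get(line.page_id, 0) + 1
    return missing


# Made-up sample: page "E1" is complete, page "E2" has one untranscribed line.
sample = [
    Line("E1", True, True),
    Line("E1", False, True),
    Line("E2", False, False),
    Line("E2", True, False),
]
print(gold_only_count(sample))
print(incomplete_pages(sample))
```

A page is completed under this rule exactly when it does not appear in the result of `incomplete_pages`.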

kris-loh commented 2 years ago

> @kris-loh can you prioritise MNHN, Tartu and Naturalis? I will prioritise NHM, Kew and Luomus.

@llivermore I finished MNHN, Naturalis and have 10 left for Tartu. Would you like me to do some more?

@martinteklia I accidentally deleted two "s8+ GCV" transcriptions. The Copy sign is right next to the delete and I accidentally clicked on the wrong one.

martinteklia commented 2 years ago

> @martinteklia I accidentally deleted two "s8+ GCV" transcriptions. The Copy sign is right next to the delete and I accidentally clicked on the wrong one.

As long as there's a manual transcription, it's not a problem.

Cubey0 commented 2 years ago

These 10 are now sorted.

"There are 10 text_line elements without a manual or gold standard (s8+ GCV Gold Standard) transcription in the 20 pages of RBGE:"

martinteklia commented 2 years ago

Progress stats:

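| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
| --- | --- | --- | --- | --- | --- | --- |
| BGBM | 320 | 0 | 320 | 14 | 14 | 14 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 92 | 534 | 1 | 1 | 20 |
| MNHN | 319 | 0 | 322 | 17 | 20 | 20 |
| MeiseBG | 43 | 541 | 4268 | 1 | 3 | 200 |
| NHM London | 449 | 9 | 461 | 17 | 20 | 20 |
| Naturalis | 493 | 0 | 499 | 15 | 20 | 20 |
| RBGE | 460 | 566 | 4627 | 20 | 20 | 200 |
| Tartu | 299 | 0 | 301 | 18 | 19 | 19 |
| Total | 2445 | 1363 | 11880 | 104 | 118 | 533 |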

martinteklia commented 2 years ago

Progress stats:

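| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
| --- | --- | --- | --- | --- | --- | --- |
| BGBM | 320 | 0 | 320 | 14 | 14 | 14 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 29 | 92 | 534 | 1 | 1 | 20 |
| MNHN | 319 | 0 | 322 | 17 | 20 | 20 |
| MeiseBG | 547 | 539 | 4235 | 22 | 30 | 200 |
| NHM London | 449 | 9 | 461 | 17 | 20 | 20 |
| Naturalis | 493 | 0 | 499 | 15 | 20 | 20 |
| RBGE | 460 | 566 | 4627 | 20 | 20 | 200 |
| Tartu | 299 | 0 | 301 | 18 | 19 | 19 |
| Total | 2949 | 1361 | 11847 | 125 | 145 | 533 |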

llivermore commented 2 years ago

We need more pages and transcribed lines (double). The focus should be on lines of handwritten text.

emhaston commented 2 years ago

Bit of a stupid question. If we pick and choose pages for handwritten text and they are scattered through the batch, I'm assuming that is not a problem for you knowing what has been done?

martinteklia commented 2 years ago

> Bit of a stupid question. If we pick and choose pages for handwritten text and they are scattered through the batch, I'm assuming that is not a problem for you knowing what has been done?

In the training data to detect the text line polygons, all the text lines must be annotated, because the full page is used as input to the machine learning model. If some lines are not annotated it will be confusing to the model.

However, for the text recognition model the training inputs are line images with the corresponding transcriptions. So it won't be a problem, because we'll just ignore the lines without a manual/(aligned gold standard) transcription.
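
To illustrate the point about the recognition training set, here is a minimal sketch that keeps only lines with a usable transcription. The dictionary keys, file names, and sample texts are invented for the example, and preferring the manual transcription over the gold standard when both exist is an assumption, not something stated in this thread.

```python
def training_pairs(lines):
    """Return (line image, transcription) pairs for the text recognition model.

    Lines with neither a manual nor an aligned gold standard transcription
    are skipped. Assumption: a manual transcription wins over a gold one.
    """
    pairs = []
    for line in lines:
        text = line.get("manual") or line.get("gold")
        if text:
            pairs.append((line["image"], text))
    return pairs


# Made-up sample lines; the keys "image", "manual" and "gold" are hypothetical.
sample_lines = [
    {"image": "line_001.png", "manual": "Herb. Kew", "gold": "Herb Kew"},
    {"image": "line_002.png", "gold": "Leg. J. Smith"},
    {"image": "line_003.png"},  # no transcription at all: ignored
]
pairs = training_pairs(sample_lines)
print(pairs)
```

This is why scattered, partially transcribed pages are fine for recognition training, while the line detection model still needs every line on a page to be annotated.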

emhaston commented 2 years ago

For RBGE: focussing on handwriting. Each to pick out and transcribe the handwritten labels from the following pages:

* Rob: pages 2-4
* Robyn: pages 5-7 (done)
* Elspeth: page 8 (141-160) (141-143 done), page 9 (161-180) (done), page 10 (181-200) (done)

jblettery commented 2 years ago

I am available to do verification/transcription. Which pages should be prioritised?

Cubey0 commented 2 years ago

My understanding was that pages with handwritten text should be prioritised, as OCR of text is a "known" model.

martinteklia commented 2 years ago

Yes, recognizing handwritten text is harder than printed text.

jblettery commented 2 years ago

RBGE pages 2 and 3: the handwritten text has been transcribed.

* Some lines of text have not been covered by s8+ GCV Gold Standard.
* Page 3: some sheets do not have a preview.
* Sometimes s8+ GCV Gold Standard takes two lines and mixes them into one (e.g. E00117382, Text-line13 - 14).
* Sometimes selection boxes are missing (e.g. the geographical coordinates in sheet E00112411).

Overall, "s8+ GCV Gold Standard" has better intuition than I do - when I can't transcribe I put unclear_transcription but in reality the Gold Standard probably comes very close to a good transcription.

martinteklia commented 2 years ago

The "s8+ GCV Gold Standard" transcriptions have been corrected by humans, so the quality is supposed to be quite high. However, the polygons from google vision were not corrected and they aren't very good (covering multiple lines).

We tried to match the google vision output to the text line polygons made by humans, but the matching is not 100% correct - that's why we need them to be validated (and corrected if necessary).

The missing polygon for geographical coordinates is just a human error - the polygons are made by humans. You can draw them yourself, if they are missing.
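
The thread does not say how the google vision output was matched to the human-drawn polygons. One common approach, shown here purely as an assumption, is to compare bounding boxes by intersection over union (IoU) and keep the best-scoring polygon above a threshold. The function names, the 0.5 threshold, and the sample coordinates are all hypothetical.

```python
def bbox(polygon):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def iou(a, b):
    """Intersection over union of two bounding boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0


def match_ocr_to_lines(ocr_items, line_polygons, min_iou=0.5):
    """For each OCR item, pick the human polygon with the highest IoU.

    Items whose best score is below min_iou stay unmatched, which is one way
    a line can end up without any transcription.
    """
    matches = {}
    for text, ocr_polygon in ocr_items:
        scores = [
            (iou(bbox(ocr_polygon), bbox(polygon)), name)
            for name, polygon in line_polygons.items()
        ]
        best_score, best_name = max(scores, default=(0.0, None))
        if best_name is not None and best_score >= min_iou:
            matches[best_name] = text
    return matches


# Made-up sample: two human-drawn lines, one OCR box on line_a, one stray box.
line_polygons = {
    "line_a": [(0, 0), (100, 0), (100, 20), (0, 20)],
    "line_b": [(0, 30), (100, 30), (100, 50), (0, 50)],
}
ocr_items = [
    ("Flora of Java", [(2, 1), (98, 1), (98, 19), (2, 19)]),
    ("stray", [(200, 200), (210, 200), (210, 210), (200, 210)]),
]
matches = match_ocr_to_lines(ocr_items, line_polygons)
print(matches)
```

A scheme like this is sensitive to OCR boxes that span several lines, which fits the observation above that the unmatched or wrongly matched cases still need human validation.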

kris-loh commented 2 years ago

@martinteklia I am currently doing more, but looking at the numbers I have some questions. What makes a page completed? From what we were doing before Christmas, we should have 20 completed for all the institutes other than BGBM, but I can see there are fewer. We have 20 started pages, but fewer than 20 completed. Can you see which pages are incomplete? (It would help to finish those.)

martinteklia commented 2 years ago

> @martinteklia I am currently doing more, but looking at the numbers I have some questions. What makes a page completed?

A page is considered completed if all the lines either have a manual transcription or a gold standard transcription (s8+ GCV Gold Standard).

> From what we were doing before Christmas, we should have 20 completed for all the institutes other than BGBM, but I can see there are fewer. We have 20 started pages, but fewer than 20 completed. Can you see which pages are incomplete? (It would help to finish those.)

Updated progress table:

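| folder_name | manual_trans_line_count | gold_trans_line_count | total_line_count | completed_pages | started_pages | total_pages |
| --- | --- | --- | --- | --- | --- | --- |
| BGBM | 1000 | 0 | 1052 | 40 | 49 | 49 |
| Kew | 33 | 155 | 548 | 1 | 1 | 20 |
| Luomus | 136 | 30 | 534 | 3 | 9 | 20 |
| MNHN | 319 | 0 | 322 | 17 | 20 | 20 |
| MeiseBG | 769 | 528 | 4235 | 38 | 47 | 200 |
| NHM London | 958 | 9 | 982 | 30 | 40 | 40 |
| Naturalis | 493 | 0 | 499 | 15 | 20 | 20 |
| RBGE | 1449 | 421 | 4627 | 51 | 84 | 200 |
| Tartu | 299 | 0 | 301 | 18 | 19 | 19 |
| Total | 5456 | 1143 | 13100 | 213 | 289 | 588 |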

Links to the BGBM lines without transcription are in the attachment.

BGBM.txt

martinteklia commented 2 years ago

An HTR tool was run on the pages where there were text_line elements without any manual transcription. If you could validate/correct the transcriptions on handwritten lines it would be great.

Most pages from this point forward don't have manual transcriptions:

kris-loh commented 2 years ago

Hi @martinteklia, I can see the changes on the Teklia website. There is one thing I can't get to work. Before, when you wanted to transcribe a text box and clicked on it, it popped up in an extra window. It doesn't do that anymore, which means that when the text is upside down, for example, nothing happens even if I set "Rotation 180°", and I still have to transcribe while looking at the text upside down. Am I doing something wrong?

martinteklia commented 2 years ago

Hi @kris-loh

Yes, now it's possible to transcribe directly in the details pane on the right, so the user would need to make fewer clicks.

However, as you said, it doesn't support rotated lines.

To transcribe rotated lines, you can click on the Manage button. It will open an extra window as before and it should display correctly. https://arkindex.teklia.com/element/ef75c607-9522-4025-88f0-1d3c527f577d?highlight=1692e74b-63c7-4d39-9343-b8d8a91151bc

kris-loh commented 2 years ago

@martinteklia Thank you for the help!

llivermore commented 1 year ago

We need to publish and document the updates to this dataset (ideally) before the completion of the project.

kermorvant commented 1 year ago

We have a new interface for data validation that we would use if we keep creating ground truth: https://teklia.com/blog/202209-callico/
