DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Annotate text lines in herbarium sheet dataset #7

Closed: llivermore closed this issue 2 years ago

llivermore commented 3 years ago

Training images of herbarium sheets, with:

Need to discuss outstanding work with Teklia and Mathias Dillen.

martinteklia commented 2 years ago

> @martinteklia can we discuss how best to align the Google Vision output to train a transcription model at the coming Wednesday meeting? I presume this will be an additional manual step where we check and paste the transcriptions into the text lines? It would be good to work on this between the next two meetings (10th - 24th November) but it may take longer.

We have already written a script that automatically aligns the Google Vision output to the text lines, and it should work. However, the Google Vision transcriptions are not always correct, so they will need to be validated and corrected manually.

Cubey0 commented 2 years ago

Why are we using Google Vision for training? Would hand-keyed transcription not be better? Or is it to show OCR usage?

This dataset has both Google Vision output (https://github.com/DiSSCo/sdr-datasets/pull/3) and hand-keyed transcription data (https://zenodo.org/record/3697797#.YYVAYGDP2kw).

kris-loh commented 2 years ago

@martinteklia Hi there! I annotated some specimens and transcribed the information. I have two questions regarding the transcriptions.

1) On herbarium sheets, text is occasionally crossed out. How should we mark that the transcribed text was crossed out? Does it need to be transcribed, or should we just ignore a crossed-out line?

2) With difficult handwriting, I sometimes can't transcribe the words confidently. How should we mark words that are illegible or whose transcription is uncertain?

martinteklia commented 2 years ago

> @martinteklia Hi there! I annotated some specimens and transcribed the information. I have two questions regarding the transcriptions.

There is no need to transcribe them at the moment - it will be done later (semi-automatically). So transcribing can be discussed later.

> 1. On herbarium sheets, text is occasionally crossed out. How should we mark that the transcribed text was crossed out? Does it need to be transcribed, or should we just ignore a crossed-out line?

Crossed-out parts should not be transcribed. If an entire line is crossed out, then yes, it should be left empty (without a transcription).

> 2. With difficult handwriting, I sometimes can't transcribe the words confidently. How should we mark words that are illegible or whose transcription is uncertain?

If you're 90% sure, transcribe your best guess. There will be human errors in the transcriptions anyway. A single-letter error ('a' vs 'o', for example) is probably not a big deal. If the text really is illegible, transcribe it as ##unclear##.

Anyways, at this point you should just draw the text_line polygons. The transcriptions will come later.
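
The conventions above (crossed-out lines left empty, illegible text marked as ##unclear##) could be checked by a small helper once transcriptions exist. This is only a sketch: the function name and the three category labels are hypothetical, and the thread does not say how marked lines are handled downstream.

```python
# Marker named in the thread for illegible text.
UNCLEAR_MARKER = "##unclear##"


def classify_transcription(text: str) -> str:
    """Sort a text-line transcription into one of three hypothetical categories.

    "empty"       -> no transcription (e.g. the whole line was crossed out)
    "unclear"     -> the annotator flagged illegible text with the marker
    "transcribed" -> an ordinary transcription
    """
    stripped = text.strip()
    if not stripped:
        return "empty"
    if UNCLEAR_MARKER in stripped:
        return "unclear"
    return "transcribed"


if __name__ == "__main__":
    for sample in ["", "Herb. Kew", "leg. ##unclear## 1902"]:
        print(repr(sample), "->", classify_transcription(sample))
```

Whether lines in the "unclear" category are dropped from training or kept for later review is a project decision that this sketch does not make.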

martinteklia commented 2 years ago

> Why are we using Google Vision for training? Would hand-keyed transcription not be better? Or is it to show OCR usage?
>
> This dataset has both Google Vision output (DiSSCo/sdr-datasets#3) and hand-keyed transcription data (https://zenodo.org/record/3697797#.YYVAYGDP2kw).

To train the HTR model we need text line images with the corresponding transcriptions. I don't see any polygons in the keyed transcription data, so how would we know where each transcription is situated on the sheet image?

The keyed transcription data could be used for evaluation, to see if they exist in the transcriptions made by the model.
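
One way to picture that evaluation is a simple containment check: for each hand-keyed value, test whether it appears anywhere in the text the model produced for the same sheet. The sketch below assumes the keyed data is available as a field-name-to-value mapping and the model output as a list of line strings; the function names and field names are hypothetical, and exact substring matching is a deliberately strict stand-in for whatever measure the project ends up using.

```python
import re
from typing import Dict, Iterable


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not count as misses."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def keyed_fields_found(
    keyed_fields: Dict[str, str], predicted_lines: Iterable[str]
) -> Dict[str, bool]:
    """For each non-empty keyed field, report whether its value occurs in the model output."""
    page_text = normalize(" ".join(predicted_lines))
    return {
        name: normalize(value) in page_text
        for name, value in keyed_fields.items()
        if value.strip()
    }


if __name__ == "__main__":
    keyed = {"collector": "J. Smith", "country": "Belgium", "year": "1902"}
    predicted = ["Herb.  Kew", "leg. j. smith", "1902"]
    found = keyed_fields_found(keyed, predicted)
    print(found)
    print("share of keyed fields found:", sum(found.values()) / len(found))
```

A fuzzy comparison (for example an edit-distance threshold) would be more forgiving of single-character model errors, at the cost of some false matches.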

Cubey0 commented 2 years ago

So there will be a "matching-up" of the Google Vision polygons (with OCR) to the ones we are manually drawing on the specimens in Arkindex?

martinteklia commented 2 years ago

Yes. The line polygons from Google Vision weren't that great. Now we have good polygons from humans that we can use to train a line segmentation model.

Also, we can try to match the word polygons from Google Vision to the line polygons made by humans, and thereby get transcriptions with good polygons. Afterwards, the transcriptions will be validated and corrected, and we'll have both good transcriptions and good polygons.
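
The matching step could look roughly like the sketch below. It is not the script mentioned earlier in this thread, which is not shown here; it is an illustration under stated assumptions: each OCR word comes with a polygon, a word belongs to the human-drawn line whose polygon contains the word's centroid, and words within a line read left to right. All names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]


def centroid(polygon: Polygon) -> Point:
    """Mean of the vertices; adequate for small, roughly rectangular word boxes."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def contains(polygon: Polygon, point: Point) -> bool:
    """Ray-casting point-in-polygon test."""
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal ray through the point matter.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def align_words_to_lines(
    words: Sequence[Tuple[str, Polygon]], lines: Dict[str, Polygon]
) -> Dict[str, str]:
    """Assign each OCR word to the first line polygon containing its centroid.

    Words that fall inside no line polygon are dropped. Lines with no words
    get an empty transcription.
    """
    buckets: Dict[str, List[Tuple[float, str]]] = {line_id: [] for line_id in lines}
    for text, word_polygon in words:
        center = centroid(word_polygon)
        for line_id, line_polygon in lines.items():
            if contains(line_polygon, center):
                buckets[line_id].append((center[0], text))
                break
    # Order words by the x coordinate of their centroid (left-to-right assumption).
    return {
        line_id: " ".join(text for _, text in sorted(items))
        for line_id, items in buckets.items()
    }


def box(x1: float, y1: float, x2: float, y2: float) -> List[Point]:
    """Axis-aligned rectangle as a polygon, for the example below."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


if __name__ == "__main__":
    lines = {"line1": box(0, 0, 100, 20), "line2": box(0, 30, 100, 50)}
    words = [
        ("Kew", box(50, 5, 70, 15)),
        ("Herb.", box(10, 5, 40, 15)),
        ("1902", box(10, 35, 40, 45)),
        ("stray", box(200, 200, 220, 210)),
    ]
    print(align_words_to_lines(words, lines))
```

Centroid containment is the simplest possible rule. Slanted or overlapping handwritten lines would likely need an overlap-area criterion instead, and the resulting transcriptions would still need the manual validation described above.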

kris-loh commented 2 years ago

@martinteklia Is there something wrong with the website? I was doing an image when I started getting the error "Bad gateway" and now it shows me the following message: "Server unreachable A connection error occurred while fetching authentication data from the Arkindex server. Please try again later."

llivermore commented 2 years ago

@martinteklia if we have sufficient text line annotation then I suggest we discuss and create a separate issue for the next steps with some screenshots and instructions (either linked or in the issue itself). I will leave this issue open until I receive confirmation from you.

martinteklia commented 2 years ago

> @martinteklia Is there something wrong with the website? I was doing an image when I started getting the error "Bad gateway" and now it shows me the following message: "Server unreachable A connection error occurred while fetching authentication data from the Arkindex server. Please try again later."

Yes, there was an issue with the server; it should be OK now.

martinteklia commented 2 years ago

> @martinteklia if we have sufficient text line annotation then I suggest we discuss and create a separate issue for the next steps with some screenshots and instructions (either linked or in the issue itself). I will leave this issue open until I receive confirmation from you.

Yes, fine by me.

Final count:

| institution | annotated_pages_count |
| --- | --- |
| BGBM | 12 |
| Kew | 20 |
| Luomus | 20 |
| MNHN | 20 |
| MeiseBG | 98 |
| NHM London | 20 |
| Naturalis | 20 |
| RBGE | 200 |
| Tartu | 19 |
| Total | 429 |

martinteklia commented 2 years ago

Updated final (?) count:

| institution | annotated_pages_count |
| --- | --- |
| BGBM | 12 |
| Kew | 20 |
| Luomus | 20 |
| MNHN | 20 |
| MeiseBG | 171 |
| NHM London | 20 |
| Naturalis | 20 |
| RBGE | 200 |
| Tartu | 19 |
| Total | 502 |

matdillen commented 2 years ago

I think the MeiseBG specimens should all be complete now as well.

martinteklia commented 2 years ago

> I think the MeiseBG specimens should all be complete now as well.

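Actually, there are 2 MeiseBG pages with no annotations:

* https://arkindex.teklia.com/element/d4b7c902-efd7-4b41-aefb-b7828344c569

* https://arkindex.teklia.com/element/ce83a0c7-9baa-41a4-8077-69d985b02eca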

matdillen commented 2 years ago

> > I think the MeiseBG specimens should all be complete now as well.
>
> Actually, there are 2 MeiseBG pages with no annotations:
>
> * https://arkindex.teklia.com/element/d4b7c902-efd7-4b41-aefb-b7828344c569
>
> * https://arkindex.teklia.com/element/ce83a0c7-9baa-41a4-8077-69d985b02eca

Thanks for pointing them out, I just annotated them for completion's sake.

llivermore commented 2 years ago

This part of the herbarium dataset work is now complete. The next stage is in #68.