DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Annotate text lines in herbarium sheet dataset #7

Closed llivermore closed 2 years ago

llivermore commented 3 years ago

Training images of herbarium sheets, with:

Need to discuss outstanding work with Teklia and Mathias Dillen.

llivermore commented 3 years ago

Need to segment the specimens - can this be automated? MD to trial

benscott commented 3 years ago

Created repository for training datasets: https://github.com/DiSSCo/sdr-datasets (cc @matdillen )

matdillen commented 3 years ago

The dataset is now available at https://github.com/matdillen/sdr-datasets/tree/main/herbarium. I forked the repo, as I had no write access to the DiSSCo one.

The annotated-properties-v2.json file contains annotations using Darwin Core terms and, more explicitly, the six entities I emphasized in #4. Some validation may still be needed; e.g. I noticed that verbatim event dates of S.D. cause false positives.

Cubey0 commented 3 years ago

Am I correct that the herbarium dataset will get the same style of entry screen and mark-up options as a pinned insect dataset?

If we are using the “gold standard ICEDIG” data, is the plan to have each partner institution annotate its own specimens within this dataset?

We have installed a local version of Label Studio if required.

matdillen commented 3 years ago

@Cubey0 I've annotated the herbarium dataset (n=250) with six different entities, based on Darwin Core data. See v3 in that same folder. This will not be perfect and some validation would be useful. I've been in touch with @martinteklia and I think he will import these data into ArkIndex. From there they could be migrated to Label Studio, as shown during last meeting, where they can be validated (I presume, I've never used label studio).

It's also possible to annotate more images than those 250. This could be facilitated similarly with Google Vision results and Darwin Core matching, although the quality will be lower, as Google Vision errors will not have been corrected (unlike for the 250). So it will be more work.
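As a rough sketch, the Google Vision + Darwin Core matching described above could propose entity labels by fuzzy-matching each OCR line against the specimen's known Darwin Core field values. The field list, the 0.8 threshold, and the matcher itself are illustrative assumptions, not the actual SDR pipeline:

```python
from difflib import SequenceMatcher

# Illustrative subset of Darwin Core terms to match against.
DWC_FIELDS = ["recordedBy", "eventDate", "country", "scientificName"]

def propose_entities(ocr_lines, dwc_record, threshold=0.8):
    """Propose (line, entity) pairs where an OCR line closely matches a
    Darwin Core field value of the specimen record."""
    proposals = []
    for line in ocr_lines:
        for field in DWC_FIELDS:
            value = dwc_record.get(field, "")
            if not value:
                continue
            score = SequenceMatcher(None, line.lower(), value.lower()).ratio()
            if score >= threshold:
                proposals.append({"line": line, "entity": field,
                                  "score": round(score, 2)})
    return proposals
```

A verbatim value like "S.D." would slip through such a matcher as a false positive, which is why the validation step mentioned above would still be needed.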

Cubey0 commented 3 years ago

From the last meeting, I think @martinteklia said we may need a few more than the 250.

@emhaston and I have both had a trial run with Label Studio (I wanted to "give it a whirl" because I am planning on using it on another long-term project to digitise the handwritten living collection index cards) and we feel we could easily mark up the RBGE component of the dataset, or we can assist with the validation.

Cubey0 commented 3 years ago

Here is the ABBYY OCR from the 200 (RBGE) Edinburgh specimens in the ICEDIG "Gold Standard" dataset.

I'm not sure how it was going to be used, but it was mentioned during the Zoom call on 26/5/2021.

OCR (Abby) of RBGE part of Gold Standard dataset.xlsx

emhaston commented 2 years ago

Our understanding is now that the following steps will take place:

  1. The complete ICEDIG dataset will be uploaded to Arkindex.
  2. The OCR output data that @matdillen has from GoogleVision will also be uploaded.
  3. Access will be given in the first instance to RBGE, Meise and possibly Kew.
  4. We will check the line transcriptions to reduce having to reimport data into LabelStudio multiple times due to transcription errors.
  5. The data will then be imported into LabelStudio.
  6. We will then annotate the records with named entities.

Is that correct?

Cubey0 commented 2 years ago

I know @matdillen has Google Vision OCR for this dataset, but for the RBGE component of the ICEDIG Gold Standard dataset we can also provide OCR from both ABBYY and Tesseract, as we used them for an in-house OCR software comparison.

matdillen commented 2 years ago

I've made the raw Google Vision results available; see the pull request: DiSSCo/sdr-datasets/pull/3

I had quite a few errors back when I did the processing, so there are only 1792 responses, not 1800.

martinteklia commented 2 years ago

> Our understanding is now that the following steps will take place:
>
> 1. The complete ICEDIG dataset will be uploaded to Arkindex.
> 2. The OCR output data that @matdillen has from GoogleVision will also be uploaded.
> 3. Access will be given in the first instance to RBGE, Meise and possibly Kew.
> 4. We will check the line transcriptions to reduce having to reimport data into LabelStudio multiple times due to transcription errors.
> 5. The data will then be imported into LabelStudio.
> 6. We will then annotate the records with named entities.
>
> Is that correct?

Yes. After we have imported the images and the OCR transcriptions we'll have to see how good the quality of the OCR output is. It might be easier/quicker to re-transcribe than to correct.

llivermore commented 2 years ago

@martinteklia have you been able to import the images? Do you need anything more from us to get the images into Arkindex?

martinteklia commented 2 years ago

@llivermore yes we have imported the images. We are working on a document to describe how they should be annotated.

martinteklia commented 2 years ago

@llivermore how many annotators will there be? We will need to create accounts for them so they can annotate.

Cubey0 commented 2 years ago

@emhaston @Cubey0 From RBGE please.

(I guess others from other institutions?)

qgroom commented 2 years ago

@matdillen are you our annotator or is this something we could ask for a technician to do?

droepert commented 2 years ago

@infinite-dao @droepert for BGBM

matdillen commented 2 years ago

> @matdillen are you our annotator or is this something we could ask for a technician to do?

Let's leave mine for now. Unless you want to have a go as well?

martinteklia commented 2 years ago

We created accounts for the people mentioned here and sent the guide by email.

If you didn't receive the mail, let me know.

matdillen commented 2 years ago

@martinteklia Is it only the first ten pages (200 images) for now? Later images seem to give errors from time to time, e.g. the attached iiif-error-arkindex screenshot.

We have some technicians who can work on this, but they would prefer to work mainly on Meise (BR) specimens.

martinteklia commented 2 years ago

> @martinteklia Is it only the first ten pages (200 images) for now?

Yes, the idea was to have a subset with high-quality human annotations for training and evaluation. By only annotating text line polygons on a subset of the pages, we can move more quickly to the next types of annotation, which require expert knowledge (transcriptions, named entities). If we stumble upon any issues in this process, we can improve and iterate.

Also, if the trained model works well enough, then maybe the text line polygons on the rest of the pages could be annotated semi-automatically.

> Later images seem to give errors from time to time, e.g.

If you get an error like this, you can click the *View source image* button below the image to see if there's a more specific error. Sometimes just clicking on it fixes the problem.

> We have some technicians who can work on this, but they would prefer to work mainly on Meise (BR) specimens.

Yes, of course the pages selected for annotation could be different. We just chose the first 200 so it would be easy to navigate. You can link the pages (in groups of 20) you prefer to annotate, like this: https://arkindex.teklia.com/element/503d7e35-a2cd-4d98-a272-930519fc29b5?page=12, and then I'll update the progress table in the guide.

martinteklia commented 2 years ago

The herbarium sheets are now grouped by institution.

The progress table lists the first 3 groups of 20 pages from each institution, because some institutions may not provide any annotators, while others may have more time or more annotators. If each institution annotates one group, we'll have the 200 pages.

Cubey0 commented 2 years ago

> @emhaston @Cubey0 From RBGE please.
>
> (I guess others from other institutions?)

@martinteklia

Could we please:

1. check that the areas created by @Cubey0 & @emhaston in the RBGE dataset are OK, and
2. add @redrinkwater (r.drinkwater@rbge.org.uk) to the users of the software?

martinteklia commented 2 years ago

> Could we please:
>
> 1. check that the areas created by @Cubey0 & @emhaston in the RBGE dataset are OK

@emhaston:

@Cubey0:

> 2. add @redrinkwater ([r.drinkwater@rbge.org.uk](mailto:r.drinkwater@rbge.org.uk)) to the users of the software, please.

The user has been created. You should receive an email to change the password. Be sure to check the spam or quarantine folder of your email. If you haven't received it, let me know.

Cubey0 commented 2 years ago

OK, I'll go back and revisit my contribution and ask for a re-exam.

RBGE-Herbarium commented 2 years ago

And in that case, I'll keep going ...

matdillen commented 2 years ago

@martinteklia Could you add an.decoster@plantentuinmeise.be and elke.scheers@plantentuinmeise.be?

Cubey0 commented 2 years ago

@redrinkwater has been working on some of the herbarium sheet annotations and wanted to check how to handle lines with mixed typewritten and handwritten text. Should they be segmented separately or together? (example)

martinteklia commented 2 years ago

> @martinteklia Could you add an.decoster@plantentuinmeise.be and elke.scheers@plantentuinmeise.be?

The accounts have been created. They should receive an email to change the password.

> @redrinkwater has been working on some of the herbarium sheet annotations and they wanted to check how they should handle lines with mixed typewritten and handwritten text. Should they be segmented separately or together?

If they are close enough and are vertically aligned, meaning that the handwritten part is not above or below the typewritten part, they should be segmented together. Usually it's better to have longer lines.

For example,
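As a rough sketch, the "segment together" rule above could be expressed as a geometric check on two line boxes. The (x, y, w, h) box format, the 0.5 vertical-overlap ratio, and the 20 px gap limit are illustrative assumptions, not project conventions:

```python
def should_merge(a, b, max_gap=20, min_v_overlap=0.5):
    """True if two (x, y, w, h) line boxes are vertically aligned and
    horizontally close enough to be segmented as one line."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    v_overlap = min(ay + ah, by + bh) - max(ay, by)
    if v_overlap <= 0:
        return False  # one part sits above or below the other
    if v_overlap / min(ah, bh) < min_v_overlap:
        return False  # not really on the same line
    # horizontal gap between the boxes (<= 0 means they overlap)
    gap = max(ax, bx) - min(ax + aw, bx + bw)
    return gap <= max_gap
```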

matdillen commented 2 years ago

@martinteklia I've annotated the first page of MeiseBG specimens. Do you have any feedback? I noted some potential issues:

Do you prefer a bit bigger rectangles for slightly skewed text, or a polygon?

martinteklia commented 2 years ago

> @martinteklia I've annotated the first page of MeiseBG specimens. Do you have any feedback? I noted some potential issues:

> * Handwritten text that is very closely entangled or even overlapping. This can be very tricky to annotate without risking overlap.

If there's a risk of overlap, it's ok if a part of the top or bottom loops are not inside the polygon.

> * Dotted lines on preprinted labels. I suppose we ignore these as much as possible?

Later when fixing/annotating the transcriptions the dots will be ignored, so the HTR model should learn to ignore them as well.

> * Text obscured by part of the specimen or another label. Should this be split up, or the obscuring piece enclosed in the box?

If the obscured part is small, then it should be enclosed. If a big part is hidden, then it's easier to split them up.

> * Stamps, also stamps on top of other text. Is there a recommended way to capture the (curved) text from stamps? And what to do when there's a massive overlap?

We can ignore the curved text. Just the regular (non-curved) text in the middle is good enough. If there's a massive overlap, then the stamp can be considered background and ignored.

> * MeiseBG barcode labels often have split numbers, and the initial 0 is occasionally only printed in half (like a bracket). I've made separate boxes now, because text recognition tends to view pieces of the barcode itself as quote marks.

It would be better for them to be a single line, because automatically figuring out the reading order of tiny rectangles is not trivial. Also, in the other pages they have been annotated as a single line. If "fake" quotation marks turn out to be a problem, there could be a post-processing step that filters out quotation marks on lines inside the barcode label.
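Such a post-processing step could, as a rough sketch, look like this. The (x, y, w, h) box format, the containment test, and the set of quote characters are illustrative assumptions:

```python
def inside(line_box, region_box):
    """True if an (x, y, w, h) line box lies fully inside a region box."""
    lx, ly, lw, lh = line_box
    rx, ry, rw, rh = region_box
    return lx >= rx and ly >= ry and lx + lw <= rx + rw and ly + lh <= ry + rh

def strip_fake_quotes(lines, barcode_regions, quote_chars='"\'“”'):
    """Remove quotation-mark characters from transcribed lines whose box
    falls inside any barcode-label region."""
    cleaned = []
    for text, box in lines:
        if any(inside(box, region) for region in barcode_regions):
            text = "".join(ch for ch in text if ch not in quote_chars)
        cleaned.append((text.strip(), box))
    return cleaned
```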

> Do you prefer a bit bigger rectangles for slightly skewed text, or a polygon?

I would prefer polygon, but if the skew is very slight it's ok to have a bit bigger rectangle.

Except for the barcode numbers, the annotations look good to me.

Some lines have been forgotten on this page: https://arkindex.teklia.com/element/320c50a1-2ac0-4dfc-8c17-46b981ad2afd

matdillen commented 2 years ago

@martinteklia OK thanks, I've fixed the barcode annotations.

llivermore commented 2 years ago

@martinteklia can you add Krisztina Lohonya k.lohonya@nhm.ac.uk and Larissa Welton l.welton@nhm.ac.uk? Thanks!

martinteklia commented 2 years ago

Done. They can use https://arkindex.teklia.com/user/password_reset to change their password.

llivermore commented 2 years ago

@martinteklia can you double check Krisztina's account? She has submitted a password reset a few times and received nothing (also checking junk/spam).

llivermore commented 2 years ago

@martinteklia please ignore - overly-aggressive institutional spam filtering...

llivermore commented 2 years ago

@martinteklia can you add Jere Kahanpää - jere.kahanpaa@helsinki.fi? Jere will work on the Luomus sheets. Thanks :)

martinteklia commented 2 years ago

It's done.

There should already be 223 pages with text_line annotations, but I don't know if they're all complete or not.

So I think people who have started annotating can finish their set of 20 pages or at least finish partially annotated pages (if there are any). But no need to have more annotators at this point, unless they are from an institution that hasn't annotated anything yet.

llivermore commented 2 years ago

@martinteklia I asked @kris-loh to do some text_line annotations on the institutions that were unable to contribute. Are you able to easily get counts of page annotations by institution? If there are a few missing ones then NHM will do some additional annotations.

martinteklia commented 2 years ago

| institution | annotated_pages_count |
|---|---|
| BGBM | 1 |
| Kew | 1 |
| Luomus | 0 |
| MNHN | 0 |
| MeiseBG | 37 |
| NHM London | 20 |
| Naturalis | 1 |
| RBGE | 163 |
| Tartu | 0 |
llivermore commented 2 years ago

@martinteklia if you could still add Jere Kahanpää - jere.kahanpaa@helsinki.fi then we can get some of the Luomus sheets annotated. @kris-loh and I will do some of the other institutes that are lacking annotations.

llivermore commented 2 years ago

@matdillen there was a sheet with missing text annotations. I added most but was unsure about Text line 14: https://arkindex.teklia.com/element/320c50a1-2ac0-4dfc-8c17-46b981ad2afd?highlight=1e393775-3ee5-45a3-beab-2708a41e319b

martinteklia commented 2 years ago

> @martinteklia if you could still add Jere Kahanpää - jere.kahanpaa@helsinki.fi then we can get some of the Luomus sheets annotated. @kris-loh and I will do some of the other institutes that are lacking annotations.

I already created the account before. The password can be changed here: https://arkindex.teklia.com/user/password_reset

The emails seem to get stuck in spam filters for some reason though, so they should be checked as well.

martinteklia commented 2 years ago

> @martinteklia please ignore - overly-aggressive institutional spam filtering...

@llivermore could you send me the email marked as spam with all the headers, so we could try to fix the problem with the emails ending up in spam? Also if the spam filter gives the reason why it was marked as spam, could you share that as well?

Thanks!

llivermore commented 2 years ago

@martinteklia I will try to get 10 examples for each institute by the end of today. Are you still able to train on Monday and share results at the Wednesday team meeting?

llivermore commented 2 years ago

@kris-loh can you add text lines to the following institutes (brackets indicate number of sheets with text lines) so we have a minimum of 10 for each, 20 if you have time?

I have annotated 10 sheets for Tartu and 20 sheets for Kew.

martinteklia commented 2 years ago

> @martinteklia I will try and get 10 examples for each institute by the end of today. Are you still able to train on Monday and share results on the Wednesday team meeting?

We have already started training a model for detecting text lines from these annotations and we should be able to demo it. The model can be trained again with more data when the polygon annotation has finished.

To train a model for transcribing, the transcriptions from the Google Vision output have to be aligned to the annotated text lines and validated as correct. I might be able to train a model with unvalidated data, but most likely an older model would be used for the demo. This way an improvement could be demoed in the following meeting :)

Hopefully the HTR worker will be integrated into Galaxy in time.

martinteklia commented 2 years ago

> @kris-loh can you add text lines to the following institutes (brackets indicate number of sheets with text lines) so we have a minimum of 10 for each, 20 if you have time?
>
> * Luomus (2)
> * MNHN (3)
> * Naturalis (7)
>
> I have annotated 10 sheets for Tartu and 20 sheets for Kew.

BGBM also has fewer than 10 sheets (9).

droepert commented 2 years ago

I will add text lines for the BGBM sheets.


llivermore commented 2 years ago

@martinteklia can we discuss how best to align the Google Vision output to train a transcription model at the coming Wednesday meeting? I presume this will be an additional manual step where we check and paste the transcriptions into the text lines? It would be good to work on this between the next two meetings (10-24 November), but it may take longer.
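One possible shape for that alignment step, as a rough sketch: assign each Google Vision word box to the annotated text line it overlaps most, then join the words per line in reading order. The (x, y, w, h) rectangles and x-sorted reading order are simplifying assumptions here; real annotations are polygons and Google Vision returns four-point vertices:

```python
def overlap_area(a, b):
    """Intersection area of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)

def align_words_to_lines(words, line_boxes):
    """words: list of (text, box). Returns one transcription string per
    line box (empty string if no words matched that line)."""
    buckets = [[] for _ in line_boxes]
    for text, box in words:
        scores = [overlap_area(box, lb) for lb in line_boxes]
        best = max(range(len(line_boxes)), key=lambda i: scores[i])
        if scores[best] > 0:
            buckets[best].append((box[0], text))  # keep x for ordering
    return [" ".join(t for _, t in sorted(b)) for b in buckets]
```

With something like this, the remaining manual step would be validating the aligned transcriptions rather than pasting them in by hand.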