Open benscott opened 3 years ago
@mlbonhomme and @martinteklia
I have started annotating the Pinned Insects in ArkIndex (https://arkindex.teklia.com/element/f53b200c-8f15-4e5d-b4f5-6db047b95d71) and had a couple of questions:
Can you add some additional types?
How many examples should I annotate?
For our DataMatrix barcodes do you want the exact barcode region labelled as in this image?
I added the additional types you asked for.
How many examples should I annotate?
I think 50 pages should be enough at first.
For our DataMatrix barcodes do you want the exact barcode region labelled as in this image?
Yes, that's good.
To add transcriptions to text lines you should click on the button that looks like A+, instead of changing the name of the text line element.
@mlbonhomme wrote a doc at some point about annotations and she wanted to have a discussion with you about it.
Created repository for training datasets: https://github.com/DiSSCo/sdr-datasets
There is a now-outdated annotation guide here: https://gitlab.com/mlbonhomme/arkindex_annotation/-/tree/master/SYNTHESYS which I'll be updating today. I'll share the link to it here when it's done!
Here are the updated annotation guidelines: https://notes.teklia.com/s/zO3O4i_-p. I am of course available to discuss them if there is anything strange or wrong about them, and if you need different element types they can always be added.
I took a look at the existing annotations and I think some of them will need to be deleted or corrected, as I saw that the element type "decoration" was used, and some transcriptions were put as the name of the text line elements. You should all have the necessary user rights to do this.
@mlbonhomme is Specimen 010266087 correctly annotated?
@llivermore it looks fine to me! maybe label 12 should be a barcode/qrcode if we want to try to identify those? or maybe only the qr code itself should be annotated as such, I don't really know what would be best or if it would make much of a difference when training later on
@mlbonhomme Apologies - Label 12 was a duplicate of Barcode 13. I had accidentally renamed the names of the types and duplicated it when attempting to tidy >_< It should be fixed now. I'll go through and fix the rest and will ask one of my team to do some more next Monday (2021-05-10).
@mlbonhomme I have annotated the first 20 specimen images. Could you check them and give feedback/suggestions for improvement?
I had a few questions/notes:
label and text line when there is only one text line in the label, but I think it will be simpler later on than having to deal with particularities (especially since the annotations are not hierarchical, so to see if a text line is "inside" a label or not we would have to look at the polygons)
to-review or something, that you would put on the relevant text lines so later whoever does this verification can filter the text lines and only see the ones marked as to-review and review them? And that reviewer would remove the class once the text is clarified/corrected.
I had a look at the annotated slides, and it isn't really necessary to follow ascending/descending letters like in https://arkindex.teklia.com/element/4f28a97e-850d-4339-bf60-cdeb7d1b7d20?highlight=ee1d141c-3e54-4dbd-9484-6b284b17ed90; for HTR what matters most is the "middle" part, and whatever model we use later for text line detection will likely not create lines with shapes like this. Otherwise it all looks fine!
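On the point above about the annotations not being hierarchical: a minimal sketch of how one might decide from the polygons alone whether a text line sits "inside" a label. This is only an illustration, not what Arkindex or Teklia actually do. The function names are hypothetical, polygons are assumed to be lists of (x, y) vertices, and the test only checks whether the mean of the line's vertices falls inside the label polygon.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: toggle on every edge crossed by a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through pt can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def vertex_mean(polygon: List[Point]) -> Point:
    """Average of the vertices, used here as a cheap stand-in for the centre."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def line_in_label(line_polygon: List[Point], label_polygon: List[Point]) -> bool:
    """Hypothetical rule: a text line belongs to a label if its centre is inside it."""
    return point_in_polygon(vertex_mean(line_polygon), label_polygon)
```

A stricter variant could require every vertex of the line to be inside the label, at the cost of rejecting lines that were drawn slightly over the label border.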
@martinteklia and @mlbonhomme between myself and Niki (one of my team) we have annotated the first 78 specimens in the Pinned Insects (NHMUK) dataset. We can annotate more when required.
Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)
Do you need anything more from us on the pinned insects? Do you need more for initial testing on the other datasets (e.g. #3 and #7 )?
Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)
If it is possible it would be much better to include additional language labels
@martinteklia and @mlbonhomme between myself and Niki (one of my team) we have annotated the first 78 specimens in the Pinned Insects (NHMUK) dataset. We can annotate more when required.
Thanks! We'll try to train an initial model from the 78 examples, but most likely we'll need more annotated data to improve the model.
Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)
Yes, indicating the language is useful, because it allows us to better analyze the errors. Maybe the model won't work well on French, because there are only a few examples of annotated data.
Do you need anything more from us on the pinned insects? Do you need more for initial testing on the other datasets (e.g. #3 and #7 )?
For the pinned insects it should be ok for the initial training. On Herbarium and Microscope slides there are neither specimen nor label annotations, only text lines.
We're working on setting up LabelStudio for annotating the named entities of pinned insects. There's a possibility to split the data into tabs, so each annotator could choose their own tab and there would be no conflicts (two people annotating the same paragraph).
How many annotators will there be? (How many tabs should we create?) @llivermore
@martinteklia for the first 100 or so it will be three of us but two of us are likely to do less. Are you proposing that each annotator gets assigned a fixed number of specimens or lines? For example:
We are likely to have more annotators in the larger ~1,000 specimen/page dataset.
@llivermore Sorry for the delay.
In the end, instead of tabs there are 3 projects: one has 60% of the examples and the other two have 20% each. It's up to you to decide who will annotate which one.
Could you give me the emails of the annotators, so I can send the sign-up link?
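For reference, a 60/20/20 split like this can be sketched in a few lines. How the projects were actually built is not stated in this thread; the function name, the shuffling, and the fixed seed below are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def split_tasks(
    task_ids: Sequence[str],
    shares: Sequence[float] = (0.6, 0.2, 0.2),
    seed: int = 0,
) -> List[List[str]]:
    """Split task ids into non-overlapping groups, one group per share."""
    ids = list(task_ids)
    # Shuffle with a fixed seed so the split is reproducible (an assumption).
    random.Random(seed).shuffle(ids)
    projects: List[List[str]] = []
    start = 0
    for i, share in enumerate(shares):
        # The last project takes whatever is left so no task is dropped.
        if i == len(shares) - 1:
            end = len(ids)
        else:
            end = start + round(share * len(ids))
        projects.append(ids[start:end])
        start = end
    return projects
```

Because every task ends up in exactly one project, two annotators can never be working on the same paragraph.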
@llivermore the annotation presentation is in the attachment
@martinteklia we have finished the named entity tagging. I will have a few questions and one of the digitisers has noticed some errors in the transcribed text.
From Pete Wing on Annotator 2 (I will need to make some decisions on some of these):
@martinteklia A question both Pete and I had is, "can/should a single string/set of words have more than one category applied to it? E.g. a determination may encompass the taxonomic name, authority, determiner name and date of determination, so the main category for this data is determination but the constituent parts have their own, separate categories."
See example below where we have a collection/donation, indicated by the "from" prefix before the person's name "K. C. Liew".
@llivermore The transcription errors will have to be fixed on arkindex - it's not possible to do it in labelstudio.
The errors must be fixed at text_line level. From the lines we will generate new paragraph-level transcriptions.
Those will be imported again to labelstudio for NER annotation. The previous task with the incorrect transcription must be deleted first.
A link to the page on arkindex will be displayed in the labelstudio task to make it easier to find it on arkindex and fix the transcription. (Unfortunately we're unable to make the link clickable)
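To make the "paragraph from lines" step concrete, here is a minimal sketch. It is not the actual Teklia export code: the `TextLine` structure, the top-to-bottom then left-to-right ordering, and joining with newlines are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class TextLine(NamedTuple):
    """Hypothetical stand-in for a text_line element and its transcription."""
    polygon: List[Tuple[int, int]]
    text: str


def paragraph_transcription(lines: List[TextLine]) -> str:
    """Join line transcriptions in an approximate reading order."""

    def reading_key(line: TextLine) -> Tuple[int, int]:
        xs = [x for x, _ in line.polygon]
        ys = [y for _, y in line.polygon]
        # Sort by the top edge first, then by the left edge.
        return (min(ys), min(xs))

    ordered = sorted(lines, key=reading_key)
    # Skip lines whose transcription is empty after trimming whitespace.
    return "\n".join(line.text.strip() for line in ordered if line.text.strip())
```

With a scheme like this, fixing a transcription at text_line level and regenerating is enough to get a corrected paragraph, which is why the old task has to be replaced rather than edited.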
@martinteklia A question both Pete and I had is, "can/should a single string/set of words have more than one category applied to it? E.g. a determination may encompass the taxonomic name, authority, determiner name and date of determination, so the main category for this data is determination but the constituent parts have their own, separate categories."
See example below where we have a collection/donation, indicated by the "from" prefix before the person's name "K. C. Liew".
The NER tool we use doesn't support nested entities (multiple categories). So, as we said earlier, for the MVP we won't support nested entities either. It can be a future development.
Task #3224 - '*' denotes the specimen was bred, would that be an appropriate category?
If that's always the case it might be easier to have a process with custom logic like: if no taxon entity found and text contains * then it is bred.
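As a rough sketch of that custom logic (the function name and the entity representation are hypothetical; entities are assumed to be (category, text) pairs produced by the NER step):

```python
from typing import Iterable, Tuple

# Hypothetical representation of one recognised entity: (category, text).
Entity = Tuple[str, str]


def is_bred(text: str, entities: Iterable[Entity]) -> bool:
    """Apply the rule: no taxon entity found and the text contains '*'."""
    has_taxon = any(category == "taxon" for category, _ in entities)
    return not has_taxon and "*" in text
```

This keeps "bred" out of the entity categories entirely and treats it as a post-processing flag, which only makes sense if '*' always carries that meaning on these labels.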
All nested entities have been removed.
@martinteklia The following need reloading from Arkindex: https://arkindex.teklia.com/element/6ac2db87-6569-4290-8fd6-b2ee2fc6be7a https://arkindex.teklia.com/element/03fe9029-3c3d-43bd-9ce8-b36e2cd94975 https://arkindex.teklia.com/element/00964e92-1f98-4ea7-91fd-40cdeec91b44 https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d https://arkindex.teklia.com/element/8dfcfc38-33bd-41e6-89f7-92e0e5aa9926
The following need checking by Pete (and probably reloading): https://arkindex.teklia.com/element/ec353ce8-a1d7-400a-817e-f79d54f17016 https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d https://labelstudio.arkindex.org/projects/4/data?tab=94&task=3182
@martinteklia The following need reloading from Arkindex: https://arkindex.teklia.com/element/6ac2db87-6569-4290-8fd6-b2ee2fc6be7a https://arkindex.teklia.com/element/03fe9029-3c3d-43bd-9ce8-b36e2cd94975 https://arkindex.teklia.com/element/00964e92-1f98-4ea7-91fd-40cdeec91b44 https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d https://arkindex.teklia.com/element/8dfcfc38-33bd-41e6-89f7-92e0e5aa9926
The 5 pages have been reloaded from arkindex to label studio @llivermore
@martinteklia I have finished labelling all known entities - I think we are ready to train and evaluate! :)
Note for myself: the trained model doesn't seem to work very well with iCollections images. They have noisy backgrounds from pin holes but otherwise the labels are similar.
From Galaxy test:
Source specimen: BMNH(E)1851836
Training images of pinned insects, with: