DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Curate pinned insects dataset #2

Open benscott opened 3 years ago

benscott commented 3 years ago

Training images of pinned insects, with:

llivermore commented 3 years ago

@mlbonhomme and @martinteklia

I have started annotating the Pinned Insects in ArkIndex (https://arkindex.teklia.com/element/f53b200c-8f15-4e5d-b4f5-6db047b95d71) and had a couple of questions:

Can you add some additional types?

How many examples should I annotate?

For our DataMatrix barcodes do you want the exact barcode region labelled as in this image? Pinned Insect ArkIndex Example

martinteklia commented 3 years ago

I added the additional types you asked.

How many examples should I annotate?

I think 50 pages should be enough at first.

For our DataMatrix barcodes do you want the exact barcode region labelled as in this image?

Yes, that's good.

To add transcriptions to text lines you should click on the button that looks like A+, instead of changing the name of the text line element.

@mlbonhomme wrote a doc at some point about annotations and she wanted to have a discussion with you about it.

benscott commented 3 years ago

Created repository for training datasets: https://github.com/DiSSCo/sdr-datasets

mlbonhomme commented 3 years ago

There is a now-outdated annotation guide here: https://gitlab.com/mlbonhomme/arkindex_annotation/-/tree/master/SYNTHESYS. I'll be updating it today and will share the link here when it's done!

mlbonhomme commented 3 years ago

Here are the updated annotation guidelines: https://notes.teklia.com/s/zO3O4i_-p. I am of course available to discuss them if there is anything strange or wrong about them, and if you need different element types they can always be added.

I took a look at the existing annotations, and I think some of them will need to be deleted or corrected: the element type "decoration" was used, and some transcriptions were entered as the names of text line elements. You should all have the necessary user rights to do this.

llivermore commented 3 years ago

@mlbonhomme is Specimen 010266087 correctly annotated?

mlbonhomme commented 3 years ago

@llivermore It looks fine to me! Maybe label 12 should be a barcode/qrcode if we want to try to identify those? Or maybe only the QR code itself should be annotated as such. I don't really know which would be best, or whether it would make much of a difference when training later on.

llivermore commented 3 years ago

@mlbonhomme Apologies - Label 12 was a duplicate of Barcode 13. I had accidentally changed the names of the types and duplicated it when attempting to tidy >_< It should be fixed now. I'll go through and fix the rest and will ask one of my team to do some more next Monday (2021-05-10).

llivermore commented 3 years ago

@mlbonhomme I have annotated the first 20 specimen images. Could you check them and give feedback/suggestions for improvement?

I had a few questions/notes:

mlbonhomme commented 3 years ago

I had a look at the annotated slides, and it isn't really necessary to follow ascending/descending letters as in https://arkindex.teklia.com/element/4f28a97e-850d-4339-bf60-cdeb7d1b7d20?highlight=ee1d141c-3e54-4dbd-9484-6b284b17ed90. For HTR what matters most is the "middle" part, and whatever model we use later for text line detection will likely not create lines with shapes like this. Otherwise it all looks fine!

llivermore commented 3 years ago

@martinteklia and @mlbonhomme between myself and Niki (one of my team) we have annotated the first 78 specimens in the Pinned Insects (NHMUK) dataset. We can annotate more when required.

Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)

Do you need anything more from us on the pinned insects? Do you need more for initial testing on the other datasets (e.g. #3 and #7 )?

qgroom commented 3 years ago

Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)

If it is possible it would be much better to include additional language labels

martinteklia commented 3 years ago

@martinteklia and @mlbonhomme between myself and Niki (one of my team) we have annotated the first 78 specimens in the Pinned Insects (NHMUK) dataset. We can annotate more when required.

Thanks! We'll try to train an initial model from the 78 examples, but most likely we'll need more annotated data to improve the model.

Is it useful to you to indicate the language of label text? In the previously curated herbarium sheet dataset, language was one of the selection requirements. I noticed we had a couple of French labels (e.g., Specimen 010622079)

Yes, indicating the language is useful, because it allows us to better analyze the errors. Maybe the model won't work well on French, because there are only a few examples of annotated data.
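As an illustration of how a language tag could support that kind of error analysis, here is a minimal sketch that groups a character error rate (CER) by language. The tuple layout `(language, reference, prediction)` and the function names are assumptions made for this example; nothing here reflects the actual evaluation code used in the project.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer_by_language(samples):
    """samples: iterable of (language, reference, prediction) tuples (hypothetical layout).

    Returns {language: total edits / total reference characters}.
    """
    edits = defaultdict(int)
    chars = defaultdict(int)
    for lang, ref, pred in samples:
        edits[lang] += levenshtein(ref, pred)
        chars[lang] += len(ref)
    return {lang: edits[lang] / chars[lang] for lang in chars if chars[lang] > 0}
```

With only a handful of French labels, the French figure would rest on very few characters, so it should be read as an indication rather than a reliable estimate.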

Do you need anything more from us on the pinned insects? Do you need more for initial testing on the other datasets (e.g. #3 and #7 )?

For the pinned insects it should be ok for the initial training. On Herbarium and Microscope slides there are no specimen or label annotations - only text lines.

martinteklia commented 3 years ago

We're working on setting up LabelStudio for annotating the named entities of pinned insects. There's a possibility to split the data into tabs, so each annotator could choose their own tab and there would be no conflicts (two people annotating the same paragraph).

How many annotators will there be? (How many tabs should we create?) @llivermore

llivermore commented 3 years ago

@martinteklia for the first 100 or so it will be three of us but two of us are likely to do less. Are you proposing that each annotator gets assigned a fixed number of specimens or lines? For example:

We are likely to have more annotators in the larger ~1,000 specimen/page dataset.

martinteklia commented 3 years ago

@llivermore Sorry for the delay.

In the end, instead of tabs there are 3 projects: one of which has 60% of the examples and two others have 20%. It's up to you to decide who will annotate which one.
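A 60/20/20 split of this kind could be produced along the following lines. This is only a sketch: the function name, the use of integer task IDs, and the fixed random seed are assumptions for illustration, and it does not describe how the three projects were actually created.

```python
import random


def split_tasks(task_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle task IDs and cut them into disjoint groups sized by `fractions`.

    The last group takes whatever remains, so every task is assigned exactly once.
    """
    ids = list(task_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n = len(ids)
    projects = []
    start = 0
    cumulative = 0.0
    for frac in fractions[:-1]:
        cumulative += frac
        end = round(cumulative * n)
        projects.append(ids[start:end])
        start = end
    projects.append(ids[start:])
    return projects
```

Because the groups are disjoint, two annotators working in different projects cannot end up annotating the same paragraph.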

Could you give me the emails of the annotators, so I could send the sign-up link?

martinteklia commented 3 years ago

@llivermore the annotation presentation is in the attachment

synthesys_annotation_guidelines.odp

llivermore commented 2 years ago

@martinteklia we have finished the named entity tagging. I will have a few questions, and one of the digitisers has noticed some errors in the transcribed text.

llivermore commented 2 years ago

From Pete Wing on Annotator 2 (I will need to make some decisions on some of these):

llivermore commented 2 years ago

@martinteklia A question both Pete and I had is, "can/should a single string/set of words have more than one category applied to it? E.g. a determination may encompass the taxonomic name, authority, determiner name and date of determination, so the main category for this data is determination but the constituent parts have their own, separate categories."

See example below where we have a collection/donation, indicated by the "from" prefix before the person's name "K. C. Liew".

Composite example

martinteklia commented 2 years ago

@llivermore The transcription errors will have to be fixed on arkindex; it's not possible to do it in labelstudio. The errors must be fixed at text_line level. From the corrected lines we will generate new paragraph-level transcriptions, which will be imported into labelstudio again for NER annotation. The previous tasks with incorrect transcriptions must be deleted first. A link to the page on arkindex will be displayed in the labelstudio task to make it easier to find the page on arkindex and fix the transcription. (Unfortunately, we're unable to make the link clickable.)
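The step of building a paragraph-level transcription from corrected text lines might look roughly like the sketch below. The dictionary keys `text`, `x` and `y`, and the assumption that reading order follows the top-left corner of each line, are hypothetical; the real Arkindex export format and ordering logic are not shown in this thread.

```python
def paragraph_from_lines(lines):
    """Join text lines into one paragraph transcription.

    lines: list of dicts with 'text', 'x' and 'y' keys (hypothetical layout),
    where (x, y) is the top-left corner of the line on the page.
    Lines are ordered top to bottom, then left to right; empty lines are skipped.
    """
    ordered = sorted(lines, key=lambda line: (line["y"], line["x"]))
    return "\n".join(line["text"].strip() for line in ordered if line["text"].strip())
```

Since the paragraph is derived from the lines, a correction made at text_line level carries through automatically once the paragraph is regenerated.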

@martinteklia A question both Pete and I had is, "can/should a single string/set of words have more than one category applied to it? E.g. a determination may encompass the taxonomic name, authority, determiner name and date of determination, so the main category for this data is determination but the constituent parts have their own, separate categories."

See example below where we have a collection/donation, indicated by the "from" prefix before the person's name "K. C. Liew".

The NER tool we use doesn't support nested entities (multiple categories), so we said earlier that for the MVP we won't support nested entities either. It can be a future development.
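To find annotations that would violate this flat-entity constraint, a check along these lines could be run over exported spans. It is a sketch only: the `(start, end, label)` tuple layout with an exclusive end offset and the label names in the usage example are assumptions, not the export format of the NER tool.

```python
def find_overlaps(entities):
    """Return pairs of entity spans whose character ranges overlap or nest.

    entities: list of (start, end, label) tuples, end exclusive (hypothetical layout).
    """
    # Sort by start offset; for equal starts, put the longer span first.
    ordered = sorted(entities, key=lambda e: (e[0], -e[1]))
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b[0] >= a[1]:
                break  # spans are sorted by start, so nothing later can overlap a
            overlaps.append((a, b))
    return overlaps
```

For example, a "determination" span covering characters 0 to 30 together with a "taxon" span covering 0 to 12 would be reported as one overlapping pair, while spans that merely sit next to each other are not.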

Task #3224 - '*' denotes the specimen was bred, would that be an appropriate category?

If that's always the case, it might be easier to have a process with custom logic, e.g. if no taxon entity is found and the text contains *, then the specimen is bred.
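The suggested rule is simple enough to sketch directly. The entity layout `(start, end, label)` and the label name `"taxon"` are assumptions for illustration, and the rule only holds if '*' really does always denote a bred specimen in this collection.

```python
def infer_bred(text, entities, taxon_label="taxon"):
    """Apply the rule: no taxon entity found and the text contains '*' means bred.

    entities: list of (start, end, label) tuples (hypothetical layout).
    taxon_label: name of the taxon category (assumed; adjust to the real label set).
    """
    has_taxon = any(label == taxon_label for _, _, label in entities)
    return (not has_taxon) and "*" in text
```

Handling this as post-processing keeps the entity label set unchanged, at the cost of misclassifying any label where '*' is used with another meaning.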

llivermore commented 2 years ago

All nested entities have been removed.

@martinteklia The following need reloading from Arkindex:
https://arkindex.teklia.com/element/6ac2db87-6569-4290-8fd6-b2ee2fc6be7a
https://arkindex.teklia.com/element/03fe9029-3c3d-43bd-9ce8-b36e2cd94975
https://arkindex.teklia.com/element/00964e92-1f98-4ea7-91fd-40cdeec91b44
https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d
https://arkindex.teklia.com/element/8dfcfc38-33bd-41e6-89f7-92e0e5aa9926

The following need checking by Pete (and probably reloading):
https://arkindex.teklia.com/element/ec353ce8-a1d7-400a-817e-f79d54f17016
https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d
https://labelstudio.arkindex.org/projects/4/data?tab=94&task=3182

martinteklia commented 2 years ago

@martinteklia The following need reloading from Arkindex:
https://arkindex.teklia.com/element/6ac2db87-6569-4290-8fd6-b2ee2fc6be7a
https://arkindex.teklia.com/element/03fe9029-3c3d-43bd-9ce8-b36e2cd94975
https://arkindex.teklia.com/element/00964e92-1f98-4ea7-91fd-40cdeec91b44
https://arkindex.teklia.com/element/162b73b0-ae36-45cf-94b9-1893f8768e4d
https://arkindex.teklia.com/element/8dfcfc38-33bd-41e6-89f7-92e0e5aa9926

The 5 pages have been reloaded from arkindex to label studio @llivermore

llivermore commented 2 years ago

@martinteklia I have finished labelling all known entities - I think we are ready to train and evaluate! :)

llivermore commented 2 years ago

Note for myself: the trained model doesn't seem to work very well with iCollections images. They have noisy backgrounds from pin holes but otherwise the labels are similar.

From Galaxy test: image

Source specimen: BMNH(E)1851836