clamsproject / aapb-annenv-role-filler-binder

Annotation Environment for Credits and Slate annotation

annotator sync-up practice data #11

Open keighrim opened 1 year ago

keighrim commented 1 year ago

A thread to keep a record of the set of data (videos or images) that we can use to train new annotators.

keighrim commented 1 year ago

So the batch is pushed via https://github.com/clamsproject/aapb-annotations/commit/1ad4d54e591c92ad6d33972e941f27bc78cbce38 .

How I selected: from the 27 randomly sampled videos, I eliminated any videos that were annotated during the "development" rounds by either G or J. Then, among the remaining videos, I picked 5 GUIDs that all have different middle numbers (259, 507, 512, 516, 525), hoping that the number reflects some sort of clustering.

And here's the # of images under each video.

$ for l in $(cat  batches/creditparsing-11.txt); do echo $l ; ls ~/creditparsing-data/images/$l.* | wc -l ; done
cpb-aacip-259-nv998n13
46
cpb-aacip-507-930ns0mg7g
44
cpb-aacip-512-cj87h1fh8t
22
cpb-aacip-516-3t9d50gq8v
60
cpb-aacip-525-w66930q54c
72
keighrim commented 1 year ago

The next step will be writing a piece of code to measure inter-annotator agreement (IAA). A few things to think about while implementing the IAA calculation:

  1. There are three sets of "values" in the annotations, so how should we calculate IAA for each? For the sets of strings (roles and fillers), we can use an IoU-type similarity measure; for the links, we can use a pairwise F score (a rough sketch follows below this list).
    1. text spans tagged as "role"s
    2. text spans tagged as "filler"s
    3. linkage between a role and a filler
  2. There could be typos, corrections, or over-corrections in string values. Thus, we might want to use some kind of edit-distance measure to "fuzzy"-match names.
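A minimal sketch of what those measures could look like, assuming each annotator's work on an image is reduced to a set of role strings, a set of filler strings, and a set of (role, filler) link tuples; the function names, threshold, and data layout here are hypothetical, not anything the annotation environment already provides:

```python
from difflib import SequenceMatcher


def iou(set_a, set_b):
    """Intersection-over-union for two sets of strings (e.g. one annotator's
    role strings for an image vs. the other annotator's)."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def pairwise_f1(links_a, links_b):
    """Pairwise F score over (role, filler) link tuples; symmetric, so it does
    not matter which annotator is treated as the reference."""
    if not links_a and not links_b:
        return 1.0
    tp = len(links_a & links_b)
    p = tp / len(links_b) if links_b else 0.0
    r = tp / len(links_a) if links_a else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def fuzzy_match(a, b, threshold=0.8):
    """Fuzzy string match (difflib ratio as a stand-in for an edit-distance
    measure) to tolerate minor typos when comparing names."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

With fuzzy matching in place, the plain set intersections above could be replaced by a greedy alignment of near-matching strings; the sketch keeps exact matching to stay short.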
snewman-aa commented 1 year ago

What do we do when some annotators skip a frame that other annotators annotated? Should we calculate IAA over just the annotators who did annotate it? Should one or more annotators skipping affect the score?

keighrim commented 1 year ago

Let's have two calculations: one over all the images, and the other over only the images that both (or all) annotators annotated. By the first criterion, if one annotator skipped an image while the other added some r-f pairs, all those pairs will be counted as "disagreement".
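One way those two numbers could be computed side by side; a sketch only, assuming a hypothetical layout where each annotator's work is a dict from image id to an annotation object (or None for a skipped image), and `frame_score` is any per-image agreement function such as the IoU above:

```python
def agreement_over_frames(annots_a, annots_b, frame_score):
    """Return agreement under both criteria discussed above."""
    all_frames = set(annots_a) | set(annots_b)
    shared_frames = {f for f in all_frames
                     if annots_a.get(f) is not None and annots_b.get(f) is not None}

    def average(frames):
        scores = []
        for f in frames:
            a, b = annots_a.get(f), annots_b.get(f)
            if a is None and b is None:
                scores.append(1.0)   # both skipped: count as agreement
            elif a is None or b is None:
                scores.append(0.0)   # one skipped: everything counts as disagreement
            else:
                scores.append(frame_score(a, b))
        return sum(scores) / len(scores) if scores else 1.0

    return {'all_images': average(all_frames),
            'shared_images_only': average(shared_frames)}
```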

jarumihooi commented 1 year ago

As an incomplete response to the above questions: We should generate multiple metrics...

On frame differences:

  1. One percentage for frames that are in any way different.

  2. Another percentage for frames where one or more annotators skipped while one or more did not.

For each frame, on character differences:

  1. On frames where there is a difference, we can evaluate how many characters are different for that frame. E.g. Annot1 Role: "Producur-" vs Annot2 Role: "Producer" = 2/9 char difference. Annot1 Role: "Producur-" Filler: "John Smith" vs Annot2 Role: "Producur-" Filler: "-John Smith" (assuming other similarities within the categories) = 2/20 char diff. See the sketch below for one way to compute this.

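A minimal sketch of that per-string character difference, assuming we normalize a plain Levenshtein distance by the longer string's length (the helper names are hypothetical):

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def char_diff_fraction(a, b):
    """Character difference as a fraction of the longer string's length,
    e.g. 'Producur-' vs 'Producer' -> 2/9."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```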
A conversation with Marc and Sam also included using modified BLEU scores, and considering dropping "linkage" errors to priority 2 for now (more details to be provided by Sam if needed).