keighrim opened 1 year ago
So the batch is pushed via https://github.com/clamsproject/aapb-annotations/commit/1ad4d54e591c92ad6d33972e941f27bc78cbce38 .
How I selected: from the 27 randomly sampled videos, I eliminated any videos that were annotated during the "development" rounds by either G or J. Then, among the remaining videos, I picked 5 GUIDs that all have different middle numbers (259, 507, 512, 516, 525), hoping that the number indicates some sort of clustering.
And here's the # of images under each video.
```
$ for l in $(cat batches/creditparsing-11.txt); do echo $l ; ls ~/creditparsing-data/images/$l.* | wc -l ; done
cpb-aacip-259-nv998n13
46
cpb-aacip-507-930ns0mg7g
44
cpb-aacip-512-cj87h1fh8t
22
cpb-aacip-516-3t9d50gq8v
60
cpb-aacip-525-w66930q54c
72
```
The next step will be writing a piece of code to measure inter-annotator agreement (IAA). A few things to think about while implementing the IAA calculation:
What do we do when some annotators skip a frame that other annotators annotated? Should we calculate the IAA only for the annotators that did annotate it? Should one or more annotators skipping affect the score?
Let's have two calculations: one with all the images, and the other with only the images that both (or all) annotators annotated. By the first criterion, if one annotator skipped an image while the other added some r-f pairs, all of those pairs will be counted as "disagreement".
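A minimal sketch of both calculations, assuming each annotator's work is loaded as a mapping from frame ID to the set of (role, filler) pairs, with skipped frames simply absent from the mapping; this data layout and the variable names are hypothetical, not the repo's actual format.

```python
def pairwise_agreement(ann_a: dict, ann_b: dict, common_only: bool = False) -> float:
    """Ratio of agreed (role, filler) pairs over all pairs produced by either annotator.

    ann_a / ann_b: {frame_id: set of (role, filler) tuples}; a skipped frame
    has no entry at all, so with common_only=False every pair on such a
    frame counts as a disagreement.
    """
    frames = set(ann_a) & set(ann_b) if common_only else set(ann_a) | set(ann_b)
    agreed = total = 0
    for frame in frames:
        pairs_a = ann_a.get(frame, set())
        pairs_b = ann_b.get(frame, set())
        agreed += len(pairs_a & pairs_b)
        total += len(pairs_a | pairs_b)
    return agreed / total if total else 1.0

# criterion 1: all images, skips penalized
# score_all = pairwise_agreement(ann_g, ann_j)
# criterion 2: only images both annotators annotated
# score_common = pairwise_agreement(ann_g, ann_j, common_only=True)
```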
As an incomplete response to the above questions: We should generate multiple metrics...
On frame differences
====
- One percentage for frames that differ in any way.
- Another percentage for frames where one or more annotators skipped while one or more did not (a sketch computing both percentages follows this list).
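A hedged sketch of the two frame-level percentages, under the same hypothetical per-annotator `{frame_id: set of (role, filler) pairs}` layout as above:

```python
def frame_level_percentages(annotations: dict) -> tuple[float, float]:
    """annotations: {annotator_name: {frame_id: set of (role, filler) pairs}}.

    Returns (pct_frames_any_difference, pct_frames_skip_mismatch), where a
    skip mismatch means at least one annotator skipped a frame that at least
    one other annotator annotated.
    """
    all_frames = set().union(*(set(ann) for ann in annotations.values()))
    if not all_frames:
        return 0.0, 0.0
    any_diff = skip_mismatch = 0
    for frame in all_frames:
        per_annotator = [ann.get(frame) for ann in annotations.values()]
        annotated = [pairs for pairs in per_annotator if pairs is not None]
        skipped = len(annotated) < len(per_annotator)
        if skipped:
            skip_mismatch += 1   # one+ skipped while one+ did not
        if skipped or len({frozenset(pairs) for pairs in annotated}) > 1:
            any_diff += 1        # differs in any way (content or skipping)
    return 100 * any_diff / len(all_frames), 100 * skip_mismatch / len(all_frames)
```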
For each frame, on character differences
====
Conversation with Marc and Sam also included using modified BLEU scores and considering dropping "linkage" errors to priority 2 for now (more details to be provided by Sam if needed).
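The exact "modified BLEU" formulation isn't recorded here, so as a placeholder the character-level comparison could start from a generic string-similarity ratio and be swapped out once Sam shares the details; `difflib` is standard-library, everything else in this sketch is an assumption.

```python
import difflib

def char_similarity(text_a: str, text_b: str) -> float:
    """Rough character-level similarity in [0, 1] between two annotators'
    text for the same frame; a stand-in until the modified-BLEU variant
    discussed with Marc and Sam is pinned down."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()
```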
A thread to keep a record of the set of data (videos or images) that we can use to train new annotators.