clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

batch issue with chyron annotation (orig: add chyron readme/guideline) #24

Closed wricketts closed 10 months ago

wricketts commented 1 year ago

Because

The newshour-chyron project contains an empty guidelines.md and an empty README.md. As stated in the repository README file, we want each subdirectory to contain its own README.md detailing annotation project-specific information (e.g. project name, annotator demographics, annotation environment information, gold generation code dependencies, etc.). Each subdirectory should also contain a guidelines.md with relevant annotation guidelines used in the annotation project.

Done when

Additional context

No response

keighrim commented 1 year ago

For the record, the chyron annotation currently covers these ~11~ 10 items:

$ cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" | sort -u
cpb-aacip-507-154dn40c26
cpb-aacip-507-6w96689725
cpb-aacip-507-9882j68s35
cpb-aacip-507-bz6154fc44
cpb-aacip-507-bz6154fc44 (1)
cpb-aacip-507-cf9j38m509
cpb-aacip-507-nk3610wp6s
cpb-aacip-507-pr7mp4wf25
cpb-aacip-507-vd6nz81n6r
cpb-aacip-525-028pc2v94s
cpb-aacip-525-bg2h70914g
$ cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' | wc -l
11

but the directory says it was done on batch2, which has 24 items in it.
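The count drops from 11 to 10 only once the `(1)`-suffixed copy is folded into its original. A minimal sketch of that normalization, assuming one file name per line on stdin and that a trailing ` (N)` marks a duplicate copy; `fnames_to_guids` is a hypothetical helper, not something in the repository, and the `.mp4` extension in the demo is made up:

```shell
# Hypothetical helper: turn annotation file names into unique GUIDs.
# Assumption: everything from the first '.' onward is an extension, and a
# trailing " (N)" marks a duplicate copy of the same video.
fnames_to_guids() {
  sed -E 's/\..*$//; s/ \([0-9]+\)$//' | sort -u
}

# Demo with made-up extensions; prints two GUIDs, one per line.
printf '%s\n' \
  'cpb-aacip-507-bz6154fc44.mp4' \
  'cpb-aacip-507-bz6154fc44 (1).mp4' \
  'cpb-aacip-525-bg2h70914g.mp4' | fnames_to_guids
```

Piping the `jq ... -r` output from the commands above through this helper in place of the `grep | sort -u` step would be one way to get the deduplicated list directly.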

And here is the diff:

$ diff <(sort -u batches/batch2.txt ) <(cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" | sort -u)
2,3d1
< cpb-aacip-507-1v5bc3tf81
< cpb-aacip-507-4t6f18t178
5d2
< cpb-aacip-507-7659c6sk7z
7a5
> cpb-aacip-507-bz6154fc44 (1)
9,10d6
< cpb-aacip-507-m61bk17f5g
< cpb-aacip-507-n29p26qt59
12d7
< cpb-aacip-507-pc2t43js98
14,16d8
< cpb-aacip-507-r785h7cp0z
< cpb-aacip-507-v11vd6pz5w
< cpb-aacip-507-v40js9j432
18,20d9
< cpb-aacip-507-vm42r3pt6h
< cpb-aacip-507-zk55d8pd1h
< cpb-aacip-507-zw18k75z4h
22,23d10
< cpb-aacip-525-3b5w66b279
< cpb-aacip-525-9g5gb1zh9b

keighrim commented 1 year ago

And here is another diff, between those 10 items and the 20 items in the aapb-collaboration-21 batch:

$ diff <(cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" | sort -u)  <(ls newshour-namedentity/220601-aapb-collaboration-21/ | cut -c 1-24 | sort -u)
1a2,5
> cpb-aacip-507-1v5bc3tf81
> cpb-aacip-507-4746q1t25k
> cpb-aacip-507-4t6f18t178
> cpb-aacip-507-6h4cn6zk04
2a7
> cpb-aacip-507-7659c6sk7z
4,5d8
< cpb-aacip-507-bz6154fc44
< cpb-aacip-507-bz6154fc44 (1)
6a10
> cpb-aacip-507-n29p26qt59
7a12
> cpb-aacip-507-pc2t43js98
8a14,16
> cpb-aacip-507-r785h7cp0z
> cpb-aacip-507-v11vd6pz5w
> cpb-aacip-507-v40js9j432
10,11c18,20
< cpb-aacip-525-028pc2v94s
< cpb-aacip-525-bg2h70914g
---
> cpb-aacip-507-vm42r3pt6h
> cpb-aacip-507-zk55d8pd1h
> cpb-aacip-507-zw18k75z4h

keighrim commented 1 year ago

I couldn't find any documentation on how those 24 items in batch2 were picked, how they differ from the aapb-collaboration-21 batch, or why only 10 of them ended up in the annotation files.

Some additional context;


@caseyedavis12, @kelleyl do you have any idea or recollection of how they were compiled and annotated?

keighrim commented 1 year ago

Also, on the Brandeis server, we don't have the cpb-aacip-507-vd6nz81n6r video file.

jarumihooi commented 10 months ago

Original Title: add guidelines.md and README.md for chyron annotation

These tasks are completed by #40. However, an issue has been brought up about the batches.

At the moment, the raw data has 2 files: one JSON lists 14 videos and the other lists 10. The golds have 24 items.

It seems like at some point @keighrim may have collapsed the unnecessary division between the subdirectories in gold and combined the data. Is it possible that the 10 of 24 is just a subgroup from the annotation effort? See this commit: https://github.com/clamsproject/aapb-annotations/commit/d1fb17fbf1722826fb951565a15aa24a916224ab Would you mind checking whether that's the case?

keighrim commented 10 months ago

Can confirm that the 10/14 division across the two JSON files actually adds up to the entire set of batch2 GUIDs, except for the (1)-suffixed file, which seems to be a duplicate anyway.

$ diff <(cat batches/batch2.txt | sort -u) <(for j in newshour-chyron/220701-batch2/*.json ; do cat $j | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" ; done |sort -u )
1d0
< # it is not clear how this batch was chosen in the past: https://github.com/clamsproject/aapb-annotations/issues/24#issuecomment-1638870043
8a8
> cpb-aacip-507-bz6154fc44 (1)
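
The only differences left in that diff are the comment line in the batch file and the `(1)` duplicate. A minimal sketch of a comparison that skips comment lines up front, assuming plain-text lists with one GUID per line; `guid_diff` is a hypothetical helper, not part of the repository:

```shell
# Hypothetical helper: show GUIDs present in only one of two list files,
# ignoring '#' comment lines and blank lines.
# Output mimics diff: '< ' for first-file-only, '> ' for second-file-only.
guid_diff() {
  a=$(mktemp) && b=$(mktemp) || return 1
  grep -v -E '^(#|$)' "$1" | sort -u > "$a"
  grep -v -E '^(#|$)' "$2" | sort -u > "$b"
  comm -23 "$a" "$b" | sed 's/^/< /'   # lines only in the first list
  comm -13 "$a" "$b" | sed 's/^/> /'   # lines only in the second list
  rm -f "$a" "$b"
}

# Usage (the second path is a placeholder for a file of extracted GUIDs):
#   guid_diff batches/batch2.txt /tmp/annotated-guids.txt
```

`comm` is used here instead of `diff` because it treats both inputs as sorted sets, so the output does not depend on line positions.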

For future reference, here's another diff between aapb-collaboration-21 (20 GUIDs) and batch2 (24 GUIDs):

$ diff batches/batch2.txt batches/aapb-collaboration-21.txt
1c1
< # it is not clear how this batch was chosen in the past: https://github.com/clamsproject/aapb-annotations/issues/24#issuecomment-1638870043
---
> # see https://github.com/clamsproject/aapb-collaboration/issues/21 for more info.
3a4
> cpb-aacip-507-4746q1t25k
4a6
> cpb-aacip-507-6h4cn6zk04
8d9
< cpb-aacip-507-bz6154fc44
10d10
< cpb-aacip-507-m61bk17f5g
22,25d21
< cpb-aacip-525-028pc2v94s
< cpb-aacip-525-3b5w66b279
< cpb-aacip-525-9g5gb1zh9b
< cpb-aacip-525-bg2h70914g

Summary: going from aapb-collaboration-21 to batch2, 2 GUIDs are out and 6 are in (20 - 2 + 6 = 24).
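
The tally can also be computed rather than read off the diff. A minimal sketch, again assuming one GUID per line with `#` comments; `batch_delta` is a hypothetical helper name:

```shell
# Hypothetical helper: count GUIDs unique to each of two batch files,
# ignoring '#' comment lines and blank lines.
batch_delta() {
  f1=$(mktemp) && f2=$(mktemp) || return 1
  grep -v -E '^(#|$)' "$1" | sort -u > "$f1"
  grep -v -E '^(#|$)' "$2" | sort -u > "$f2"
  # tr strips the padding some wc implementations add around the count.
  n1=$(comm -23 "$f1" "$f2" | wc -l | tr -d '[:space:]')
  n2=$(comm -13 "$f1" "$f2" | wc -l | tr -d '[:space:]')
  rm -f "$f1" "$f2"
  echo "only_in_first=$n1 only_in_second=$n2"
}

# Usage:
#   batch_delta batches/batch2.txt batches/aapb-collaboration-21.txt
```

Going by the diff above, that call should report 6 GUIDs only in batch2 and 2 only in aapb-collaboration-21, provided neither file has changed since.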

keighrim commented 10 months ago

Looks like all the issues and questions are resolved. Closing the issue.