Closed wricketts closed 10 months ago
For the record, the chyron annotation currently covers these 10 items (the listing shows 11 names because one of them carries a "(1)" suffix):
$ cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" | sort -u
cpb-aacip-507-154dn40c26
cpb-aacip-507-6w96689725
cpb-aacip-507-9882j68s35
cpb-aacip-507-bz6154fc44
cpb-aacip-507-bz6154fc44 (1)
cpb-aacip-507-cf9j38m509
cpb-aacip-507-nk3610wp6s
cpb-aacip-507-pr7mp4wf25
cpb-aacip-507-vd6nz81n6r
cpb-aacip-525-028pc2v94s
cpb-aacip-525-bg2h70914g
$ cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' | wc -l
11
but the directory name says it was done on batch2, which has 24 items in it.
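The 11-versus-10 mismatch above comes from the "(1)" copy surviving the grep, since that pattern keeps everything before the first dot. A minimal sketch of a stricter extraction follows. It assumes every GUID has the shape cpb-aacip-<digits>-<lowercase letters and digits>, which holds for the names in this thread but is not verified against the whole collection. The name fnames_to_guids is hypothetical.

```shell
# Hypothetical helper: read file names on stdin, print sorted unique GUIDs.
# Assumption: a GUID is "cpb-aacip-", digits, "-", then lowercase
# alphanumerics; anything after that (" (1)", the extension) is dropped.
fnames_to_guids() {
    grep -o -E '^cpb-aacip-[0-9]+-[a-z0-9]+' | sort -u
}

# Usage (not run here, needs jq and the annotation file):
#   jq -r '.file[].fname' newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json \
#     | fnames_to_guids | wc -l
```

With this, the "(1)" entry collapses into its base GUID, so the count should match the 10 distinct items rather than the 11 file names.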
And here is the diff:
$ diff <(sort -u batches/batch2.txt ) <(cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" | sort -u)
2,3d1
< cpb-aacip-507-1v5bc3tf81
< cpb-aacip-507-4t6f18t178
5d2
< cpb-aacip-507-7659c6sk7z
7a5
> cpb-aacip-507-bz6154fc44 (1)
9,10d6
< cpb-aacip-507-m61bk17f5g
< cpb-aacip-507-n29p26qt59
12d7
< cpb-aacip-507-pc2t43js98
14,16d8
< cpb-aacip-507-r785h7cp0z
< cpb-aacip-507-v11vd6pz5w
< cpb-aacip-507-v40js9j432
18,20d9
< cpb-aacip-507-vm42r3pt6h
< cpb-aacip-507-zk55d8pd1h
< cpb-aacip-507-zw18k75z4h
22,23d10
< cpb-aacip-525-3b5w66b279
< cpb-aacip-525-9g5gb1zh9b
And here is another diff, between those 10 items and the 20 items in the aapb-collaboration-21 batch:
$ diff <(cat newshour-chyron/220701-batch2/3da973a1_13Jul2022_13h30m59s.json | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" | sort -u) <(ls newshour-namedentity/220601-aapb-collaboration-21/ | cut -c 1-24 | sort -u)
1a2,5
> cpb-aacip-507-1v5bc3tf81
> cpb-aacip-507-4746q1t25k
> cpb-aacip-507-4t6f18t178
> cpb-aacip-507-6h4cn6zk04
2a7
> cpb-aacip-507-7659c6sk7z
4,5d8
< cpb-aacip-507-bz6154fc44
< cpb-aacip-507-bz6154fc44 (1)
6a10
> cpb-aacip-507-n29p26qt59
7a12
> cpb-aacip-507-pc2t43js98
8a14,16
> cpb-aacip-507-r785h7cp0z
> cpb-aacip-507-v11vd6pz5w
> cpb-aacip-507-v40js9j432
10,11c18,20
< cpb-aacip-525-028pc2v94s
< cpb-aacip-525-bg2h70914g
---
> cpb-aacip-507-vm42r3pt6h
> cpb-aacip-507-zk55d8pd1h
> cpb-aacip-507-zw18k75z4h
I couldn't find any documentation on how those 24 items in batch2 were picked, how they differ from the aapb-collaboration-21 batch, or why only 10 of them ended up in the annotation files.
Some additional context:
@caseyedavis12, @kelleyl do you have any idea or recollection of how they were compiled and annotated?
Also, on the Brandeis server, we don't have the cpb-aacip-507-vd6nz81n6r video file.
Original Title: add guidelines.md and README.md for chyron annotation
These tasks are completed by #40. However, an issue has been brought up about the batches.
At the moment, raw has 2 files: one JSON file lists 14 videos and the other lists 10. The golds have 24 items.
It seems that at some point @keighrim may have collapsed the unnecessary division between the subdirectories in gold and combined the data. Is it possible that the 10 of 24 was just a subgroup during the annotation effort? See this commit: https://github.com/clamsproject/aapb-annotations/commit/d1fb17fbf1722826fb951565a15aa24a916224ab Would you mind checking whether that's the case?
Can confirm that the 10/14 division across the two JSON files actually adds up to the entire set of batch2 GUIDs, except for the (1)-suffixed file, which seems to be a duplicate anyway.
$ diff <(cat batches/batch2.txt | sort -u) <(for j in newshour-chyron/220701-batch2/*.json ; do cat $j | jq '. | .file[].fname' -r | grep -o -E "^[^.]+" ; done |sort -u )
1d0
< # it is not clear how this batch was chosen in the past: https://github.com/clamsproject/aapb-annotations/issues/24#issuecomment-1638870043
8a8
> cpb-aacip-507-bz6154fc44 (1)
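The "1d0" hunk in the diff above is only the comment header of batches/batch2.txt, not a GUID. A small sketch for filtering such lines out before comparing is below. It assumes batch files hold one GUID per line with optional "#" comment lines, as the two batch files shown in this thread do. The name batch_guids is hypothetical.

```shell
# Hypothetical helper: print the sorted, unique GUIDs in a batch file,
# skipping '#' comment lines and blank lines.
batch_guids() {
    grep -v -E '^[[:space:]]*(#|$)' "$1" | sort -u
}

# Usage (bash, not run here): compare against the GUIDs pulled from the JSON files.
#   diff <(batch_guids batches/batch2.txt) <(... | sort -u)
```

With the header filtered out, the only remaining difference should be the (1)-suffixed entry.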
For future reference, here's another diff between aapb-collaboration-21 (20 GUIDs) and batch2 (24 GUIDs):
$ diff batches/batch2.txt batches/aapb-collaboration-21.txt
1c1
< # it is not clear how this batch was chosen in the past: https://github.com/clamsproject/aapb-annotations/issues/24#issuecomment-1638870043
---
> # see https://github.com/clamsproject/aapb-collaboration/issues/21 for more info.
3a4
> cpb-aacip-507-4746q1t25k
4a6
> cpb-aacip-507-6h4cn6zk04
8d9
< cpb-aacip-507-bz6154fc44
10d10
< cpb-aacip-507-m61bk17f5g
22,25d21
< cpb-aacip-525-028pc2v94s
< cpb-aacip-525-3b5w66b279
< cpb-aacip-525-9g5gb1zh9b
< cpb-aacip-525-bg2h70914g
Summary: relative to aapb-collaboration-21, batch2 leaves 2 GUIDs out and brings 6 new ones in.
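For anyone re-checking these counts later, here is a rough sketch that tallies the GUIDs unique to each batch file with comm instead of reading diff hunks by eye. It assumes one GUID per line plus optional "#" comment lines. The name batch_delta is hypothetical.

```shell
# Hypothetical helper: print "<only in first> <only in second>" for two
# batch files, ignoring '#' comment lines and blank lines.
batch_delta() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    grep -v -E '^[[:space:]]*(#|$)' "$1" | sort -u > "$tmp_a"
    grep -v -E '^[[:space:]]*(#|$)' "$2" | sort -u > "$tmp_b"
    # comm -23 keeps lines unique to the first file, comm -13 to the second;
    # tr strips the padding some wc implementations add.
    only_a=$(comm -23 "$tmp_a" "$tmp_b" | wc -l | tr -d ' ')
    only_b=$(comm -13 "$tmp_a" "$tmp_b" | wc -l | tr -d ' ')
    rm -f "$tmp_a" "$tmp_b"
    echo "$only_a $only_b"
}

# Usage (not run here):
#   batch_delta batches/batch2.txt batches/aapb-collaboration-21.txt
```

Going by the diff above, that call should report 6 GUIDs only in batch2 and 2 only in aapb-collaboration-21.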
Looks like all the issues and questions are resolved. Closing the issue.
Because
The newshour-chyron project contains an empty guidelines.md and an empty README.md. As stated in the repository README file, we want each subdirectory to contain its own README.md detailing annotation project-specific information (e.g. project name, annotator demographics, annotation environment information, gold generation code dependencies, etc.). Each subdirectory should also contain a guidelines.md with relevant annotation guidelines used in the annotation project.
Done when
Additional context
No response