clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

compile a batch for RFB annotation project #44

keighrim closed 9 months ago

keighrim commented 9 months ago

Compile a set of videos that (supposedly) contain various instances of "scenes with text" (https://github.com/clamsproject/app-swt-detection/issues/1) and are ideal for creating labeled data for the roles and fillers (key-value pairs) extraction and linker app.

A few things to consider:

  1. Consider going beyond NewsHour videos.
  2. Consider mixing different shows and collections, while keeping the balance between their sizes in mind.

keighrim commented 9 months ago

Here are the numbers of mp4 files we currently have:

$ for d in /llc_data/clams/wgbh/* ; do if [ -d "$d" ]; then echo "$d" ; find "$d" -name "*.mp4" | wc -l ; fi ; done
/llc_data/clams/wgbh/credits
97
/llc_data/clams/wgbh/great_depression
9
/llc_data/clams/wgbh/NewsHour
2896
/llc_data/clams/wgbh/NJN_Network
46
/llc_data/clams/wgbh/Peabody
492
/llc_data/clams/wgbh/sonyids_connecticut_pb_20180423101623
1132
/llc_data/clams/wgbh/sonyids_newjerseynetwork_20180423192048
3152
/llc_data/clams/wgbh/thumbdrive
5
/llc_data/clams/wgbh/transcripts
0
/llc_data/clams/wgbh/usbkey
0
/llc_data/clams/wgbh/wrvr
847
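Given how uneven these counts are, one possible way to mix collections without letting the largest ones dominate is to cap the number of files drawn from each directory. The following is only a rough sketch, not the procedure actually used for the batch: the function name `sample_batch` and the per-directory cap are hypothetical, and it assumes GNU `shuf` is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print up to PER_DIR randomly chosen mp4 paths
# from each immediate subdirectory of ROOT.
# Assumes GNU coreutils `shuf` is installed.
sample_batch() {
  local root="$1" per_dir="$2" d
  for d in "$root"/*/; do
    # Skip the unexpanded glob when ROOT has no subdirectories.
    [ -d "$d" ] || continue
    # Directories without mp4 files simply contribute nothing.
    find "$d" -name '*.mp4' | shuf -n "$per_dir"
  done
}

# Example call (the cap of 20 is an arbitrary illustration):
# sample_batch /llc_data/clams/wgbh 20 > batch_candidates.txt
```

A fixed cap treats every collection equally regardless of its size; a proportional quota would be an alternative if the batch should reflect the overall distribution.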

keighrim commented 9 months ago

One additional (completely optional) characteristic of the videos that we might want to consider in the sampling process is the production/air period. The PBCore information is accessible at https://americanarchive.org/catalog/cpb-aacip_xxx-yyyyyyy.pbcore (note that the address has a _ in place of the - right after cpb-aacip, and ends with the .pbcore extension).
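As a small illustration of the address pattern above, here is a sketch that builds the PBCore URL for one identifier. The function name `pbcore_url` is hypothetical, and the sketch assumes the identifier on hand is written with hyphens only (cpb-aacip-xxx-yyyyyyy); an identifier that already contains the underscore passes through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the PBCore address for one identifier.
# Assumption: the input looks like cpb-aacip-xxx-yyyyyyy (hyphens only).
pbcore_url() {
  local guid="$1"
  # Replace the hyphen right after the "cpb-aacip" prefix with an underscore.
  local id="${guid/#cpb-aacip-/cpb-aacip_}"
  printf 'https://americanarchive.org/catalog/%s.pbcore\n' "$id"
}

# Placeholder identifier taken from the pattern in the comment above.
pbcore_url "cpb-aacip-xxx-yyyyyyy"
```

The resulting address could then be passed to a tool such as `curl -s` to download the record and look up the production/air period.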

keighrim commented 9 months ago

Fixed via #45.