keighrim opened 5 months ago
@keighrim -- My approach for this was actually to work backwards from `TextDocument` -> `TP` -> `TimeFrame`, since (I believe) a `TextDocument` can only have one `TP`, but a `TimeFrame` might have multiple `TP` representatives.

For context, this is the function I used to set up the annotations (it pulls out some extra info that the app won't need, like the confidence score or the actual `timePoint` value). Basically, I rely on a dictionary to map each representative `TP`'s id to its `TimeFrame`, and then use the `Alignment` for each `TextDocument` to find its `TP` and do the lookup. But I'm not sure how this would work inside the app. I hard-coded the view ids here, but they should be dynamic.
```python
import json
import pathlib
from typing import List, Tuple

from tqdm import tqdm
from mmif import Mmif, AnnotationTypes


def gather_ocr_data(data_dir: str) -> List[Tuple[str, int, str, float, str]]:
    """
    Takes a directory of MMIF files with views from SWT and DocTR.
    Iterates over each TextDocument in the DocTR view, and obtains the corresponding SWT label via Alignments.
    Returns a list of tuples, where each tuple contains the guid, timepoint, scene, confidence, and ocr text.
    :param data_dir: directory containing mmif files
    :return: list of tuples in the form [(guid, timepoint, scene, confidence, ocr text), ...]
    """
    path = pathlib.Path(data_dir)
    outputs = []
    for filename in tqdm(list(path.glob('*.mmif'))):
        with open(filename, 'r') as f:
            curr_mmif = json.load(f)
            curr_mmif = Mmif(curr_mmif)
        guid = filename.stem.split('.')[0]
        # grab the necessary views
        # in this batch, swt is in 'v_0', doctr chyrons are in 'v_2', doctr credits are in 'v_3'
        # TODO: Figure out an app-agnostic way of doing this?
        swt_view = curr_mmif.get_view_by_id('v_0')
        doctr_view = curr_mmif.get_view_by_id('v_3')
        timeframes = swt_view.get_annotations(at_type=AnnotationTypes.TimeFrame)
        # map tp representative to timeFrame annotation
        timepoints2frames = {tp_rep: tf for tf in timeframes for tp_rep in tf.get('representatives')}
        # map tp id to timePoint value
        timepoints = list(swt_view.get_annotations(at_type=AnnotationTypes.TimePoint))
        timepoints = {tp.get('id'): tp.get('timePoint') for tp in timepoints}
        for textdoc in doctr_view.get_documents():
            ocr_text = textdoc.text_value
            td_id = textdoc.id
            # To get the swt label, we need the alignment between textdocument and timepoint id.
            # Then we use that timepoint id to get the timeframe it represents.
            # From the timeframe, get the label and confidence.
            td_alignment = list(doctr_view.get_annotations(AnnotationTypes.Alignment, target=td_id))
            timepoint = td_alignment[0].get('source')  # e.g. "v_0:tp_54"
            tp_id = timepoint.split(':')[1]
            timepoint = timepoints[tp_id]
            scene_label = timepoints2frames[tp_id].get('label')
            confidence = timepoints2frames[tp_id].get('classification')[scene_label]
            outputs.append((guid, timepoint, scene_label, confidence, ocr_text))
            # Ex: (cpb-aacip-526-dj58c9s78v, 1187187, chyron, 0.5742753098408380, Glen Miller)
    return outputs
```
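On the hard-coded view ids: a minimal sketch of how the two views might be located dynamically instead, assuming mmif-python's `get_all_views_contain`; picking the last matching view of each kind is just one possible policy, and `find_views` is a hypothetical helper name.

```python
from mmif import Mmif, AnnotationTypes, DocumentTypes

def find_views(curr_mmif: Mmif):
    """Locate the SWT-like and OCR-like views without hard-coding view ids."""
    # a scene-classification view will declare TimeFrame annotations in its contains metadata
    swt_views = curr_mmif.get_all_views_contain(AnnotationTypes.TimeFrame)
    # an OCR view will declare TextDocument annotations
    ocr_views = curr_mmif.get_all_views_contain(DocumentTypes.TextDocument)
    # simply take the most recent matching view of each kind
    return swt_views[-1], ocr_views[-1]
```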
Documenting some updates on this.

As discussed in a previous meeting, @haydenmccormick and I decided to just take the `Alignment` annotations as input for RFB. Specifically, the app selects all views in the MMIF containing alignments between `TimePoint` and `TextDocument` annotations. DocTR generates such views, and presumably other CLAMS OCR tools will as well. For each alignment, the app retrieves the respective `TimePoint`'s "raw" scene classification label and handles the binning on its own. This bypasses the need to look up the SWT `TimeFrame` represented by that `TimePoint` to retrieve the high-level label.

For the output, as we decided earlier, RFB will generate `TextDocument` annotations, each containing a raw CSV string. Additionally, we decided to have it generate secondary alignments, mapping the source alignment within the OCR view to the target RFB `TextDocument`.
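A minimal sketch of what that output step might look like, assuming the mmif-python `View` helpers (`new_view`, `new_textdocument`, `new_annotation`); `csv_string` and `src_alignment_id` are hypothetical placeholders for values the app computes elsewhere.

```python
from mmif import Mmif, AnnotationTypes, DocumentTypes

def write_rfb_output(mmif: Mmif, src_alignment_id: str, csv_string: str) -> None:
    """Add one RFB TextDocument (raw CSV) plus a secondary Alignment to a new view."""
    new_view = mmif.new_view()
    new_view.new_contain(DocumentTypes.TextDocument)
    new_view.new_contain(AnnotationTypes.Alignment)
    # the parsed role-filler pairs, serialized as a raw CSV string
    td = new_view.new_textdocument(csv_string)
    # secondary alignment: source is the OCR view's TP<->TD Alignment,
    # target is the newly created RFB TextDocument
    new_view.new_annotation(AnnotationTypes.Alignment,
                            source=src_alignment_id,
                            target=f'{new_view.id}:{td.id}')
```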
@keighrim --
We briefly mentioned the eventual inclusion of a runtime parameter that allows the user to define their own labelmap for the `TimePoint`s. One important thing to note is that the current RFB model/parser will be somewhat inflexible for this. We made the assumption that the SWT scene label for a credits frame would always be `"credits"` (not `"credit"`), and that the label for a chyron frame would always be `"chyron"`. (These labels are prepended to each OCR string for contextual information, but are always classified as "O" during NER.)

If a user of the swt-ocr-rfb pipeline wants to use a different custom label set, then there's more potential for the NER model to give unexpected results, and the parser would definitely fail. It's not an immediate concern for us, but in the future these components would need to be reworked if that extra flexibility is needed.
ATM I can think of a very simple trick to make the app a little bit more flexible. How about making the app run in only one mode at a time, where the mode can be either `chyron` or `credits`, and adding an additional parameter to filter the labels from the underlying classification annotations?

For example, a user could call the app with parameters like this:

```sh
cat output-from-swt-22-raw-singleletter-labels.mmif | curl -d@- "rfb.server:5000?mode=chyron&label=C&label=Y"
```
And in the app code,
```python
def _annotate(self, mmif, **parameters):
    # do stuff
    # then retrieve relevant annotations, with the new alignment caching
    for view in mmif.get_all_views_contain(AnnotationTypes.TimePoint):
        for tp_ann in view.get_annotations(AnnotationTypes.TimePoint):
            if tp_ann.get_property('label') in parameters['label']:
                for aligned in tp_ann.get_all_aligned():
                    if aligned.at_type == DocumentTypes.TextDocument:
                        lm_input = prepare_bert_input(parameters['mode'], aligned.text_value)
                        # do more stuff with LM
    ...
```
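For completeness, the two runtime parameters from the example above could be declared in the app metadata roughly like this; a sketch assuming clams-python's `AppMetadata.add_parameter`, where the parameter names mirror the curl example and the `choices`/`multivalued` constraints are my assumptions.

```python
from clams.appmetadata import AppMetadata

def append_rfb_parameters(metadata: AppMetadata) -> AppMetadata:
    # one mode per invocation: either chyron or credits
    metadata.add_parameter(name='mode', type='string', choices=['chyron', 'credits'],
                           description='which RFB model/parser to run')
    # raw TimePoint classification labels to keep (repeatable, e.g. label=C&label=Y)
    metadata.add_parameter(name='label', type='string', multivalued=True,
                           description='raw scene labels used to filter the underlying classification annotations')
    return metadata
```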
This is obviously just a stopgap, and you're right that we need some rework in the future to achieve real flexibility. But the basic direction here successfully decouples the label set used for finetuning BERT (from TF's SWT v4 era) and the label set used in TP classification in SWT.
One more possible problem with taking the `Alignment` as the de facto only required input type: not all `TP`s (and `Alignment`s) end up in a `TF` after the "stitching" process. Hence, when RFB processes all the `TP`s that are aligned to a `TD`, it might over-generate RFB outputs for images we are not interested in.

This is a relatively minor issue, and we can deal with it by cross-filtering in a post-process. But I think something like this can be a bit safer approach:
```python
def _annotate(self, mmif, **parameters):
    # do stuff
    # then retrieve relevant annotations, with the new alignment caching
    for view in mmif.get_all_views_contain(AnnotationTypes.TimeFrame):
        for tf_ann in view.get_annotations(AnnotationTypes.TimeFrame):
            # note that here we use "postbinned" label names, not the single-letter raw labels
            if tf_ann.get_property('label') in parameters['label']:
                for tp_ann in [mmif[rep_id] for rep_id in tf_ann.get_property('representatives')]:
                    for aligned in tp_ann.get_all_aligned():
                        if aligned.at_type == DocumentTypes.TextDocument:
                            lm_input = prepare_bert_input(parameters['mode'], aligned.text_value)
                            # do more stuff with LM
    ...
```
> not all `TP`s (and `Alignment`s) end up in a `TF` after the "stitching" process. Hence when RFB processes all the `TP`s that are aligned to a `TD`, it might over-generate RFB outputs for images we are not interested in.
I might have misunderstood, but I thought docTR only selects the `TP`(s) that are representatives of a `TF`. Wouldn't all of the `Alignment` annotations be between a representative `TP` and a `TextDocument`? Or can OCR apps potentially operate over non-representative `TP`(s)?
Yeah, you're right. When an OCR app creates its `TD` annotations, it will always pick "relevant" `TP` annotations only.
Okay, so supposing I try changing the app logic following your suggestion, would the input to RFB just be `AnnotationTypes.TimeFrame`?

Should we then change the `Alignment` produced by RFB to be source `TimeFrame` -> target `TextDocument`? You didn't seem to be a big fan of the first idea (source `Alignment` -> target `TextDocument`).
Based on the model input (https://github.com/clamsproject/app-role-filler-binder-old/issues/3), the RFB app would expect two pieces of information.

The current target pipeline that RFB will use as upstream is SWT-docTR. A MMIF output from that pipeline will have two views (swt view, ocr view) containing these annotation objects:

- `TimePoint` (swt view): holds the time point and the "raw" scene classification label (`C`, `S`, ...)
- `TimeFrame` (swt view): holds the time-wise start/end and representative timepoints, plus the "binned" label (`slate`, `credits`, ...). These binned labels are the labels that RFB will use.
- `TextDocument` (ocr view): holds the text contents
- `Alignment` (ocr view): anchors the `TextDocument`s back to the `TimePoint`s

Given that, what the RFB app should specify as input types are:

- `TimeFrame`, where the scene types are recorded
- `TextDocument`, where the text contents are recorded
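A minimal sketch of how that input specification might be declared, assuming clams-python's `AppMetadata.add_input` helper; the function name `append_rfb_inputs` is just for illustration.

```python
from clams.appmetadata import AppMetadata
from mmif import AnnotationTypes, DocumentTypes

def append_rfb_inputs(metadata: AppMetadata) -> AppMetadata:
    # TimeFrame annotations carry the binned scene label and the representative TimePoints
    metadata.add_input(AnnotationTypes.TimeFrame)
    # TextDocument annotations carry the OCR text contents
    metadata.add_input(DocumentTypes.TextDocument)
    return metadata
```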
Then internally, the app looks for `TimeFrame` annotations first, grabs all the `TextDocument`s "aligned" to the frame, and aggregates the necessary information from the two annotation types to perform the inference.

The tricky part here is that there is no explicit `Alignment` annotation between `TimeFrame` and `TextDocument`; instead we have a `TimeFrame` with a `representatives` attribute that pseudo-aligns a `TF` and a `TP`, and then there are explicit `Alignment` annotations between `TP` and `TD`.
@kelleyl you also mentioned a very relevant problem with the llava captioner app. Do you have anything else to add to the problem description?