clamsproject / app-role-filler-binder

Apache License 2.0

input MMIF spec for RFB app #2

Open keighrim opened 1 month ago

keighrim commented 1 month ago

Based on the model input (https://github.com/clamsproject/app-role-filler-binder-old/issues/3), the RFB app would expect two pieces of information:

  1. scene type
  2. text document

The current target pipeline that RFB will use as upstream is SWT-docTR. A MMIF output from that pipeline will have two views (swt view, ocr view) containing these annotation objects:

  1. TimePoint (swt view): holds a time point and the "raw" scene classification label (C, S, ...)
  2. TimeFrame (swt view): holds the time-wise start/end, the representative timepoints, and a "binned" label (slate, credits, ...). These binned labels are the labels that RFB will use.
  3. TextDocument (ocr view): holds the text contents
  4. Alignment (ocr view): anchors the TextDocuments back to the TimePoints.

Given that, what the RFB app should specify as input types are

  1. TimeFrame, where the scene types are recorded
  2. TextDocument, where the text contents are recorded

Then internally, the app looks for TimeFrame annotations first, grabs all the TextDocuments "aligned" to each frame, and aggregates the necessary information from the two annotation types to perform the inference.

The tricky part here is that there is no explicit Alignment annotation between TimeFrame and TextDocument; instead, the TimeFrame's representatives attribute pseudo-aligns a TF to TPs, and then there are explicit Alignment annotations between TPs and TDs.
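
To make the chain concrete, the relevant pieces look roughly like this (heavily abbreviated and paraphrased as Python literals; the ids and label values are made up, but the property names follow the MMIF vocabulary as used above):

    # swt view (e.g. v_0)
    timeframe = {'@type': 'TimeFrame', 'id': 'tf_1', 'label': 'chyron',
                 'representatives': ['tp_54']}                 # pseudo-alignment: TF -> TP
    timepoint = {'@type': 'TimePoint', 'id': 'tp_54', 'timePoint': 1187187, 'label': 'I'}  # raw single-letter label
    # ocr view (e.g. v_2)
    textdoc   = {'@type': 'TextDocument', 'id': 'td_1', 'text': 'Glen Miller'}
    alignment = {'@type': 'Alignment', 'id': 'a_1',
                 'source': 'v_0:tp_54', 'target': 'td_1'}      # explicit alignment: TP -> TD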

@kelleyl you also mentioned a very relevant problem with the llava captioner app. Do you have anything else to add to the problem description?

wricketts commented 1 month ago

@keighrim -- My approach for this was actually to work backwards from TextDocument -> TP -> TimeFrame, since (I believe) a TextDocument can only have one TP, but a TimeFrame might have multiple TP representatives.

For context, this is the function I used to set up the annotations (it pulls out some extra info that the app won't need, like the confidence score or the actual TimePoint value). Basically, I rely on a dictionary to map each representative TP's id to its TimeFrame, and then use the Alignment for each TextDocument to find its TP and do the lookup.

But I'm not sure how this would work inside the app. I hard-coded the view ids here, but they should be resolved dynamically.

import json
import pathlib
from typing import List, Tuple

from tqdm import tqdm
from mmif import Mmif, AnnotationTypes


def gather_ocr_data(data_dir: str) -> List[Tuple[str, int, str, float, str]]:
    """
    Takes a directory of mmif files with views from SWT and DocTR.
    Iterates over each TextDocument in the DocTR view, and obtains the corresponding SWT label via Alignments.
    Returns a list of tuples, where each tuple contains the guid, timepoint, scene, confidence, and ocr text.

    :param data_dir: directory containing mmif files
    :return: list of tuples in the form [(guid, timepoint, scene, confidence, ocr text), ...]
    """
    path = pathlib.Path(data_dir)
    outputs = []
    for filename in tqdm(list(path.glob('*.mmif'))):
        with open(filename, 'r') as f:
            curr_mmif = json.load(f)
            curr_mmif = Mmif(curr_mmif)
        guid = filename.stem.split('.')[0]

        # grab the necessary views
        # in this batch, swt is in 'v_0', doctr chyrons are in 'v_2', doctr credits are in 'v_3'
        # TODO: Figure out an app-agnostic way of doing this?

        swt_view = curr_mmif.get_view_by_id('v_0')
        doctr_view = curr_mmif.get_view_by_id('v_3')
        timeframes = swt_view.get_annotations(at_type=AnnotationTypes.TimeFrame)

        # map tp representative to timeFrame annotation
        timepoints2frames = {tp_rep: tf for tf in timeframes for tp_rep in tf.get('representatives')}

        # map tp id to timePoint value
        timepoints = list(swt_view.get_annotations(at_type=AnnotationTypes.TimePoint))
        timepoints = {tp.get('id'): tp.get('timePoint') for tp in timepoints}

        for textdoc in doctr_view.get_documents():
            ocr_text = textdoc.text_value
            td_id = textdoc.id

            # To get the swt label, we need the alignment between textdocument and timepoint id.
            # Then we use that timepoint id to get the timeframe it represents.
            # From the timeframe, get the label and confidence.

            td_alignment = list(doctr_view.get_annotations(AnnotationTypes.Alignment, target=td_id))
            timepoint = td_alignment[0].get('source')  # e.g. "v_0:tp_54"
            tp_id = timepoint.split(':')[1]
            timepoint = timepoints[tp_id]
            scene_label = timepoints2frames[tp_id].get('label')
            confidence = timepoints2frames[tp_id].get('classification')[scene_label]
            outputs.append((guid, timepoint, scene_label, confidence, ocr_text))
            # Ex: (cpb-aacip-526-dj58c9s78v, 1187187, chyron, 0.5742753098408380, Glen Miller)
    return outputs
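
For reference, calling it over a batch directory would look something like this (the path is just a placeholder):

    for guid, timepoint, scene, confidence, text in gather_ocr_data('/path/to/mmif-batch'):
        print(guid, timepoint, scene, confidence, text)
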
wricketts commented 1 month ago

Documenting some updates on this.

As discussed in a previous meeting, @haydenmccormick and I decided to just take the Alignment annotations as input for RFB. Specifically, the app selects all views in the MMIF containing alignments between TimePoint and TextDocument annotations. DocTR generates such views, and presumably other CLAMS OCR tools will as well. For each alignment, the app retrieves the respective TimePoint's "raw" scene classification label and handles the binning on its own. This bypasses the need to look up the SWT TimeFrame represented by that TimePoint to retrieve the high-level label.
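
In rough pseudocode, that selection logic amounts to something like this (just a sketch; iter_scene_texts and the RAW_TO_BIN mapping are illustrative stand-ins, not the app's actual code):

    from mmif import Mmif, AnnotationTypes, DocumentTypes

    RAW_TO_BIN = {'I': 'chyron', 'Y': 'chyron', 'C': 'credits'}   # placeholder; the real binning is defined in-app

    def iter_scene_texts(mmif: Mmif):
        """Sketch: in views that align TimePoints to TextDocuments, bin the raw TP label in-app."""
        for view in mmif.get_all_views_contain(AnnotationTypes.Alignment):
            for td in view.get_documents():                  # TextDocuments created by the OCR app
                for aligned in td.get_all_aligned():
                    if aligned.at_type == AnnotationTypes.TimePoint:
                        raw_label = aligned.get('label')     # "raw" single-letter label from SWT
                        yield RAW_TO_BIN.get(raw_label), td.text_value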

For the output, as we decided earlier, RFB will generate TextDocument annotations, each containing a raw CSV string. Additionally, we decided to have it generate secondary alignments, mapping the source Alignment within the OCR view to the target RFB TextDocument.
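
And the output side amounts to something like this (again just a sketch; add_rfb_output and csv_string are stand-in names, and rfb_view is assumed to be the new view the app creates and signs via the SDK):

    from mmif import AnnotationTypes

    def add_rfb_output(rfb_view, ocr_alignment, csv_string):
        """Sketch: one TextDocument holding the raw CSV string, plus the secondary Alignment."""
        td = rfb_view.new_textdocument(csv_string)
        # secondary alignment: source = the OCR view's TimePoint<->TextDocument Alignment,
        # target = the new RFB TextDocument
        rfb_view.new_annotation(AnnotationTypes.Alignment,
                                source=ocr_alignment.long_id, target=td.long_id)
        return td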

wricketts commented 1 month ago

@keighrim --

We briefly mentioned the eventual inclusion of a runtime parameter that allows the user to define their own labelmap for the TimePoints. One important thing to note is that the current RFB model/parser will be somewhat inflexible about this. We made the assumption that the SWT scene label for a credits frame would always be "credits" (not "credit"), and that the label for a chyron frame would always be "chyron". (These labels are prepended to each OCR string for contextual information, but always classified as "O" during NER.)
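
For illustration only (the exact formatting is an internal preprocessing detail, so treat the string construction here as hypothetical):

    scene_label = 'chyron'                      # must currently be exactly "chyron" or "credits"
    ocr_text = 'Glen Miller'
    model_input = f'{scene_label} {ocr_text}'   # label prepended for context; always tagged "O" by the NER model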

If a user of the swt-ocr-rfb pipeline wants to use a different custom labelset, then there's more potential for the NER model to give unexpected results, and the parser would definitely fail. It's not an immediate concern for us, but in the future these components would need to be reworked if that extra flexibility is needed.

keighrim commented 1 month ago

ATM I can think of a very simple trick to make the app a little more flexible. How about making the app run in only one mode at a time, where the mode can be either chyron or credits, and adding an additional parameter to filter the labels from the underlying classification annotations?

For example, a user can call the app with parameters like this:

cat output-from-swt-22-raw-singleletter-labels.mmif | curl -d@- "rfb.server:5000?mode=chyron&label=C&label=Y"
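
On the metadata side, those runtime parameters would presumably be declared along these lines (a sketch assuming the clams-python add_parameter helper; names, types, and descriptions are illustrative):

    from clams.appmetadata import AppMetadata

    def add_rfb_parameters(metadata: AppMetadata) -> AppMetadata:
        # sketch: declare the two runtime parameters used in the curl example above
        metadata.add_parameter(name='mode', type='string',
                               choices=['chyron', 'credits'],
                               description='scene type to process in a single run')
        metadata.add_parameter(name='label', type='string', multivalued=True,
                               description='raw TimePoint labels to pick up from the upstream classification')
        return metadata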

And in the app code,

    def _annotate(self, mmif, **parameters):
        # do stuff
        # then retrieve relevant annotations, with the new alignment caching
        for view in mmif.get_all_views_contain(AnnotationTypes.TimePoint):
            for tp_ann in view.get_annotations(AnnotationTypes.TimePoint):
                if tp_ann.get_property('label') in parameters['label']:
                    for aligned in tp_ann.get_all_aligned():
                        if aligned.at_type == DocumentTypes.TextDocument:
                            lm_input = prepare_bert_input(parameters['mode'], aligned.text_value)
                            # do more stuff with LM
        ...

This is obviously just a stopgap, and you're right that we need some rework in the future to achieve real flexibility. But the basic direction here successfully decouples the label set used for finetuning BERT (from TFs of the SWT v4 era) and the label set used for TP classification in SWT.

keighrim commented 1 month ago

One more possible problem with taking the Alignment as the de facto only required input type: not all TPs (and Alignments) end up in a TF after the "stitching" process. Hence when RFB processes all the TPs that are aligned to a TD, it might over-generate RFB outputs for images we're not interested in.

This is a relatively minor issue, and we can deal with it by cross-filtering in a post-process. But I think something like this would be a bit safer:

    def _annotate(self, mmif, **parameters):
        # do stuff
        # then retrieve relevant annotations, with the new alignment caching
        for view in mmif.get_all_views_contain(AnnotationTypes.TimeFrame):
            for tf_ann in view.get_annotations(AnnotationTypes.TimeFrame):
                if tf_ann.get_property('label') in parameters['label']:   # note that here we use "post-binned" label names, not the single-letter raw labels
                    for tp_ann in [mmif[rep_id] for rep_id in tf_ann.get_property('representatives')]:
                        for aligned in tp_ann.get_all_aligned():
                            if aligned.at_type == DocumentTypes.TextDocument:
                                lm_input = prepare_bert_input(parameters['mode'], aligned.text_value)
                                # do more stuff with LM
        ...

wricketts commented 1 month ago

not all TPs (and Alignments) end up in a TF after the "stitching" process. Hence when RFB processes all the TPs that are aligned to a TD, it might over-generate RFB outputs for images we're not interested in.

I might have misunderstood, but I thought docTR only selects the TP(s) that are representatives of a TF. Wouldn't all of the Alignment annotations be between a representative TP and a TextDocument? Or can OCR apps potentially operate over non-representative TP(s)?

keighrim commented 1 month ago

Yeah, you're right. When an OCR app creates its TD annotations, it will always pick "relevant" TP annotations only.

wricketts commented 1 month ago

Okay, so supposing I try changing the app logic following your suggestion, would the input to RFB just be AnnotationTypes.TimeFrame?

Should we then change the Alignment produced by RFB to be source TimeFrame -> target TextDocument? You didn't seem to be a big fan of the first idea (source Alignment -> target TextDocument).