
RFB as a TR (or OCR) consumer #4


keighrim commented 2 months ago

This is more like a status tracking thread, rather than an issue.


RFB as it stands now is the primary consumer of text recognition (TR, or optical character recognition, OCR) components, and since there is more than one TR app in the CLAMS appdir, I felt it'd be nice to gather the relevant information in one place, especially regarding the I/O relation between TR apps and RFB (and future TR consumers).

CLAMS TR Apps

And then we also have apps that work similarly but are not conventional TR apps

Input specs of the TR apps

Possible upstream scenarios

  1. start from a blank state (no upstream): the TR app should go through the entire video at a certain sample rate and transcribe all the extracted frame images.
  2. start from point-wise scene type recognition (e.g., SWT): the TR app should pick all the relevant TimePoint annotations (relevant TP labels should be passed as a runtime parameter) to transcribe (a sketch of this case appears after the TODO below).
  3. start from interval-wise scene type recognition (e.g., SWT + stitcher): the TR app should pick a "representative" frame (or a set of frames) for each TimeFrame annotation (again, relevant TF labels should be passed as a runtime parameter) and transcribe all representatives.

Then optionally, it can start from any of the above pipelines plus a text localization (TL) app (e.g., EAST):

  1. blank + TL
  2. TP + TL
  3. TF + TL

TODO: assess the current situation with TL, and decide whether using TL actually makes sense, considering the cost and gains.
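
To make scenario 2 concrete, here is a minimal sketch of the TimePoint-picking step. It operates on the raw MMIF JSON (a real CLAMS app would use mmif-python instead) and assumes SWT-style TimePoint annotations carry a `label` property; the label set is purely hypothetical and stands in for the runtime parameter mentioned above.

```python
import json

# hypothetical label set; in a real app this comes from a runtime parameter
RELEVANT_LABELS = {"slate", "chyron", "credits"}

def pick_timepoints(mmif_path, relevant_labels=RELEVANT_LABELS):
    """Scenario 2: collect TimePoint annotations whose label is relevant."""
    with open(mmif_path) as f:
        mmif = json.load(f)
    picked = []
    for view in mmif.get("views", []):
        for ann in view.get("annotations", []):
            # "@type" values are vocabulary URIs like ".../TimePoint/v4";
            # match on the type name to stay agnostic about vocab versions
            if ann["@type"].rsplit("/", 2)[-2] == "TimePoint":
                props = ann.get("properties", {})
                if props.get("label") in relevant_labels:
                    picked.append((view["id"], props))
    return picked
```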

Output specs of the TR apps

In general, all TR/OCR apps should return, at minimum,

And the "td" annotation should cover the entire text content in a single image, and the "bb-top" annotation should draw an axis-aligned rectangle that covers the entire text region.
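
For illustration, here is one image's worth of that minimal output, written as a Python literal mirroring the MMIF view structure. This is a sketch only: the vocabulary version suffixes, the two-corner coordinates format, and the `timePoint` anchoring property are assumptions, not normative.

```python
# a minimal sketch of the per-image output: one "td", one "bb-top", and an
# Alignment tying them together; ids and values are made up
minimal_annotations = [
    {
        "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
        "properties": {
            "id": "td1",
            # the entire text content of the image, lines/blocks rendered
            # with newline characters (see the note further below)
            "text": {"@value": "BREAKING NEWS\nCity Council approves budget"},
        },
    },
    {
        "@type": "http://mmif.clams.ai/vocabulary/BoundingBox/v1",
        "properties": {
            "id": "bb1",  # the "bb-top" box over the entire text region
            "coordinates": [[120, 40], [1160, 210]],  # assumed two-corner format
            "timePoint": 15000,  # assumed anchor back to the source frame
        },
    },
    {
        "@type": "http://mmif.clams.ai/vocabulary/Alignment/v1",
        "properties": {"id": "al1", "source": "bb1", "target": "td1"},
    },
]
```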

And then if the text localization (TL) feature of the underlying TR engine is capable of returning element-wise bounding boxes (words, lines, blocks, ...)

Note that even without the lower-level bounding boxes, the top-level TextDocument should "render" the line and block information using newline characters. So the lower-level bounding box annotations only add coordinate information for those secondary "text" annotations, in the hope that the coordinates will be useful for future processing.
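
Here is a minimal sketch of that rendering convention, using a made-up nested blocks/lines structure as stand-in engine output: lines within a block are joined with `"\n"` and blocks with `"\n\n"`, while the per-element boxes would become the separate lower-level BoundingBox annotations.

```python
# hypothetical element-wise engine output: blocks containing lines
ocr_blocks = [
    {"lines": [{"text": "BREAKING NEWS", "box": [120, 40, 620, 90]}]},
    {"lines": [{"text": "City Council", "box": [120, 120, 480, 160]},
               {"text": "approves budget", "box": [120, 170, 560, 210]}]},
]

def render_text(blocks):
    # lines joined by single newlines, blocks separated by blank lines
    return "\n\n".join("\n".join(line["text"] for line in b["lines"])
                       for b in blocks)

assert render_text(ocr_blocks) == "BREAKING NEWS\n\nCity Council\napproves budget"
```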

How RFB handles (or should handle) various TR outputs

Assuming that all TR apps output at least the "minimal" output types, RFB (or any other TR consumer, including issues like https://github.com/clamsproject/aapb-evaluations/issues/52) should be able to "grab" the correct view by searching for one that contains the TextDocument, BoundingBox, and Alignment types, since the RFB model (as it's implemented now) doesn't care about "visual" features like the coordinates of lines or words.
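
A sketch of that view-grabbing step, again over the raw MMIF JSON (mmif-python provides view-search helpers, but matching on type-name suffixes keeps the example self-contained and vocabulary-version-agnostic); it returns the last matching view, on the assumption that newer views are appended later.

```python
def find_tr_view(mmif_dict):
    """Return the newest view whose "contains" metadata lists all three
    minimal TR output types, or None if no view qualifies."""
    wanted = {"TextDocument", "BoundingBox", "Alignment"}
    for view in reversed(mmif_dict.get("views", [])):
        contains = view.get("metadata", {}).get("contains", {})
        type_names = {uri.rsplit("/", 2)[-2] for uri in contains}
        if wanted <= type_names:
            return view
    return None
```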

However, since the current model relies on the scene type as part of the input string, we still need a way to grab the scene type label by going further down the alignment chains and pulling out the SWT (or similar) TimeFrame annotations. This is where the "input" spec of the TR apps becomes relevant to this problem.
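
A sketch of that chain-walking step: Alignment annotations are treated as undirected edges, and a breadth-first walk from the TextDocument stops at the first TimePoint or TimeFrame it reaches. It assumes globally unique annotation ids (real MMIF may prefix ids with view ids) and an SWT-style `label` property.

```python
from collections import defaultdict, deque

def scene_label_for(td_id, mmif_dict):
    """Follow alignment chains from a TextDocument to the time annotation
    it is (transitively) anchored to, and return that annotation's label."""
    anns, neighbors = {}, defaultdict(set)
    for view in mmif_dict.get("views", []):
        for ann in view.get("annotations", []):
            props = ann.get("properties", {})
            anns[props.get("id")] = ann
            if ann["@type"].rsplit("/", 2)[-2] == "Alignment":
                # treat each alignment as an undirected edge between ids
                neighbors[props["source"]].add(props["target"])
                neighbors[props["target"]].add(props["source"])
    queue, seen = deque([td_id]), {td_id}
    while queue:
        cur = queue.popleft()
        ann = anns.get(cur)
        if ann and ann["@type"].rsplit("/", 2)[-2] in ("TimePoint", "TimeFrame"):
            return ann.get("properties", {}).get("label")
        for nxt in neighbors[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return None
```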

Future plan

As raised in https://github.com/clamsproject/mmif-visualizer/issues/41 and https://github.com/clamsproject/aapb-evaluations/issues/35, we'd like to have a concept of app groups (or app patterns) that define similar (if not identical) I/O patterns for apps that do the same kind of information extraction/transformation.

And this issue is meant to start that pattern-definition attempt with the existing apps.