This is more of a status-tracking thread than an issue.
RFB, as it stands now, is the primary consumer of text recognition (TR, or optical character recognition) components, and since there is more than one TR app in the CLAMS appdir, I felt it would be nice to have a single place for the relevant information, especially regarding the I/O relation between TR apps and RFB (and future TR consumers).
CLAMS TR Apps
And then we also have similarly working, but not conventional TR, apps.
Input specs of the TR apps
Possible upstream scenarios
start from a blank slate (no upstream): the TR app should go through the entire video at a fixed sample rate and transcribe all the extracted frame images.
start from point-wise scene type recognition (e.g., SWT): the TR app should pick all the relevant TimePoint annotations (relevant TP labels should be passed as a runtime parameter) and transcribe them (see the sketch after this list).
start from interval-wise scene type recognition (e.g., SWT + stitcher): the TR app should pick a "representative" frame (or a set of frames) for each relevant TimeFrame annotation (again, relevant TF labels should be passed as a runtime parameter) and transcribe all the representatives.
Then, optionally, it can start from any of the above pipelines plus a text localization (TL) app (e.g., EAST):
blank + TL
TP + TL
TF + TL
TODO: assess the current situation with TL, and decide whether using TL actually makes sense, considering the cost and gains.
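To make the scenario selection concrete, here is a minimal sketch of how a TR app might pick its transcription targets with the mmif-python SDK. The `collect_targets` helper and its `labels` parameter are hypothetical, and the exact label-property access may differ between upstream app versions.

```python
from mmif import Mmif
from mmif.vocabulary import AnnotationTypes


def collect_targets(mmif: Mmif, labels: set):
    """Pick upstream annotations to transcribe, preferring interval-wise
    (TimeFrame) over point-wise (TimePoint) input."""
    for at_type in (AnnotationTypes.TimeFrame, AnnotationTypes.TimePoint):
        views = mmif.get_all_views_contain(at_type)
        if views:
            # use the most recent view carrying this annotation type
            return [ann for ann in views[-1].get_annotations(at_type)
                    if ann.get_property('label') in labels]
    # blank slate: no upstream, so the caller samples frames at a fixed
    # rate over the whole video instead
    return []
```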
Output specs of the TR apps
In general, all TR/OCR apps should return, at minimum:
TextDocument (td)
BoundingBox (bb-top)
Alignment (a-top, between td and bb-top)
And the "td" annotation should cover the entire text content in a single image, and "bb-top" annotation should draw a axis-aligned rectangle that covers entire the text region.
And then, if the text localization (TL) feature of the underlying TR engine is capable of returning element-wise bounding boxes (words, lines, blocks, ...), it should also return:
Paragraph (for TL blocks), Sentence (for TL lines), Token (for TL words) (as ling-lower annotations)
BoundingBox (bb-lower)
Alignment (a-lower, between bb-lower and ling-lower annotations)
Note that even without the lower-level bounding boxes, the top-level TextDocument should "render" the line and block information using newline characters. So the lower-level bounding box annotations only add coordinate information for those secondary "text" annotations, in the hope that the coordinates are useful for future processing.
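For example, both the newline rendering and the top-level box can be derived from element-wise TL results; a minimal, SDK-free sketch, where the line-level `(text, box)` pairs are hypothetical engine output:

```python
def render_text_and_top_box(lines):
    """lines: list of (text, (x1, y1, x2, y2)) pairs in reading order,
    one per detected text line."""
    # "render" the line structure with newline characters
    text = '\n'.join(t for t, _ in lines)
    # bb-top is the axis-aligned union of the element-wise boxes
    x1 = min(b[0] for _, b in lines)
    y1 = min(b[1] for _, b in lines)
    x2 = max(b[2] for _, b in lines)
    y2 = max(b[3] for _, b in lines)
    return text, (x1, y1, x2, y2)
```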
How RFB handles (or should handle) various TR outputs
Assuming that all TR apps output at least the "minimal" output types, RFB (or any other TR consumer, including issues like https://github.com/clamsproject/aapb-evaluations/issues/52) should be able to "grab" the correct view by searching for one that contains the TextDocument, BoundingBox, and Alignment types, since the RFB model (as it's implemented now) doesn't care about "visual" features like the coordinates of lines or words.
However, since the current model relies on the scene type as part of its input string, we still need a way to grab the scene type label by going further down the alignment chains and pulling out the SWT (or similar) TimeFrame annotations (a sketch of that chain walking follows below). This is where the "input" spec of the TR apps becomes relevant to this problem.
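A hedged sketch of such a consumer, assuming a recent mmif-python where `Mmif` supports annotation lookup by id indexing and annotations expose `long_id`; mismatches between short and long annotation id forms across apps are glossed over here:

```python
from mmif import Mmif
from mmif.vocabulary import AnnotationTypes, DocumentTypes


def find_tr_view(mmif: Mmif):
    """Grab the most recent view carrying the minimal TR output triple."""
    views = mmif.get_all_views_contain(DocumentTypes.TextDocument,
                                       AnnotationTypes.BoundingBox,
                                       AnnotationTypes.Alignment)
    return views[-1] if views else None


def scene_label_for(mmif: Mmif, td):
    """Walk Alignment chains outward from a TextDocument until an SWT-like
    TimeFrame or TimePoint is reached, then return its scene-type label."""
    time_types = (AnnotationTypes.TimeFrame, AnnotationTypes.TimePoint)
    seen, queue = set(), [td.long_id]
    while queue:
        cur_id = queue.pop()
        seen.add(cur_id)
        cur = mmif[cur_id]
        if cur.at_type in time_types:
            return cur.get_property('label')
        # scan all views for Alignments touching the current annotation
        for view in mmif.views:
            for al in view.get_annotations(AnnotationTypes.Alignment):
                pair = {al.get_property('source'), al.get_property('target')}
                if cur_id in pair:
                    queue.extend(pair - seen)
    return None  # no SWT-like annotation reachable via alignments
```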
Future plan
As raised in https://github.com/clamsproject/mmif-visualizer/issues/41 and https://github.com/clamsproject/aapb-evaluations/issues/35, we'd like to have a concept of app groups (or app patterns) that define similar (if not identical) I/O patterns for apps that do the same kind of information extraction/transformation.
And this issue is meant to start that pattern-definition attempt with the existing apps.