Shared-Reality-Lab / IMAGE-server

IMAGE project server components

STORY: As a Monarch UX designer, I want to get a demonstration of the structured information that current ML can extract from at least one class of textbook diagrams, so that I can better plan how teachers will need to modify the experiences we can create automatically. #872

Open · jeffbl opened this issue 3 months ago

jeffbl commented 2 months ago

Draft will go to @VenissaCarolQuadros from @AndyBaiMQC by EOD Friday, as discussed in the sprint meeting yesterday.

AndyBaiMQC commented 2 months ago

Workflow

1. **Image segmentation.** We could start with a legacy model as the baseline, i.e. YOLO. Alternatively, there is a new multi-modal approach: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once. SEEM is a SOTA model for multi-modal segmentation. The key here is to segment both the sub-images and the text. For example, in the evolution diagram, we should be able to segment (text, sub-image) tuples such as (["abiogenesis", "3.8 billion years"], <image/>).
2. **Image understanding using LLaVa.** From each grouped segment, we then pass the segmented image along with its texts (or just the image alone) and ask more detailed questions. Note that we could parse the fields from step 1 so that we get a natural order of events after sorting, or have the LLM sort this downstream.
3. **(optional) Named Entity Recognition (NER) as RAG.** This is the next level if we wish to extract more insight from the explanations provided by the LLM. For this, we could use a legacy NER model to extract the key entities and pass them as prompts for iterative question-asking to the LLM (similar to using the model as a RAG). This would be helpful if we wish to explore the contextual connections between the different segments (i.e., establishing a flow between them). A data-flow sketch covering all three steps follows below.
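
A minimal end-to-end sketch of the data flow across these three steps; the `seem_segment`, `llava_ask`, and `ner_extract` helpers are hypothetical placeholders for whatever SEEM, LLaVa, and NER wrappers we end up using, so only the plumbing between the stages is the point here:

```python
def process_diagram(image, seem_segment, llava_ask, ner_extract):
    # 1. Segmentation: extract (text, sub-image) tuples from the diagram.
    segments = seem_segment(image)  # e.g. [{"text": [...], "image": crop}, ...]

    results = []
    for seg in segments:
        # 2. Understanding: ask LLaVa about each segment, using its text as context.
        prompt = f"Explain this part of the diagram. Nearby labels: {seg['text']}"
        explanation = llava_ask(seg["image"], prompt)

        # 3. (optional) NER-as-RAG: pull entities out of the explanation and feed
        # them back as follow-up questions to connect segments into a flow.
        for entity in ner_extract(explanation):
            follow_up = f"How does {entity} relate to the rest of the diagram?"
            explanation += "\n" + llava_ask(image, follow_up)

        results.append({"text": seg["text"], "explanations": explanation})
    return results
```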

JSON Schema: for all three diagrams, the LLaVa understanding component is the same:

```json
{
  "$schema": "",
  "$id": "",
  "type": "object",
  "title": "Segmentation Understanding by LLaVa",
  "description": "",
  "properties": {
    "explanations": {
      "description": "Character string paragraph explaining the inputs from segmentation",
      "type": "string",
      "minLength": 1
    }
  },
  "required": ["explanations"]
}
```
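
For a quick sanity check of LLaVa outputs against this schema, something like the following could work (a minimal sketch using the `jsonschema` package, with the schema abridged from above):

```python
import jsonschema

# The LLaVa-understanding schema from above, abridged ($schema/$id left
# blank as in the draft).
LLAVA_SCHEMA = {
    "type": "object",
    "title": "Segmentation Understanding by LLaVa",
    "properties": {
        "explanations": {"type": "string", "minLength": 1},
    },
    "required": ["explanations"],
}

output = {"explanations": "Abiogenesis occurred roughly 3.8 billion years ago..."}
jsonschema.validate(output, LLAVA_SCHEMA)  # raises ValidationError if malformed
```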

AndyBaiMQC commented 2 months ago

https://llava-vl.github.io/llava-interactive/

This is a very useful tool (our use case is a lot easier)

AndyBaiMQC commented 2 months ago

(Attached sample diagrams: Evolution, Heart, Cell)

AndyBaiMQC commented 2 months ago

The actual schema for the SEEM segmentation tool is similar to what we already have for image segmentation:

```json
{
  "$schema": "",
  "$id": "",
  "type": "object",
  "title": "Segmentation Data",
  "description": "JSON containing pixel coordinates of different segments in a photograph",
  "definitions": {
    "normCoord": {
      "description": "A pair of normalized coordinates.",
      "type": "array",
      "items": { "type": "number", "minimum": 0, "maximum": 1 },
      "minItems": 2,
      "maxItems": 2
    },
    "contour": {
      "type": "object",
      "description": "A single contour of a segment as detected. Each contour is a separate region.",
      "properties": {
        "coordinates": {
          "description": "The normalized coordinates forming the contour as returned by OpenCV.",
          "type": "array",
          "items": { "$ref": "#/definitions/normCoord" }
        },
        "centroid": {
          "allOf": [
            { "$ref": "#/definitions/normCoord" },
            { "description": "The center of the contour with normalized coordinates." }
          ]
        },
        "area": {
          "description": "The area of the image taken by this contour.",
          "type": "number",
          "minimum": 0,
          "maximum": 1
        }
      }
    }
  },
  "properties": {
    "segments": {
      "type": "array",
      "description": "Divides the entire photograph into different segments, where each segment indicates a noteworthy region or structure. For example, a park photograph would be divided into sections that contain grass, sections that contain trees, etc.",
      "items": {
        "type": "object",
        "description": "Contains information about each individual segment in a photograph, such as its area and centroid.",
        "properties": {
          "name": {
            "description": "The name of the segment.",
            "type": "string"
          },
          "contours": {
            "description": "The separate contours making up the segment.",
            "type": "array",
            "items": { "$ref": "#/definitions/contour" }
          },
          "centroid": {
            "allOf": [
              { "$ref": "#/definitions/normCoord" },
              { "description": "The overall centroid of the segment with normalized coordinates." }
            ]
          },
          "area": {
            "description": "Total area occupied by the segment in a given photograph, normalized between 0 and 1.",
            "type": "number",
            "minimum": 0,
            "maximum": 1
          }
        },
        "required": ["name", "contours", "centroid", "area"]
      }
    }
  },
  "required": ["segments"]
}
```
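
As a rough sketch of how raw OpenCV output could be normalized into this format (the `contours_to_segment` helper name is ours, and this assumes `cv2.findContours`-style contours plus known image dimensions):

```python
import cv2

def contours_to_segment(name, contours, img_w, img_h):
    """Convert cv2.findContours output into one schema-conformant segment.

    `name` is the segment label; `contours` is the list returned by
    cv2.findContours; img_w/img_h are the image dimensions in pixels.
    """
    out, total_area = [], 0.0
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] == 0:  # skip degenerate (zero-area) contours
            continue
        area = cv2.contourArea(c) / (img_w * img_h)  # normalized to [0, 1]
        centroid = [m["m10"] / m["m00"] / img_w, m["m01"] / m["m00"] / img_h]
        coords = [[float(x) / img_w, float(y) / img_h]
                  for x, y in c.reshape(-1, 2)]
        out.append({"coordinates": coords, "centroid": centroid, "area": area})
        total_area += area
    if not out:
        raise ValueError("no non-degenerate contours for segment %r" % name)
    # Overall centroid: area-weighted mean of the per-contour centroids.
    cx = sum(c["centroid"][0] * c["area"] for c in out) / total_area
    cy = sum(c["centroid"][1] * c["area"] for c in out) / total_area
    return {"name": name, "contours": out, "centroid": [cx, cy], "area": total_area}
```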

However, there would be further post-processing, so the final usable formats are:

Evolution:

```
[{"text": ["abiogenesis", "3.8 billion years"], "images": [<image/>]}, ..., ...]
```

Cell/Heart:

```
[{"text": [<name/>], "images": [<arrow/>, <image/>]}, ...] -> [{"name": "image"}]
```
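
A sketch of that post-processing, assuming we pair each image segment with its nearest text segments by centroid distance; the nearest-two heuristic and the `text` naming convention are assumptions, not decided behavior:

```python
import math

def group_segments(segments):
    """Pair each image segment with its nearest text segments by centroid distance."""
    texts = [s for s in segments if s["name"].startswith("text")]
    images = [s for s in segments if not s["name"].startswith("text")]

    def dist(a, b):
        (ax, ay), (bx, by) = a["centroid"], b["centroid"]
        return math.hypot(ax - bx, ay - by)

    grouped = []
    for img in images:
        # Take the two closest labels, e.g. a name plus a date in the evolution diagram.
        nearby = sorted(texts, key=lambda t: dist(t, img))[:2]
        grouped.append({"text": [t["name"] for t in nearby], "images": [img]})
    return grouped
```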