Imageomics / Image-Datapalooza-2023

Repository for the Image Datapalooza 2023 event held at OSU in August 2023.

Anatomic images with associated detailed descriptions from taxonomic treatments dataset #4

hlapp opened this issue 11 months ago

hlapp commented 11 months ago

I'd like to work with the Plazi taxonomic treatments dataset, which includes many images with associated anatomical descriptions. However, each image typically contains several subpanels, and likewise the text combines the descriptions for all of the subpanels. I'm hoping to separate these into correctly grouped images and descriptions, and further to link the text to taxonomic names and anatomy ontology concepts.

Originally posted by @balhoff in https://github.com/Imageomics/Image-Datapalooza-2023/issues/3#issuecomment-1670562197

hlapp commented 11 months ago

@balhoff I hope you don't mind me moving this into an issue of its own for further discussion. I know we've looked into this dataset a bit. Can you post links to what's been done so far? I'm also wondering whether we shouldn't compile a full dataset on HF. I forget what we estimated in terms of full dataset size and sub-sampled (for experimentation) dataset size.

nickynicolson commented 11 months ago

Sounds a bit similar to a project I have planned using illustrations and their reference specimens / descriptions: https://github.com/orgs/KewBridge/discussions/1 and https://github.com/KewBridge/specimens2illustrations/

balhoff commented 11 months ago

I'm starting with this RDF dataset: https://github.com/plazi/treatments-rdf

There are about 300,000 figures in the dataset. From some preliminary queries I think there are about 6200 different families with figure data. Here's one representative example: https://dx.doi.org/10.5281/zenodo.5846642

I need to run a SPARQL query to extract a smaller dataset, probably with a maximum of 3 figures from each family. Next I plan to try converting the figure legends into a structured format using OntoGPT SPIRES: https://github.com/monarch-initiative/ontogpt. Then I'll get some advice on how to segment the actual images into subpanels. The goal would be to have the specific text for each subpanel associated with it, as well as to extract mentioned taxonomic names and anatomical structures.
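
For the per-family cap, the sub-sampling step might look roughly like this (a minimal sketch only; it assumes the SPARQL results are already in hand as (family, figure URI) rows, and the names `subsample_by_family` and `MAX_PER_FAMILY` are just illustrative):

```python
from collections import defaultdict

MAX_PER_FAMILY = 3  # cap figures per family for the experimentation subset

def subsample_by_family(rows):
    """rows: iterable of (family, figure_uri) pairs, e.g. parsed from the SPARQL results."""
    kept = defaultdict(list)
    for family, figure_uri in rows:
        if len(kept[family]) < MAX_PER_FAMILY:
            kept[family].append(figure_uri)
    return kept
```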

hlapp commented 11 months ago

@nickynicolson if you copy-pasted the specimens2illustrations link (barring a typo), then the repo is presumably set to private. (It's inaccessible.) Is it possible to make it public?

hlapp commented 11 months ago

I don't want to argue against the SPIRES / OntoGPT route, but for the purpose of making the dataset most suitable for CLIP training, I actually don't think structured subcaptions are necessary.

Rather, for CLIP training the main milestone outcome would, I think, be a dataset where each data item is a pair of one subfigure image (rather than the full figure image consisting of 5-10 subfigure images) and the corresponding subcaption text (rather than the entire caption, which describes not one but many (sub)figure images).
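
To make that concrete, a single data item might look roughly like this (just a sketch; the field names are placeholders rather than a settled schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubfigureRecord:
    image_path: str          # cropped subpanel image file (hypothetical path)
    subcaption: str          # free-text caption covering just this subpanel
    panel_label: str         # e.g. "A"
    source_figure_doi: str   # DOI of the full published figure the subpanel came from
    taxon_name: Optional[str] = None      # taxonomic name, if extracted
    anatomy_terms: Optional[list] = None  # anatomy ontology terms, if extracted
```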

@work4cs @samuelstevens any thoughts or comments from a CLIP model training perspective?

nickynicolson commented 11 months ago

> @nickynicolson if you copy-pasted the specimens2illustrations link (barring a typo), then the repo is presumably set to private. (It's inaccessible.) Is it possible to make it public?

Sorry about that - done now

balhoff commented 11 months ago

> I don't want to argue against the SPIRES / OntoGPT route, but for the purpose of making the dataset most suitable for CLIP training, I actually don't think structured subcaptions are necessary.

> Rather, for CLIP training the main milestone outcome would, I think, be a dataset where each data item is a pair of one subfigure image (rather than the full figure image consisting of 5-10 subfigure images) and the corresponding subcaption text (rather than the entire caption, which describes not one but many (sub)figure images).

My main goal with SPIRES was just to use its facilities for extracting structured data from text. The core of this would be finding the subfigure text parts within the full caption, along with the subfigure IDs. The API makes it pretty simple to get back a data structure, given a chunk of text and a small data model.
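
As a rough illustration (this is not the SPIRES template itself; the class and the regex baseline below are just placeholders for the kind of structure I'd ask it to return):

```python
import re
from dataclasses import dataclass

@dataclass
class Subcaption:
    panel_id: str  # e.g. "A"
    text: str      # caption text specific to that panel

def naive_split(caption: str) -> list:
    """Naive baseline: split on patterns like 'A,' / 'B,' at panel boundaries.

    A real extraction (e.g. via SPIRES) should be far more robust than this regex.
    """
    parts = re.split(r"(?<=[;.])\s+(?=[A-Z][,.)]\s)", caption)
    subs = []
    for part in parts:
        m = re.match(r"([A-Z])[,.)]\s*(.*)", part)
        if m:
            subs.append(Subcaption(panel_id=m.group(1), text=m.group(2)))
    return subs
```

E.g. `naive_split("A, habitus, dorsal view; B, head, frontal view.")` yields two `Subcaption` records, one per panel.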

hlapp commented 11 months ago

> Here's one representative example: https://dx.doi.org/10.5281/zenodo.5846642

Here's the result of trying to "parse" the figure caption into subcaptions by simply prompting ChatGPT: https://chat.openai.com/share/f5eb204b-31bd-4022-8adc-a20e6d3fe83a

balhoff commented 11 months ago

Another attempt with a different prompt: https://chat.openai.com/share/3473e6cd-9853-4ec7-b932-f6545e34ab0c

That one does a good job of including the relevant information for each subpanel in its description.

samuelstevens commented 11 months ago

> Rather, for CLIP training the main milestone outcome would, I think, be a dataset where each data item is a pair of one subfigure image (rather than the full figure image consisting of 5-10 subfigure images) and the corresponding subcaption text (rather than the entire caption, which describes not one but many (sub)figure images).

I agree with this completely. In general, we don't want highly structured text captions with CLIP, because the web-scale data on which CLIP models were pretrained doesn't have structured captions. So captions made of plain sentences, with overlapping subcaptions, etc., are fine for CLIP.
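
For instance, a free-text subcaption goes straight through a pretrained CLIP model with no special structure needed. A quick sketch with open_clip (the checkpoint tag, file name, and caption below are purely illustrative):

```python
import torch
import open_clip
from PIL import Image

# Any pretrained checkpoint works for a smoke test; this tag is one of the LAION-2B ViT-B/32 releases.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("subfigure_A.png")).unsqueeze(0)  # a cropped subpanel (hypothetical file)
text = tokenizer(["Head of the holotype in frontal view."])     # plain-sentence subcaption

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

print((img_feat @ txt_feat.T).item())  # cosine similarity between subfigure and subcaption
```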