OneZoom / tree-build

Scripts for assembling the tree, metadata and downstream data products such as popularity and popular images

Using AI to extract Newick from images #45

Open · davidebbo opened this issue 8 months ago

davidebbo commented 8 months ago

Scientific papers often share phylogenetic trees as images, which makes them hard to use without a conversion to a machine-readable format like Newick.

I did a brief experiment to see if AI engines could help with this, and the initial results are not good! I pointed the engine to the following image (from https://github.com/OneZoom/OZtree/discussions/620), with the prompt "Turn this tree into Newick format".

[image: the phylogenetic tree figure from the linked OZtree discussion]

Google Bard came up with the following, which is completely wrong:

(Bandicota indica:0.0,
(Lepus tolai:0.0,
(Cebus albifrons:0.0,
(Hylobates lar:0.1,
(Pygathrix nemaeus:0.1,(Galeopterus variegatus:0.1,Daubentonia madagascariensis:0.1):0.0):0.0):0.1):0.0,
(Eulemur mongoz:0.1,
(Nycticebus coucang:0.1,(Tarsius bancanus:0.1,(Tupaia belangeri:0.0,
(Tupaia minor:0.0,(Tupaia montana:0.0,(Tupaia splendidula:0.1,(Tupaia tana:0.0,Tupaia nicobarica:0.0):0.1):1.0):1.0):1.0):1.0):1.0):0.0):0.1):0.0;

Microsoft Copilot (based on GPT, though it's not clear whether 3.5 or 4) did even worse, getting confused by the prefixes before the species names and generally ignoring many of the leaves:

(((KT028097:0.000000,Bandicota_indica:0.000000):0.000000,(KT028098:0.000000,Bandicota_bengalensis:0.000000):0.000000):0.000000,((KT028099:0.000000,Bandicota_savilei:0.000000):0.000000,(KT028100:0.000000,Bandicota_maxima:0.000000):0.000000):0.000000):0.000000,((KT028101:0.000000,Tupaia_nicobarica:0.000000):0.000000,(KT028102:0.000000,Tupaia_belangeri:0.000000):0.000000):0.000000);

I was a bit surprised that they did so poorly, as it doesn't feel like that hard a problem. Maybe better results can be obtained with different prompting, or with different engines.

This is a good area to explore, as being able to extract Newicks directly from papers could potentially be a game changer.
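For reference, programmatic access would look roughly like the sketch below. It assumes the OpenAI Python client and a vision-capable model such as gpt-4o (the experiments above used the free Bard and Copilot web UIs, not an API), and the image path is hypothetical.

```python
# Minimal sketch: send a tree figure to a vision-capable chat model and ask
# for Newick. Assumes the OpenAI Python client and a model such as "gpt-4o";
# the thread's experiments used the Bard and Copilot web UIs instead.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("figure5.png", "rb") as f:  # hypothetical path to the tree figure
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Turn the phylogenetic tree in this image into Newick "
                     "format. Output only the Newick string."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```

In principle the same call could be retried with a refined prompt whenever the returned string fails to parse as Newick.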

hyanwong commented 8 months ago

I suspect other people might be working on this too. I will ask the OpenTree folks.

davidebbo commented 8 months ago

@hyanwong yes, that would be interesting, thanks!

I went back and tried to iterate with Bard, pointing out an example of what is wrong in its tree. It always agrees, saying things like "You are absolutely correct. The corrected Newick format for the tree is ...". It then fixes that one little thing, but the rest remains just as random.

Worth noting that I'm using free offerings. It could be that the premium offerings for $20 per month perform better, and maybe at some point I'll try that!

lentinj commented 8 months ago

Would it help to convert the diagram to SVG first? I'd assume that most of the time you're dealing with a PDF of the paper, which Inkscape could happily convert. Even if it's a bitmap, you could use Inkscape to trace/OCR it. At least then there's a series of tokens to reason about. You could even write an algorithm to parse the SVG paths, but I suspect it wouldn't withstand reality.
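As a rough illustration of what those tokens look like, here is a minimal sketch that lists the text labels and drawing primitives in an SVG exported from Inkscape. The file name is hypothetical, and it doesn't resolve group transforms, so coordinates may not be directly comparable.

```python
# Minimal sketch: dump the text labels and drawing primitives from an SVG
# exported from Inkscape, as a starting point for reasoning about the tree.
# The file name is hypothetical; real figures may nest elements inside
# groups with transforms, which this does not resolve.
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

svg = ET.parse("figure5.svg")
root = svg.getroot()

# Tip labels (and any other text), with their coordinates when present.
for text in root.iter(f"{SVG_NS}text"):
    label = "".join(text.itertext()).strip()
    print("text", text.get("x"), text.get("y"), repr(label))

# Branches are usually drawn as <line>, <path> or <polyline> elements.
for tag in ("line", "path", "polyline"):
    for elem in root.iter(f"{SVG_NS}{tag}"):
        print(tag, elem.attrib)
```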

A tool to help a human do it might be most useful though IMO, where you could click each sibling node to create the level up. That wouldn't take very long to use, and it'd be pretty obvious if you've made a mistake and the overlaid tree doesn't match the tree underneath.

davidebbo commented 8 months ago

Many papers are behind paywalls that I can't access, and those that I can access are usually web pages rather than PDFs. For example, let's look at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6731702/.

In this case, the relevant diagram is figure 5. Some interesting challenges:

- To view it at full res, you need to go into the viewer, which downloads it in small chunks

All that to say, just getting an image to work with is not straightforward. But let's assume we have that solved somehow.

The idea of converting the image to an SVG is interesting. But the challenge here is that the cladogram can come in so many different forms (orientation, style, labels, edge lengths, etc.) that writing an algorithm to parse it could be a losing battle.

And I think that's where AI potentially makes a lot of sense, as (given the right training) it can adapt to an unpredictable set of variations without us needing to write a complex algorithm. We humans can certainly do this quite easily, and the pitch is that AI should be able to as well.

> A tool to help a human do it might be most useful though IMO, where you could click each sibling node to create the level up. That wouldn't take very long to use, and it'd be pretty obvious if you've made a mistake and the overlaid tree doesn't match the tree underneath.

I guess I'm looking for a higher level of automation here, e.g. input the URL to a paper and get back a set of Newick trees.
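As a sketch of the very first step of such a pipeline, the pages of a downloaded paper could be rendered to images and then fed to a vision model as in the earlier sketch. This assumes the pdf2image package (which needs poppler installed) and a hypothetical file name; detecting which pages or figures actually contain trees is not handled here.

```python
# Minimal sketch of the first step of an automated pipeline: render every
# page of a downloaded paper to a PNG that could then be sent to a vision
# model. Uses the pdf2image package (requires poppler); the file name is
# hypothetical, and picking out the pages that actually contain trees is
# left as a separate problem.
from pdf2image import convert_from_path

pages = convert_from_path("paper.pdf", dpi=300)
for i, page in enumerate(pages, start=1):
    page.save(f"page_{i:02d}.png")
```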

lentinj commented 8 months ago

> To view it at full res, you need to go into the viewer, which downloads it in small chunks

Download the PDF (link top-right). Importing page 14 into Inkscape reveals how they made the diagram: they typed out the text on its side, then drew the tree in a paint program to match the text spacing.

You get to skip any OCR with a PDF, but there isn't really a logical connection between the text and the tree to work with, for you or the AI.
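To make use of the text that is already in the PDF, a minimal sketch with PyMuPDF (the fitz module) could dump the labels from the figure page, so at least the expected species names are known even if the topology isn't. The file name is hypothetical; page 14 is taken from the comment above.

```python
# Minimal sketch: grab the text (tip labels, accession numbers, etc.) from
# the page of the PDF that carries the tree figure, skipping OCR entirely.
# Uses PyMuPDF ("fitz"); the file name is hypothetical, and page 14 is the
# page mentioned above (zero-indexed here).
import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")
page = doc[13]  # page 14
for line in page.get_text().splitlines():
    line = line.strip()
    if line:
        print(line)
```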

> writing an algorithm to parse it could be a losing battle

Yes, definitely. Trying to use AI makes a lot more sense. My guess is the battle will be getting it to regard the tree as interesting, and not just a background image. Prompt engineering might be what to do here, but I'm sure you already know that :)

davidebbo commented 8 months ago

Thanks for the tips. I was able to import the page into Inkscape (new to me) and turn it into an SVG.

My take is that the AI can do better if it looks at the 'cooked' image like a human would, rather than trying to reason from separate text and tree inputs.

Yes, it probably comes down to using the right AI and good prompt engineering. I'll try to go deeper into that when I get some cycles.
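One cheap way to keep that iteration honest would be to validate whatever Newick the model returns automatically. Below is a minimal sketch using DendroPy with an illustrative, hypothetical expected-label set; in practice the expected labels could come from the PDF text as sketched earlier.

```python
# Minimal sketch: sanity-check a Newick string returned by a model by
# parsing it and comparing its tip labels to the species names we expect
# from the figure. DendroPy is one possible parser; the Newick string and
# expected-label set here are illustrative only.
import dendropy

def check_newick(newick, expected_labels):
    """Parse a candidate Newick and report how its tips differ from expectations."""
    tree = dendropy.Tree.get(data=newick, schema="newick",
                             preserve_underscores=True)
    found = {leaf.taxon.label for leaf in tree.leaf_node_iter()}
    print("missing from tree: ", sorted(expected_labels - found))
    print("unexpected in tree:", sorted(found - expected_labels))

# Hypothetical subset of the species visible in the figure.
check_newick(
    "((Bandicota_indica,Lepus_tolai),(Cebus_albifrons,Hylobates_lar));",
    {"Bandicota_indica", "Lepus_tolai", "Cebus_albifrons",
     "Hylobates_lar", "Tupaia_belangeri"},
)
```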