Daniel-Mietchen / ideas

A dumping ground for halfbaked ideas, some of which will hopefully be worked on soon

Build a tool to find closest match in DLMF for a given mathematical expression #1777

Open Daniel-Mietchen opened 12 months ago

Daniel-Mietchen commented 12 months ago

Not sure whether this already exists, but if I have some expressions like the ones below (from here)

[Screenshot from 2023-07-21 00-57-47]

in a machine-friendly format, then it would be nice to see how they or their components could be mapped to the Digital Library of Mathematical Functions (DLMF). Such a mapping could serve as a bridge to support finding other articles that contain similar mathematical constructs, as per

Daniel-Mietchen commented 12 months ago

Pinging @physikerwelt who I presume has thought about this before.

Daniel-Mietchen commented 12 months ago

Apart from finding articles, such a normalized representation of mathematical concepts could perhaps also be a useful component for a tool for finding software that does something with these concepts, or even dedicated hardware (should it exist) for computing such things.

physikerwelt commented 8 months ago

What is the context of this ticket? What is the definition of "close" (see https://www.nist.gov/publications/evaluation-similarity-measure-factors-formulae-based-ntcir-11-math-task)? I think there is no general answer. It depends on which aspect (according to the definition of @malteos) is important to the user who is looking for similar formulas.

Daniel-Mietchen commented 8 months ago

@physikerwelt The background is that I am interested in browsing the literature by mathematical formulas, as per

When I came across that paper, I was wondering which mathematical systems similar to the one described by their equations might have been explored in other papers before, perhaps even in a completely different context. Yet I do not know of an efficient mechanism by which I could find such papers based purely on the formulas / expressions or some abstract representation thereof. DLMF at least assists with the abstract representation bit, yet I am not aware of it having been used for literature search, hence the ticket.

In terms of defining similarity, I agree that there are multiple ways to go about that, and your paper illustrates this nicely. For now, I would be happy to use tooling based on any facet of similarity or even a combined measure as per Zhang and Youssef.

In short, if we have swMATH to indicate which software was used in a useful subset of papers, it is probably not a far-fetched idea to think about a system that indicates which formulas were used in such a set of papers. While exact matches of formulas may not work well in cases like my example above, something that maps onto a taxonomy like DLMF would seem like a good starting point.

malteos commented 8 months ago

My recommendation would be to take a math-optimized LLM, like Llemma, feed the formulas through the model, take the embeddings, and do a k-nearest-neighbor search in the embedding space. Given that recent LLMs have become quite good at handling math, I am confident that this would already produce reasonably good results.
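
To make the retrieval step concrete, here is a minimal sketch of a k-nearest-neighbor lookup over formula embeddings. Everything in it is an assumption for illustration: `embed` is a toy stand-in (hashed character trigrams) for the embeddings one would take from a math-optimized LLM, the names `embed`, `cosine`, `nearest` and `corpus` are hypothetical, and the three corpus entries are typed in by hand rather than pulled from DLMF.

```python
import math
import zlib

DIM = 256  # size of the toy embedding space (arbitrary choice)


def embed(formula: str) -> list[float]:
    """Toy embedding: hashed character trigrams, scaled to unit length.

    ASSUMPTION: in the approach suggested above, this function would
    instead return an embedding taken from a math-optimized LLM.
    """
    text = "".join(formula.split())  # ignore whitespace differences
    vec = [0.0] * DIM
    for i in range(max(len(text) - 2, 1)):
        gram = text[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash()
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; a plain dot product because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def nearest(query: str, corpus: dict[str, str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k corpus entries most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(tex)), label) for label, tex in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical reference entries (hand-typed LaTeX, not DLMF data).
corpus = {
    "Bessel's equation": r"z^2 w'' + z w' + (z^2 - \nu^2) w = 0",
    "Airy's equation": r"w'' = z w",
    "Legendre's equation": r"(1 - z^2) w'' - 2 z w' + \nu(\nu+1) w = 0",
}

if __name__ == "__main__":
    for score, label in nearest(r"x^2 y'' + x y' + (x^2 - n^2) y = 0", corpus):
        print(f"{score:.3f}  {label}")
```

Swapping in real model embeddings would only change `embed`; the ranking code stays the same. For a corpus much larger than a few thousand formulas, the linear scan in `nearest` would likely need to be replaced by an approximate nearest-neighbor index.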