Open Daniel-Mietchen opened 1 year ago
Pinging @physikerwelt who I presume has thought about this before.
Apart from finding articles, such a normalized representation of mathematical concepts could perhaps also be a useful component for a tool for finding software that does something with these concepts, or even dedicated hardware (should it exist) for computing such things.
What is the context of this ticket? And how do you define "close" (see https://www.nist.gov/publications/evaluation-similarity-measure-factors-formulae-based-ntcir-11-math-task)? I think there is no general answer: it depends on which aspect (in the sense of @malteos's definition) matters to the user looking for similar formulas.
@physikerwelt The background is that I am interested in browsing the literature by mathematical formulas, as per
When I came across that paper, I was wondering which mathematical systems similar to that described by their equations might have been explored in other papers before, perhaps even in a completely different context. Yet I would not know an efficient mechanism by which I could find such papers based purely on the formulas / expressions or some abstract representation thereof. DLMF at least assists with the abstract representation bit, yet I am not aware of it having been used for literature search, hence the ticket.
In terms of defining similarity, I agree that there are multiple ways to go about that, and your paper illustrates this nicely. For now, I would be happy to use tooling based on any facet of similarity or even a combined measure as per Zhang and Youssef.
In short, if we have SwMATH to indicate which software was used in a useful subset of papers, it is probably not far-fetched to think about a system that indicates which formulas were used in such a set of papers. Exact matches of formulas may not work well in cases like my example above, but something that maps onto a taxonomy like DLMF would seem like a good starting point.
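To make the taxonomy idea a bit more concrete, here is a minimal sketch of what mapping formula components onto DLMF chapters could look like. The lookup table and the function name `dlmf_chapters` are hypothetical and hand-written for illustration; only the chapter numbers and titles are taken from DLMF. It also assumes that a surface token such as `\Gamma` really denotes the special function, which will often be wrong in real papers.

```python
import re

# Hypothetical, hand-written table: LaTeX surface token -> (DLMF chapter, title).
# Assumption: the token denotes the special function named here. In real
# papers \Gamma, for instance, is used for many other things as well.
DLMF_CHAPTERS = {
    r"\Gamma": ("5", "Gamma Function"),
    "erf": ("7", "Error Functions, Dawson's and Fresnel Integrals"),
    r"\zeta": ("25", "Zeta and Related Functions"),
}


def dlmf_chapters(latex):
    """Return the DLMF chapters whose tokens occur in a LaTeX formula."""
    # Split into LaTeX commands (\Gamma) and plain alphabetic words (erf).
    tokens = re.findall(r"\\[A-Za-z]+|[A-Za-z]+", latex)
    found = {DLMF_CHAPTERS[t] for t in tokens if t in DLMF_CHAPTERS}
    # Sort numerically by chapter so the output is stable.
    return sorted(found, key=lambda chapter: int(chapter[0]))


if __name__ == "__main__":
    for number, title in dlmf_chapters(r"\Gamma(s) \zeta(s)"):
        print(f"https://dlmf.nist.gov/{number}  {title}")
```

Papers could then be grouped by the chapters their formulas touch, which is much coarser than formula similarity but would already allow browsing by mathematical concept.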
My recommendation would be to take a math-optimized LLM, like Llemma, feed the formulas through the model, take the embeddings, and do a k-nearest-neighbor search in the embedding space. Given that recent LLMs have become quite good at handling math, I am confident that this would already produce reasonably good results.
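As a rough sketch of the search half of that pipeline: the `embed` function below is NOT Llemma, just a toy stand-in (hashed character trigrams) so that the example runs without downloading a model. In a real setup it would be replaced by embeddings taken from the model. The names `embed`, `cosine`, and `knn` are made up for this illustration, and the brute-force search would need a proper index for a large corpus.

```python
import math
import zlib

DIM = 64  # toy dimensionality; real model embeddings are much larger


def embed(formula):
    """Toy stand-in for a model embedding: hashed character-trigram counts."""
    vec = [0.0] * DIM
    padded = f"  {formula}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(trigram) % DIM] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query, corpus, k=3):
    """Return the k corpus formulas most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(formula)), formula) for formula in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(formula, score) for score, formula in scored[:k]]


if __name__ == "__main__":
    corpus = [
        r"\frac{dx}{dt} = a x - b x y",
        r"\frac{dy}{dt} = c x y - d y",
        r"E = m c^2",
    ]
    for formula, score in knn(r"\frac{du}{dt} = a u - b u v", corpus, k=2):
        print(f"{score:.3f}  {formula}")
```

With model embeddings in place of the toy `embed`, the same `knn` call would rank formulas by whatever notion of similarity the model has learned, rather than by shared character sequences.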
Not sure whether that already exists but if I have some expressions like the ones below (from here)
in a machine-friendly format, then it would be nice to see how they or their components could be mapped to the Digital Library of Mathematical Functions. Such a mapping could serve as a bridge to support finding other articles that contain similar mathematical constructs, as per
1 .