dginev / nnexus

Auto-linking for Mathematical Concepts for PlanetMath.org, Wikipedia, and beyond.
MIT License

Improve Longest Token Matching algorithm #10

Open dginev opened 11 years ago

dginev commented 11 years ago

I would like to revisit and better understand the current NNexus algorithm for longest token matching (LTM), and try to contribute further enhancements from corpus linguistics, e.g. term-likelihood analysis [1] and suffix arrays [2].

[1] http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/ijodl2000.pdf
[2] http://www.cs.jhu.edu/~kchurch/wwwfiles/CL_suffix_array.pdf
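For reference, the core idea of longest token matching can be sketched as a greedy left-to-right scan that prefers the longest dictionary phrase at each position. This is a minimal illustrative sketch, not the actual NNexus implementation; the `concepts` dictionary, the `max_len` window, and the token-based matching are all simplifying assumptions.

```python
def longest_token_match(tokens, concepts, max_len=5):
    """Greedy longest-match sketch: `concepts` maps lowercase phrases
    to link targets; returns (start, end, phrase) spans over `tokens`.
    Hypothetical simplification of what an LTM linker might do."""
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate phrase first, shrinking the window.
        for j in range(min(len(tokens), i + max_len), i, -1):
            phrase = " ".join(tokens[i:j]).lower()
            if phrase in concepts:
                found = (i, j, phrase)
                break
        if found:
            matches.append(found)
            i = found[1]  # skip past the matched span
        else:
            i += 1
    return matches

# Toy example: "abelian group" wins over the shorter match "group".
concepts = {"abelian group": "/entry/AbelianGroup", "group": "/entry/Group"}
tokens = "every abelian group is a group".split()
print(longest_token_match(tokens, concepts))
# → [(1, 3, 'abelian group'), (5, 6, 'group')]
```

A suffix-array approach, as in [2], would instead index the corpus itself, making it cheap to count and rank candidate terms before linking them.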

dginev commented 4 years ago

General note, 6 years later: improving the recognition algorithm (and ideally adding an evaluation test harness) would be the best way to bring NNexus into the world of mainstream tooling.

It's now tempting to speak of neural models for named entity recognition that could be transferred over, but math concepts still lack an adequate large-scale dataset for supervised learning. So leveraging existing state-of-the-art results is not as immediate as I would like.

Also, we've discussed in the past that it would be a matter of simple engineering to make the precise matching algorithm a customization option in NNexus, so that we always preserve the current strategy in the code base and let end users decide which approach suits their needs best (also allowing backwards compatibility). I'm fully on board with that as a practical direction.
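One way such a customization option could look: a registry of named matcher strategies, selected by configuration. This is purely a hypothetical sketch of the idea (the names `register_matcher`, `annotate`, and `exact_token` are invented for illustration and do not exist in NNexus):

```python
from typing import Callable, Dict, List, Tuple

# A matcher takes tokens plus a concept dictionary and returns spans.
Matcher = Callable[[List[str], Dict[str, str]], List[Tuple[int, int, str]]]

MATCHERS: Dict[str, Matcher] = {}

def register_matcher(name: str):
    """Decorator registering a strategy under a user-selectable name."""
    def wrap(fn: Matcher) -> Matcher:
        MATCHERS[name] = fn
        return fn
    return wrap

@register_matcher("exact_token")
def exact_token(tokens, concepts):
    # Trivial baseline strategy: link single-token concepts only.
    return [(i, i + 1, t) for i, t in enumerate(tokens) if t in concepts]

def annotate(tokens, concepts, strategy="exact_token"):
    """Dispatch to whichever matching strategy the user configured."""
    return MATCHERS[strategy](tokens, concepts)
```

The current LTM strategy would stay registered under its own name as the default, so existing behavior is preserved while new algorithms can be swapped in per deployment.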