dginev / nnexus

Auto-linking for Mathematical Concepts for PlanetMath.org, Wikipedia, and beyond.
MIT License

"instance" linked in "For instance" #43

Open holtzermann17 opened 11 years ago

holtzermann17 commented 11 years ago

Target of the link is: http://planetmath.org/substitutionsinpropositionallogic

Source article (place where the link lives) is: http://planetmath.org/topicentryoncomplexanalysis

For instance, putting imaginary numbers into the power series for the exponential function, we find...

dginev commented 11 years ago

Moving to 3.0 milestone.

dginev commented 10 years ago

I am revisiting the accuracy issues at the moment. This report reduces to a linguistic deficiency: NNexus does not currently recognize prepositional phrases. It is easy to imagine a document where "instance" is used as a term, and additionally "for instance" is used separately to introduce an example. So this is a legitimate bug that requires enhancing NNexus with more linguistic capabilities.

With the exception of phrases containing pronouns, most prepositional phrases form a closed set in English and are relatively well captured by Wiktionary (which lists 701 of them).

I was reading recently that the mantra which works for a lot of startups is "do the simplest approach first", so introducing a hardcoded list of phrases to avoid (ignoring pronoun variation for the moment) could be the easiest solution here. The "correct" solution, of course, is to have part-of-speech information and only treat regular Noun Phrases (NPs) as concept candidates. But we don't have a reliable part-of-speech tagger for mathematical texts yet.
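To make the hardcoded-list idea concrete, here is a minimal sketch in Python (not NNexus code, which is Perl). The phrase list, the function names, and the concept list are all made up for illustration; a real list would come from something like the Wiktionary category mentioned above.

```python
import re

# Hypothetical stop-phrase list, for illustration only.
STOP_PHRASES = ["for instance", "for example", "in particular", "in general"]


def blocked_spans(text, phrases=STOP_PHRASES):
    """Return (start, end) character spans covered by a stop phrase."""
    spans = []
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end()))
    return spans


def find_candidates(text, concepts, phrases=STOP_PHRASES):
    """Return sorted (start, end, concept) matches that lie outside stop phrases."""
    blocked = blocked_spans(text, phrases)
    hits = []
    for concept in concepts:
        pattern = r"\b" + re.escape(concept) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip a match that sits entirely inside a stop phrase.
            inside = any(s <= m.start() and m.end() <= e for s, e in blocked)
            if not inside:
                hits.append((m.start(), m.end(), concept))
    return sorted(hits)


text = "For instance, the power series gives an instance of the rule."
hits = find_candidates(text, ["instance", "power series"])
print([concept for _, _, concept in hits])  # ['power series', 'instance']
```

Only the second "instance" survives as a link candidate; the one inside "For instance" is masked. Pronoun variation is not handled, as noted above.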

dginev commented 10 years ago

So, on the POS tagger front, the conventionally accepted "best" free tool is the Stanford tagger. An important discovery I just made is that someone has gone through the effort of creating a self-contained Perl wrapper around the Stanford Core NLP tools (133 MB in size!) and published it on CPAN. So that makes it easy to acquire a tagger as a dependency. Currently trying that out.

dginev commented 10 years ago

But it also requires a Java SDK, so NNexus gets a total of ~200 MB heavier in size. It will be interesting to see whether we gain anything as a result.

holtzermann17 commented 10 years ago

@dginev - if we ever get around to integrating the "recommender system" that I worked on in my Day Job (2013 edition), https://github.com/kmi/decipher, we would also have a Java dependency there. I can imagine having a dedicated (virtual) server for running web services.

dginev commented 10 years ago

I have found a possibly perfect match for augmenting NNexus with POS tags, namely the SENNA toolkit. It is both efficient and has state-of-the-art precision and recall, which makes it a perfect fit. Using the native C implementation, I could process a large arXiv document (6500 words) in 3 seconds, including the parsing overhead.

So I have the feeling that, for regular NNexus jobs, the POS parsing might add only an insignificant hit to the overall runtime. I am currently writing a Perl wrapper for the library, so that NNexus can easily leverage SENNA. My other experiments were performed in the context of LLaMaPUn and my general PhD work.
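As a rough illustration of how POS tags could restrict concept candidates to noun phrases, here is a minimal Python sketch. It assumes the text has already been tagged with Penn Treebank-style tags; the tagged sentence below is written by hand, not produced by SENNA, and the function name is hypothetical.

```python
def noun_phrase_candidates(tagged):
    """Collect runs of adjectives/nouns that end in a noun.

    `tagged` is a list of (token, tag) pairs with Penn Treebank-style
    tags (NN*, JJ*, VB*, ...). This is a crude stand-in for a real
    NP chunker.
    """
    candidates = []
    run = []
    # A sentinel pair flushes the final run.
    for token, tag in list(tagged) + [("", "END")]:
        if tag.startswith("NN") or tag.startswith("JJ"):
            run.append((token, tag))
            continue
        # The run ended: drop trailing adjectives so it ends on a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            candidates.append(" ".join(tok for tok, _ in run))
        run = []
    return candidates


# Hand-tagged example: "set" appears once as a verb and once as a noun.
tagged = [
    ("We", "PRP"), ("set", "VBD"), ("the", "DT"),
    ("empty", "JJ"), ("set", "NN"), ("aside", "RB"),
]
print(noun_phrase_candidates(tagged))  # ['empty set']
```

Here the verb "set" is never offered as a concept candidate, while the noun phrase "empty set" is. A real integration would take the tags from the tagger's output rather than from a hand-written list.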