dginev / nnexus

Auto-linking for Mathematical Concepts for PlanetMath.org, Wikipedia, and beyond.
MIT License

javascript:void(0) - docs, alt text, etc. #34

Closed holtzermann17 closed 11 years ago

holtzermann17 commented 11 years ago

First of all, kudos on getting this JavaScript stuff working (I guess that isn't quite JOBAD integration, but it's a great start).

Here's my example article: http://metameso.org/beta/bitesizedsandwich

Links are to:

... I initially find 2 links curious since I'm just linking against one domain (PlanetMath), but indeed, the article on ordinals defines these synonyms:

So I think the 2 links are definitely a "feature", but again, some alt text (with the titles and ideally the MSC codes of the 2 articles) would be great. Also, I'd assume that we should eventually get rid of the link to the logic article based on an MSC rule, given that 54C99 is quite far away from 03F15 and 03E10.

dginev commented 11 years ago

There are no alt-tags for HTML anchors. In a way, you probably can't make it prettier than this and still have the same functionality, without using at least a single JavaScript function. Such links are quite common on sites that use JavaScript pop-ups, e.g. a lot of ticket-booking sites of airline companies use such links to pop-up various info menus (allowed baggage and whatnot).

My key idea was - make something workable that works on any website (e.g. when using the NNexus Glasses) and allow people to customize the behavior - it is really easy to override the way things are styled and presented specifically in Planetary from the PHP code, you just need the right CSS and JavaScript bound on a.nnexus_concepts and a.nnexus_concept.

dginev commented 11 years ago

As to the second link being irrelevant, thanks, this is useful information, I will add it to my disambiguation testbed.

holtzermann17 commented 11 years ago

> you just need the right CSS and JavaScript bound on a.nnexus_concepts and a.nnexus_concept.

@tkw1536 is this something that JOBAD could help with?

dginev commented 11 years ago

@holtzermann17 JOBAD can certainly hook into the elements and override whatever you need; the real question is what functionality you want to see on these "multi-links".

dginev commented 11 years ago

Here are similarity metrics between the MSC classes of interest above.

The article is in 54C99, the ordinals definition of continuous function is in 03F/E, the continuous article is in all of 26A15, 54C05, 81-00, 82-00, 83-00, 46L05.

It bears mention that the current algorithm disregards any information given in the article (i.e. the 54C99) and is instead trying to provide a well-scoring cluster of concepts.

Here is a table of the similarities:

CC1   | CC2   | Score
03    | 26    | 0.01952391
03    | 54    | 0.04870472
03    | 81    | 0.01361661
03    | 82    | 0.00071465
03    | 83    | 0.00278636

The second row gives a log10 penalty of -1.3, which is essentially tiny for my current algorithm. That raises a valid question: is my algorithm being too generous with the logarithmic penalty? I can also use a natural log, which would turn the penalty into a somewhat harsher -3.0...

I will think a bit more carefully on this example and adjust my formula to something more penalizing when the evidence is just a single phrase. Note that "continuous function" has length 19, so it contributes a weight of 12 (=19-7) which would overwhelm most log penalties.

What I fear is that if I make the penalties for different categories too harsh, we will have very partial recall. But the golden middle is somewhere out there...
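To make the magnitudes concrete, here is a minimal Python sketch that takes the 03/54 score from the table and prints its logarithm in three common bases, next to the linear length weight of "continuous function". The offset of 7 is taken from the "19-7" arithmetic above; the function and constant names are illustrative and are not NNexus code.

```python
import math

SCORE_03_54 = 0.04870472  # similarity between MSC 03 and 54, copied from the table
LENGTH_OFFSET = 7         # constant from the "19-7" arithmetic; its origin is not stated in the thread


def log_penalties(score: float) -> dict:
    """Return the logarithm of a similarity score in three common bases."""
    return {
        "log2": math.log2(score),
        "ln": math.log(score),
        "log10": math.log10(score),
    }


def linear_weight(phrase: str) -> int:
    """Linear length weight: character count minus the fixed offset."""
    return len(phrase) - LENGTH_OFFSET


penalties = log_penalties(SCORE_03_54)
weight = linear_weight("continuous function")

for base, value in penalties.items():
    print(f"{base:>5}: {value:.2f}")
print(f"linear weight: {weight}")
```

The three bases give roughly -4.36, -3.02 and -1.31, so the weight of 12 is larger in magnitude than the penalty whichever base is used.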

dginev commented 11 years ago

Well, it's a simple switch that achieves not linking to the logic (03) definition - instead of using a linear weight for the length, I can use a geometric one:

linear:

w(continuous function) = 19 -7 = 12

geometric:

w(continuous function) = (19-7)/7 = 1.7

The second successfully avoids including the logic link, but it also misses a variety of potentially relevant links from different MSC classes. It's hard to have our cake and eat it too - we either get "boxed in" to a very small group of MSC classes, or we end up overlinking with a variety of definitions from said classes...

My first weight metric was based on counting the concepts, rather than looking at their lengths, and the geometric metric is a bit of both - it ends up close to simply counting the concepts, but it does include a weight boost as the length increases.

One approach is to think about what users would prefer - fewer but certainly relevant links, or more widespread but potentially over-informative links.
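The two weightings from this comment can be written down as a short Python sketch. It only reproduces the arithmetic shown for "continuous function"; the constant 7 is taken from that example, and the function names are assumptions rather than NNexus identifiers.

```python
LENGTH_OFFSET = 7  # constant from the worked example; what it represents is not stated in the thread


def linear_weight(phrase: str) -> float:
    """Character count minus the offset: grows by 1 per extra character."""
    return float(len(phrase) - LENGTH_OFFSET)


def geometric_weight(phrase: str) -> float:
    """Same difference, scaled by the offset: stays close to a per-concept count."""
    return (len(phrase) - LENGTH_OFFSET) / LENGTH_OFFSET


phrase = "continuous function"
print(f"linear:    {linear_weight(phrase):.1f}")
print(f"geometric: {geometric_weight(phrase):.1f}")
```

Because the scaled version divides by 7, a 19-character phrase contributes under 2 instead of 12, which is why it behaves almost like counting concepts.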

dginev commented 11 years ago

Then, we can look at "antipodal point", category 51:

CC1   | CC2   | Score
51    | 03    | 0.00869706
51    | 54    | 0.00795024

That concept ought to be linked together with "continuous function" from 54, in an ideal world. The log penalty between 51 and 54 is -4.8.

And you see the inverse situation - the linear weights would overcome that penalty and the two concepts would both be linked (but also alongside the definition from 03), or the geometric metric would be too penalizing and the concept won't be linked.
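The trade-off can be checked numerically with a small Python sketch. It rests on two assumptions that the thread does not confirm: the penalty is the natural log of the similarity score (chosen because it reproduces the -4.8 quoted above), and a candidate is kept when its weight plus its penalty is positive. The names are hypothetical.

```python
import math

LENGTH_OFFSET = 7  # constant from the "19-7" example earlier in the thread

# (phrase, similarity of the candidate's MSC class to class 54), scores copied from the tables
CANDIDATES = {
    "continuous function (03 definition)": ("continuous function", 0.04870472),
    "antipodal point (51 definition)": ("antipodal point", 0.00795024),
}


def linear_weight(phrase: str) -> float:
    return float(len(phrase) - LENGTH_OFFSET)


def geometric_weight(phrase: str) -> float:
    return (len(phrase) - LENGTH_OFFSET) / LENGTH_OFFSET


def is_kept(weight: float, similarity: float) -> bool:
    """Assumed decision rule: keep the candidate if weight + ln(similarity) > 0."""
    return weight + math.log(similarity) > 0


decisions = {}
for label, (phrase, similarity) in CANDIDATES.items():
    decisions[label] = {
        "linear": is_kept(linear_weight(phrase), similarity),
        "geometric": is_kept(geometric_weight(phrase), similarity),
    }
    print(label, decisions[label])
```

Under these assumptions the linear weight keeps both candidates, while the scaled weight drops both, matching the two outcomes described above.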

In a way, which metric one uses also depends on how big a textual snippet we expect to be working with - the smaller the text, the smaller the context and the more permissive the weighting ought to be. As the text size grows, the weights should become smaller and smaller, as we will have more data points as evidence.

Ah! But I can actually do that - I can make the weighting parametric in the text length. Interesting, let me experiment.
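One way such a text-length-dependent weight could look is sketched below in Python. This is purely a hypothetical illustration: the thread does not give a formula, so the blend between the linear and the scaled weight, the `SATURATION_WORDS` constant and all names are assumptions.

```python
LENGTH_OFFSET = 7        # constant from the "19-7" example in the thread
SATURATION_WORDS = 500   # hypothetical: text size at which the weight is fully damped


def parametric_weight(phrase: str, text_words: int) -> float:
    """Blend from the permissive linear weight (short texts) to the
    stricter scaled weight (long texts). Assumes len(phrase) > LENGTH_OFFSET."""
    linear = float(len(phrase) - LENGTH_OFFSET)
    geometric = linear / LENGTH_OFFSET
    # alpha is 0 for an empty text and saturates at 1 for long texts
    alpha = min(1.0, max(0.0, text_words / SATURATION_WORDS))
    return (1.0 - alpha) * linear + alpha * geometric


for words in (0, 100, 250, 500, 2000):
    print(words, round(parametric_weight("continuous function", words), 2))
```

For a short excerpt the phrase keeps nearly its full linear weight of 12, and for a long article it falls back to 12/7, so more evidence in the text means less influence per phrase.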

dginev commented 11 years ago

Hm, but when I revisit the results on the bigger article, rather than the excerpt, they seem somewhat satisfactory.

[NNexus::Classification] Eligible concepts: 76
[NNexus::Classification] Disambiguated concepts: 20
Linking "composition" with: http://planetmath.org/countingcompositionsofaninteger
Linking "function" with: http://planetmath.org/relationtheory
Linking "coordinate" with: http://planetmath.org/coordinatevector
Linking "coordinate" with: http://planetmath.org/frame
Linking "interval" with: http://planetmath.org/interval
Linking "intermediate value theorem" with: http://planetmath.org/intermediatevaluetheorem
Linking "closed ball" with: http://planetmath.org/metricspace
Linking "lying on" with: http://planetmath.org/incidencegeometry
Linking "hyperplane" with: http://planetmath.org/linearmanifold
Linking "hyperplane" with: http://planetmath.org/constructingnearlinearspacesfromexistingones
Linking "unit vector" with: http://planetmath.org/unitvector
Linking "bounded" with: http://planetmath.org/bounded
Linking "bounded" with: http://planetmath.org/topologyofthecomplexplane
Linking "bounded" with: http://planetmath.org/bounded1
Linking "measurable" with: http://planetmath.org/riemannmultipleintegral
Linking "antipodal points" with: http://planetmath.org/antipodal
Linking "continuous function" with: http://planetmath.org/continuous
Linking "Borsuk-Ulam theorem" with: http://planetmath.org/borsukulamtheorem
Final Annotation contains 18 concepts.

Here the only overlinked concepts are coordinate, hyperplane and bounded.
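The overlinked phrases can be read off the log mechanically. The Python sketch below parses the `Linking "..." with: ...` lines copied from the output above and reports every phrase that received more than one target; the regular expression and helper name are illustrative, not part of NNexus.

```python
import re
from collections import defaultdict

LOG = """\
Linking "composition" with: http://planetmath.org/countingcompositionsofaninteger
Linking "function" with: http://planetmath.org/relationtheory
Linking "coordinate" with: http://planetmath.org/coordinatevector
Linking "coordinate" with: http://planetmath.org/frame
Linking "interval" with: http://planetmath.org/interval
Linking "intermediate value theorem" with: http://planetmath.org/intermediatevaluetheorem
Linking "closed ball" with: http://planetmath.org/metricspace
Linking "lying on" with: http://planetmath.org/incidencegeometry
Linking "hyperplane" with: http://planetmath.org/linearmanifold
Linking "hyperplane" with: http://planetmath.org/constructingnearlinearspacesfromexistingones
Linking "unit vector" with: http://planetmath.org/unitvector
Linking "bounded" with: http://planetmath.org/bounded
Linking "bounded" with: http://planetmath.org/topologyofthecomplexplane
Linking "bounded" with: http://planetmath.org/bounded1
Linking "measurable" with: http://planetmath.org/riemannmultipleintegral
Linking "antipodal points" with: http://planetmath.org/antipodal
Linking "continuous function" with: http://planetmath.org/continuous
Linking "Borsuk-Ulam theorem" with: http://planetmath.org/borsukulamtheorem
"""

LINK_LINE = re.compile(r'^Linking "(?P<phrase>[^"]+)" with: (?P<url>\S+)$')


def targets_by_phrase(log: str) -> dict:
    """Group link targets by the phrase they were attached to."""
    targets = defaultdict(list)
    for line in log.splitlines():
        match = LINK_LINE.match(line.strip())
        if match:
            targets[match["phrase"]].append(match["url"])
    return dict(targets)


targets = targets_by_phrase(LOG)
overlinked = [phrase for phrase, urls in targets.items() if len(urls) > 1]
total_links = sum(len(urls) for urls in targets.values())
print(overlinked, total_links)
```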

I am actually willing to argue this linking is _correct_ for the article.

What would you say?

The variation of the algorithm which achieved this linking:

dginev commented 11 years ago

Btw, I committed the intermediate fixed index (just missing Mathworld) and the latest state described in the comment above, so feel free to update and test / play around for yourself.

holtzermann17 commented 11 years ago

> It's hard to have our cake and eat it too - we either get "boxed in" to a very small group of MSC classes, or we end up overlinking with a variety of definitions from said classes...

One option (which I think would arguably be "the PlanetMath way") would be to be generous with overlinks, but add JOBAD features to allow users to report the overlinked or mis-linked terms back to NNexus -- using crowdsourcing to improve the interlinking algorithms.

(I'm not yet sure how we would actually use this sort of feedback -- I know you've wanted to stay clear of individual-term-level curation, but it seems worth throwing this out there.)

A related feature would be to specify a threshold of confidence, so that we class links by confidence, but only show the more dubious ones if they are explicitly requested by the user. A sort of auto-adjustable NNexus glasses "prescription" (so to speak).
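A minimal Python sketch of such a "prescription" might look as follows. It assumes each link carries a confidence score, which NNexus is not shown to expose anywhere in this thread; the `Link` class, the `visible_links` helper and the confidence numbers are placeholders for illustration only. The URLs are the three "bounded" targets from the log above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    phrase: str
    url: str
    confidence: float  # hypothetical score in [0, 1]; placeholder values below


CANDIDATES = [
    Link("bounded", "http://planetmath.org/bounded", 0.9),
    Link("bounded", "http://planetmath.org/topologyofthecomplexplane", 0.5),
    Link("bounded", "http://planetmath.org/bounded1", 0.3),
]


def visible_links(links: list, threshold: float) -> list:
    """Show only links at or above the reader's chosen confidence threshold."""
    return [link for link in links if link.confidence >= threshold]


# A strict reader sees one link; a permissive reader sees all of them.
strict = visible_links(CANDIDATES, threshold=0.8)
permissive = visible_links(CANDIDATES, threshold=0.0)
print(len(strict), len(permissive))
```

Filtering at view time like this keeps the generous linking on the server side while letting each reader decide how many dubious links to see.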

dginev commented 11 years ago

> One option (which I think would arguably be "the PlanetMath way") would be to be generous with overlinks, but add JOBAD features to allow users to report the overlinked or mis-linked terms back to NNexus -- using crowdsourcing to improve the interlinking algorithms.

Since I have a heuristic algorithm right now, I can't use any feedback in a meaningful way. Also, I am still of the opinion that the dataset is too small to have real impact from user feedback. What can be done as feedback, however, is to add corrections saying "never link this" and "link that", which the author can incorporate via LaTeX macros (\nolink and \pmlinkname for example).

> (I'm not yet sure how we would actually use this sort of feedback -- I know you've wanted to stay clear of individual-term-level curation, but it seems worth throwing this out there.)

I am all for individual per-term curation. What I have been reluctant to support is the per-class curation that steers entire MSC classes via the link policies. I think that mechanism is too abstract for authors to control reliably (but that's just an intuition).

> A related feature would be to specify a threshold of confidence, so that we class links by confidence, but only show the more dubious ones if they are explicitly requested by the user. A sort of auto-adjustable NNexus glasses "prescription" (so to speak).

Yeah, that would be cool actually... adjusting the confidence level to a value you feel best about. Ideally doing that at view time... And maybe saving an average threshold per-user from what their usual selections end up being (as a Planetary feature). Expert users would only want very relevant terms, novices would like links everywhere (I suspect).

This can be approached in two ways:

Clearly 1 is more feasible in the short run. I would like to think that my code is structured enough to be easily portable to other languages, if that ends up preferable in the long run. But downloading a 10-20 MB database in a JS script might be a bit too much for casual users; the web server hides all that.

holtzermann17 commented 11 years ago

> Downloading a 10-20 MB database in a JS script might be a bit too much for casual users; the web server hides all that.

I'm reminded of the discussions I had with @jucovschi when he was first specifying features for the real-time editor and "math bots". If I'm remembering correctly, we had in mind a very similar application -- technical termspotting. Almost surely we should not be downloading the database to the client -- it was this idea that initially motivated real-time text analysis via "bots". I think the same paradigm applies for interactions outside of the editor. These aren't small feature requests for Planetary or whatever other client -- but "real-time in the browser" interactions seem to be more and more popular... and maybe we can think about a general-purpose "NNexus client" library (much smaller than 1MB) that would provide an API that could be used w/in Planetary or the editor (or whatever).

dginev commented 11 years ago

A NNexus client-side library is a very good idea, ideally as a JOBAD module.

dginev commented 11 years ago

Could you check whether the article has "acceptable" links in light of our discussion, and close the issue if so? You can open a new one for any other article that draws your attention. I will add the Ham Sandwich example to the test suite in the meantime.

holtzermann17 commented 11 years ago

Linking seems fine to me.