dginev / nnexus

Auto-linking for Mathematical Concepts for PlanetMath.org, Wikipedia, and beyond.
MIT License

Revisiting Invalidation #16

Closed holtzermann17 closed 11 years ago

holtzermann17 commented 11 years ago

Some people are (rightly?) expecting to see automatic links appear right away, which seems like a case for a PUSH API. http://planetmath.org/node/87298#comment-19357

dginev commented 11 years ago

Right. That is a feature we must enable for sure.

I think I figured out how I can have my cake and eat it too, btw, so once my backend work is done, the push API will shortly follow.

dginev commented 11 years ago

My hand is better, so back to work. I am looking at the invalidation code of the old NNexus and I am going to essentially throw it out the window - it is a wonderful example of tightly coupled code that relies on NNexus having all articles stored inside it (or at least that's the impression I'm getting; the code is not easy to read).

What I have already solved (conceptually) is invalidating in the negative direction -- i.e. once an existing concept is altered or deleted, any URL that was previously linked to that concept will be marked as needing a re-linking and returned by the service.
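Roughly, in SQL terms - assuming a hypothetical bookkeeping table (object_concept_links, not part of the current schema; all names illustrative) that records which object was linked to which concept:

-- Hypothetical bookkeeping table: one row per (linked object, concept) pair,
-- recorded whenever NNexus auto-links an object.
CREATE TABLE object_concept_links (
  objectid int(11) NOT NULL,   -- the object that received the links
  conceptid int(11) NOT NULL   -- the concept it was linked to
);

-- Negative invalidation: when a concept is altered or deleted,
-- every object that was linked to it needs re-linking.
SELECT DISTINCT objectid FROM object_concept_links
  WHERE conceptid = :changed_conceptid;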

The hard part is invalidating in the positive direction, i.e. a new concept was just introduced and now we want to re-link all previously linked articles that contain it. From a brute-force perspective, this means all phrases from linked documents need to be indexed internally by NNexus. A slightly less brute-force method is to do a basic term-likelihood analysis (e.g. computing mutual information over the corpus of indexed documents). The original code did something along those lines as well, indexing only likely phrases. Ideally, in the long run there will be a proper term-likelihood analysis that provides all possible term-like phrases, and we can threshold at some probability to obtain dangling links (high probability) or just index them in the DB for possible future invalidation (medium probability).
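A sketch of the basic version, assuming a hypothetical phrase-index table (again, all names are illustrative, not the current schema):

-- Hypothetical index of term-like phrases seen in already-linked documents,
-- kept only when the phrase passes some likelihood threshold.
CREATE TABLE indexed_phrase (
  objectid int(11) NOT NULL,      -- document the phrase occurred in
  firstword varchar(50) NOT NULL, -- first word, mirroring the concept table
  phrase varchar(255) NOT NULL,
  likelihood real NOT NULL        -- e.g. a mutual-information score in [0,1]
);

-- Positive invalidation: a new concept was just defined; find all
-- previously linked documents that contain it and need re-linking.
SELECT DISTINCT objectid FROM indexed_phrase
  WHERE firstword = :new_firstword AND phrase = :new_concept;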

I will implement the basic version of this mechanism now.

holtzermann17 commented 11 years ago

Oof. Sounds complicated!

dginev commented 11 years ago

I would be super happy to invent something simpler :) But change management and loose coupling aren't best friends for sure.

holtzermann17 commented 11 years ago

It could potentially be useful to get comments from Michael Kohlhase about whether this integrates meaningfully with his concept of the Glossary project. Last time I was in Bremen, he said he didn't think so. But somehow I'm viewing the NNexus data structure as the glossary. (I know, you told me I have hammer syndrome about NNexus.) Maybe it would be useful to have some technical docs w/ pictures that show just what the NNexus data structure is? Maybe I just think it's complicated because I'm trying to cram too much into my mental picture.

dginev commented 11 years ago

On talking to Michael: We should definitely do that; I will send him a link to this issue and he'll probably add some comments. The glossary business is itself confusing, and (finally!) this relates to the symbol grounding problem at the heart of my PhD proposal.

( wow, I actually am connecting to my PhD proposal, this must mean I'm getting somewhere! )

What I know Michael agrees on is that we have a document-centric, corpus-based view on knowledge in all KWARC work, including LaMaPUn, NNexus, PlanetMath, sTeX, etc. That entails that a "definition" is first and foremost a real-world (well, real-virtual-world) resource, identifiable by some URI. In the NNexus case, the granularity is such (document-level metadata) that a "concept" is in practice the same resource as the "definition" and the containing "document". So we have a coarse-grained opaque resource that supposedly contains all the fine-grained distinctions.

Michael's idea of a glossary might be more fine-grained, e.g. in sTeX concepts and their definitions are much nicer to point to and much more directly individual resources, because we have fine-grained semantics.

But apart from that distinction, the basic notions should be aligned. The implementation might vary of course - I told you before that my first priority is to make NNexus good at what it is already known to be good at (but with a sane code base). I'll email Michael now and continue brainstorming below.

dginev commented 11 years ago

So, deciding how to do invalidation (change management, really) is entirely grounded in how we model "concepts" in NNexus.

I mentioned in another ticket that, not surprisingly, what I have already modelled is close to the Wikidata data model. Essentially, a "concept" is synonymous with "concept definition" in NNexus, and is internally identifiable by a table row in the database, carrying the following information (quoting the code docs):

A 'concept' has a 'firstword', belongs to a 'category' (e.g. 10-XX) with a certain 'scheme' (e.g. MSC), and is defined at a 'link', obtained while traversing an object known via 'objectid'. The concept inherits the 'domain' of the object (e.g. PlanetMath). The distinction between link and objectid allows for a level of indirection, e.g. in DLMF, where we would obtain the 'link's that define concepts while at a higher (e.g. index) webpage, and only the latter would be registered in the object table. The reindexing should be driven by the traversal process, while the linking should use the actual obtained URL for the concept definition.

CREATE TABLE concept (
  conceptid integer primary key AUTOINCREMENT,
  firstword varchar(50) NOT NULL,
  concept varchar(255) NOT NULL,
  category varchar(10) NOT NULL,
  scheme varchar(10) NOT NULL DEFAULT 'msc',
  domain varchar(50) NOT NULL,
  link varchar(2053) NOT NULL,
  objectid int(11) NOT NULL
);

That allows for a lot of flexibility, in that a single object (= web resource, e.g. a PM article) can define many concepts in various categories, and it avoids conflicts between matching definitions from different domains (Wikipedia's integral vs. PlanetMath's integral).
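To illustrate with made-up rows (the ids, category and URLs are just illustrative):

-- Two definitions of "integral" from different domains can coexist,
-- distinguished by their 'domain' and 'objectid':
INSERT INTO concept (firstword, concept, category, scheme, domain, link, objectid)
  VALUES ('integral', 'integral', '26-XX', 'msc',
          'PlanetMath', 'http://planetmath.org/integral', 42);
INSERT INTO concept (firstword, concept, category, scheme, domain, link, objectid)
  VALUES ('integral', 'integral', '26-XX', 'msc',
          'Wikipedia', 'http://en.wikipedia.org/wiki/Integral', 43);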

Now, the question pertaining to change management is: should link, objectid, domain, scheme and category be mandatory fields? If they are made optional, this table will be able to accommodate "dangling" concepts - phrases we suspect might be concepts but have no definition for (yet). If I take that step, I would also add a "confidence" column storing the [0,1] probability that the recorded concept is really a valid one. That's the most general extension to the data model that comes to mind.
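In schema terms, the general extension would look roughly like this (a sketch, not a final design):

-- Sketch of the generalized table: definition-specific fields become
-- optional, and a confidence column admits dangling concepts.
CREATE TABLE concept (
  conceptid integer primary key AUTOINCREMENT,
  firstword varchar(50) NOT NULL,
  concept varchar(255) NOT NULL,
  category varchar(10),                 -- NULL for a dangling concept
  scheme varchar(10) DEFAULT 'msc',
  domain varchar(50),
  link varchar(2053),                   -- NULL until a definition is found
  objectid int(11),
  confidence real NOT NULL DEFAULT 1.0  -- [0,1]; 1.0 = defined concept
);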

Alternatively, the "seemingly likely" concepts can be stored in a separate table (or tables), consulted only when the positive part of the invalidation process is performed. That might end up more efficient, at the expense of a less general data model.

dginev commented 11 years ago

Another relevant note is that I am making the invalidation (change management) more fine-grained than it used to be. In the previous implementation the relation was object-to-object (i.e. web page to web page), while we can easily make it object (the page receiving the new links) to concept(s) (the concepts in the links).

Then, when a single concept is removed or renamed in an indexed PlanetMath article, we will only re-link the articles that actually link to that concept, rather than those that link to any concept on that page. The impact is most significant in DLMF, where we index, well, the DLMF index pages, which store definitions for ~10 or more concepts each. It might not make a significant difference for PlanetMath, which is ok; after all, we are decoupled now.
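In query terms, reusing the illustrative object_concept_links table from the sketch above:

-- Old, object-to-object granularity: re-link everything that links to
-- ANY concept defined by the changed object (e.g. a whole DLMF index page).
SELECT DISTINCT l.objectid FROM object_concept_links l
  JOIN concept c ON c.conceptid = l.conceptid
  WHERE c.objectid = :changed_objectid;

-- New, object-to-concept granularity: re-link only the articles that
-- link to the one concept that was actually removed or renamed.
SELECT DISTINCT objectid FROM object_concept_links
  WHERE conceptid = :changed_conceptid;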

dginev commented 11 years ago

But I think this is overall the right distinction:

- defined concepts: phrases whose definition we know, at a 'link' in some indexed domain
- dangling (candidate) concepts: phrases we suspect might be concepts but have no definition for yet

And the processes that operate with concepts are:

- indexing: traversing a domain's objects and registering the concepts they define
- linking: annotating a document's phrases with links to concept definitions
- invalidation: change management in both the negative and positive directions, as discussed above

I will ponder on this some more and see if I am missing something...

dginev commented 11 years ago

I have decided to make separate tables ("concept" and "candidate") and separate processing classes for the defined and dangling concepts, so that I don't get confused and can make sure the initial implementation is efficient.
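Roughly along these lines (exact columns still to be decided):

-- Sketch of the separate candidate table for dangling concepts,
-- mirroring the concept table minus the definition-specific fields.
CREATE TABLE candidate (
  candidateid integer primary key AUTOINCREMENT,
  firstword varchar(50) NOT NULL,
  candidate varchar(255) NOT NULL,
  objectid int(11) NOT NULL, -- document the candidate phrase occurs in
  confidence real NOT NULL   -- [0,1] likelihood it is a real concept
);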

Once things stabilize I will revisit the idea of a common data model.

( I should remind myself to move all relevant parts of the discussion here into the Manual )

dginev commented 11 years ago

renamed ticket to "Revisiting Invalidation"

dginev commented 11 years ago

This ticket ended up very close in essence to #12; they should be co-requisites.

dginev commented 11 years ago

The plan of attack is now clear, and the invalidation (change management) of already indexed concepts is operational (see t/05 for an example). The rest is scheduled for the June release; see #27.

Closing here.