dginev / nnexus

Auto-linking for Mathematical Concepts for PlanetMath.org, Wikipedia, and beyond.
MIT License

Change management #12

Closed dginev closed 11 years ago

dginev commented 11 years ago

After the port is complete, we should put some change management in place.

NNexus 1.0 uses an invalidation index that allows it to instantly discover and report invalidation entries, at the price of a tight integration with Noosphere and a very heavy-weight data storage (it essentially carries a redundant copy of the PlanetMath sources).

Once we have the PULL API in place, a more loosely integrated change management approach should be considered, possibly based on caching term footprints of the scanned articles.

dginev commented 11 years ago

I have read some more on the invalidation index that the old NNexus has. In essence, each article is cached in the backend, together with a full term footprint of it (I am not 100% clear on the exact specifics). The claim in the papers is that only "likely" future terms are cached, i.e. no phrase of great size will be recorded. The method hinted at has to do with the frequency of occurrences in the indexed corpus.

Here are my thoughts on getting something along those lines into the NNexus Reloaded codebase:

Basic idea: make the response to an indexing request a list of URLs that should be considered for re-linking. As we are loosely integrated, the final call whether something should be re-linked should be in Planetary and should be done on-demand.
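To make the basic idea concrete, here is a minimal sketch of an indexing call that answers with re-linking candidates. Everything in it is an assumption made for illustration, not the NNexus implementation: the `InvalidationIndex` class, the `term_footprint` helper, the word n-gram footprint, and the cap of three words per phrase (the papers only say that long phrases are not recorded).

```python
from collections import defaultdict

MAX_PHRASE_WORDS = 3  # assumed cap; the source does not give a number


def term_footprint(text, max_words=MAX_PHRASE_WORDS):
    """Return the set of word n-grams (1..max_words words) occurring in text."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    }


class InvalidationIndex:
    """Hypothetical index: maps each cached phrase to the URLs containing it."""

    def __init__(self):
        self._urls_by_phrase = defaultdict(set)

    def cache_article(self, url, text):
        # Drop any stale footprint for this URL before re-caching it.
        for urls in self._urls_by_phrase.values():
            urls.discard(url)
        for phrase in term_footprint(text):
            self._urls_by_phrase[phrase].add(url)

    def index_concepts(self, concepts):
        """Return the URLs whose footprint mentions any newly indexed concept."""
        candidates = set()
        for concept in concepts:
            key = " ".join(concept.lower().split())
            candidates |= self._urls_by_phrase.get(key, set())
        return sorted(candidates)


index = InvalidationIndex()
index.cache_article("https://example.org/a", "Every abelian group is a module")
index.cache_article("https://example.org/b", "A ring is a set with two operations")
print(index.index_concepts(["abelian group"]))
```

The caller (Planetary, in this scenario) only receives the candidate URLs and stays free to decide whether and when to re-link them. Note that `cache_article` scans the whole index to drop a stale footprint, which is fine for a sketch but is exactly the part that would need care to be efficient and scalable.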

Tricky:

(re-)caching the terms of auto-linked articles, in an efficient and scalable manner.

I am wondering whether the "chained concept hash" should be kept from the original implementation as-is, whether it should hold only the indexed terms, or also the possible/dangling terms. Actually possible and dangling terms are yet another distinction. So we need three categories of phrase-like structures:

In any case, we should probably recommend a cron job that re-links the entire content of a site (PlanetMath or other) on a weekly/monthly basis, so that the just-in-time invalidation isn't the only refresh mechanism. But it should be reliable enough ... Maybe it already is?

dginev commented 11 years ago

One important Planetary use case would be re-indexing a freshly modified article. The workflow then would be:

One tricky thing to keep in mind is synonyms. Maybe the most straightforward approach to dealing with them is to compile them down to independent concepts of their own. That should do the trick.
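A rough sketch of what "compiling down" could look like, assuming a concept arrives as a dictionary with `name`, `url` and an optional `synonyms` list. These field names and the `compile_synonyms` helper are hypothetical and only serve the illustration.

```python
def compile_synonyms(concept):
    """Expand one concept into independent entries, one per name or synonym."""
    names = [concept["name"], *concept.get("synonyms", [])]
    # Each synonym becomes a stand-alone concept pointing at the same URL.
    return [{"name": name, "url": concept["url"]} for name in names]


entries = compile_synonyms({
    "name": "abelian group",
    "synonyms": ["commutative group"],
    "url": "https://example.org/AbelianGroup",
})
for entry in entries:
    print(entry["name"], "->", entry["url"])
```

With this shape, the concept index and the invalidation lookup never need to know that synonyms exist; they only ever see plain concepts.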

dginev commented 11 years ago

More thoughts on the invalidation and concept indices.

dginev commented 11 years ago

Oh, regarding whether to have a 15-million-entry table or to compute the MSC distance on the fly: the answer is probably somewhere in the middle. Once computed on the fly, the distances can be cached in the database. This way we don't create bloat for pairs of categories that will rarely be compared, we keep the table small, and we still cache the most common comparisons.
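The compute-then-cache middle ground might look roughly like the sketch below. It uses SQLite purely for illustration; the table layout, the `MscDistanceCache` class and especially `compute_msc_distance` are assumptions. The latter is a placeholder based on shared leading characters of the category code, not the actual NNexus metric.

```python
import sqlite3


def compute_msc_distance(a, b):
    """Placeholder metric, NOT the NNexus one: shorter shared prefix = farther."""
    if a == b:
        return 0
    if a[:3] == b[:3]:
        return 1
    if a[:2] == b[:2]:
        return 2
    return 3


class MscDistanceCache:
    """Computes distances lazily and stores only the pairs actually requested."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS msc_distance ("
            "a TEXT NOT NULL, b TEXT NOT NULL, distance INTEGER NOT NULL, "
            "PRIMARY KEY (a, b))"
        )
        self.computed = 0  # how many times the slow path ran

    def distance(self, a, b):
        a, b = sorted((a, b))  # the metric is symmetric: store each pair once
        row = self.db.execute(
            "SELECT distance FROM msc_distance WHERE a = ? AND b = ?", (a, b)
        ).fetchone()
        if row is not None:
            return row[0]
        value = compute_msc_distance(a, b)
        self.computed += 1
        self.db.execute(
            "INSERT INTO msc_distance (a, b, distance) VALUES (?, ?, ?)",
            (a, b, value),
        )
        self.db.commit()
        return value


cache = MscDistanceCache()
cache.distance("11A41", "11B05")  # computed, then stored
cache.distance("11B05", "11A41")  # served from the table
print(cache.computed)
```

Sorting the pair before the lookup halves the number of rows, and the table only ever grows with the comparisons that real linking requests trigger.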

dginev commented 11 years ago

This ticket ended up very close in essence to #16; they should be co-requisites.

dginev commented 11 years ago

The plan of attack is now clear, and the invalidation (change management) of already indexed concepts is operational (see t/05 for an example). The rest is scheduled for the June release, see #27.

Closing here.