Closed: jukowski closed this issue 5 years ago
I will try to sneak this in for the 2.3 release (scheduled for the end of 2013).
Could you let me know how I can "mine" the concepts from that link? E.g. are there HTML pages that can be crawled for each entry? I can't immediately see how the site is organized.
The way NNexus treats concepts, each concept is defined in terms of a URL, so I need a web-accessible definition which I can both crawl and then point to when doing the linking. Thanks!
Maybe @kohlhase also has pointers?
I am afraid that mining the SMGLoM is not that simple, since the "words" are quite deeply encoded. We should probably generate a word list (with URIs) from MMT. What format would you like it in? I will talk with Mihnea about this on Monday.
But if you really want to mine, there are two formats: sTeX, where the words are in \defi, \defii, \adefi, ... (described in the sTeX manual for statements.dtx).
NNexus auto-linking is all about the "links": if there are no HTML pages that present the concepts in SMGLoM, then there is no point in indexing it with NNexus in the first place.
If you can generate HTML for every entry and either have it consistently organized (the way Wikipedia and DLMF do) or alternatively RDFa-enriched (like PlanetMath and MathWorld are), then there is added value in indexing it with NNexus. Linking to TeX files sounds underwhelming.
Deyan, I have the feeling that you misunderstand the way Constantin wants to use NNexus. He wants to use it in editing for generating termrefs, and that is in sTeX sources, so in essence he wants to reference sTeX sources.
But we (Mihnea) can indeed generate HTML consistently, and eventually this should be used with NNexus, but I think we should wait a bit longer before we actually do that.
Well, do we agree that parsing TeX is a bad idea in principle? Especially when you have an OMDoc and HTML export?
If so, what you need is to preserve the information you want to index in the format of your choice. So far NNexus can only index on top of HTML files, but it should be possible to index OMDoc just as easily.
Another thing is that the database schema for NNexus concepts has already been fixed. As I mentioned, NNexus sees concepts internally as the tuple (natural language word/phrase, URL) and possibly additional categorical information (MSC class, etc).
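To make the (phrase, URL) tuple concrete, here is a minimal sketch in Python. It is not the actual NNexus schema: the `Concept` class, its field names, the `make_concept` helper, and the example URL are all hypothetical and only illustrate the shape of the data described above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Concept:
    """Hypothetical record mirroring the tuple described above."""
    phrase: str                       # natural language word or phrase
    url: str                          # web-accessible definition to link to
    categories: Tuple[str, ...] = ()  # optional categorical info, e.g. MSC classes


def make_concept(phrase: str, url: str, categories: Iterable[str] = ()) -> Concept:
    # Lowercase and collapse whitespace so lookups are insensitive to
    # spacing and capitalization (an assumption, not NNexus behaviour).
    normalized = " ".join(phrase.lower().split())
    return Concept(normalized, url, tuple(categories))


# Illustrative values only; the URL and the category code are placeholders.
example = make_concept("  Abelian   Group ", "http://example.org/abelian-group", ["20K01"])
```

Anything that cannot be expressed as such a record (for instance a term with no stable URL) would not fit this setup.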
If you think you can work with that setup, I'd be happy to help with the indexing. Otherwise, you probably want your own tool rather than NNexus.
How can we make progress on this? The Local Math Hub (lmh) tool can generate OMDoc and HTML. I will send the OMDoc files by email. If there are easy steps I need to perform, I could also do them myself...
It needs to be served on the web, so that a crawler can index it, as I mentioned before. If you can get to that stage, I can write a small indexer (given there is indexable content on your pages).
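As a rough idea of what such a small indexer could look like, here is a sketch using only the Python standard library. It assumes, purely for illustration, that each served page marks its defined terms with `class="definiendum"`; that class name, the `DefinitionExtractor` class, and the `index_page` function are hypothetical and not part of NNexus or of the SMGLoM HTML export.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped when tracking nesting.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DefinitionExtractor(HTMLParser):
    """Collects the text of elements carrying a given marker class."""

    def __init__(self, marker_class="definiendum"):
        super().__init__()
        self.marker_class = marker_class  # hypothetical marker, see above
        self.terms = []
        self._depth = 0    # nesting depth inside a marked element
        self._buffer = []  # text fragments of the current term

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.marker_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace so nested markup does not split the term.
            term = " ".join("".join(self._buffer).split())
            if term:
                self.terms.append(term)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def index_page(html, url):
    """Return (term, url) pairs for every marked term found in the page."""
    parser = DefinitionExtractor()
    parser.feed(html)
    parser.close()
    return [(term, url) for term in parser.terms]
```

A crawler would fetch each entry page and pass its HTML and URL to `index_page`; fetching is left out here because the site layout is not known.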
Maybe in its next incarnation...
It would be nice to have the glossary from http://mathhub.info/smglom/smglom indexed so that NNexus can be integrated into the online editors and the jEdit plugin that authors of SMGLoM can use.