dginev / nnexus

Auto-linking for Mathematical Concepts for PlanetMath.org, Wikipedia, and beyond.

MIT License

18 stars, 3 forks

Index SMGLO #49

Closed (jukowski closed 5 years ago)

jukowski commented 10 years ago

It would be nice to have the glossary from http://mathhub.info/smglom/smglom indexed, so that NNexus can be integrated into the online editors and the jEdit plugin that SMGLO authors use.

dginev commented 10 years ago

I will try to sneak this in for the 2.3 release (scheduled for the end of 2013).

Could you let me know how I can "mine" the concepts from that link? E.g. are there HTML pages that can be crawled for each entry? I can't seem to immediately see the order in the site.

The way NNexus treats concepts, each one is defined in terms of a URL, so I need a web-accessible definition that I can both crawl and then point to when doing the linking. Thanks!

Maybe @kohlhase also has pointers?
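To make the "crawl, then point to" idea above concrete, here is a minimal sketch of phrase-to-URL linking. It is not NNexus code: the `autolink` function, the `CONCEPT_INDEX` dictionary and the example.org URLs are all hypothetical, and real auto-linking would also need to handle morphology, ambiguity and existing markup.

```python
import html
import re

# Hypothetical index: lowercase phrase -> URL of the page that defines it.
# These entries are made up for illustration; they are not NNexus data.
CONCEPT_INDEX = {
    "abelian group": "https://example.org/concepts/abelian-group",
    "group": "https://example.org/concepts/group",
}


def autolink(text, index):
    """Wrap each known phrase in an HTML anchor pointing at its definition."""
    if not index:
        return html.escape(text)
    # Try longer phrases first so "abelian group" wins over "group".
    phrases = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between two matches.
        parts.append(html.escape(text[last:match.start()]))
        url = index[match.group(1).lower()]
        parts.append(
            f'<a href="{html.escape(url)}">{html.escape(match.group(1))}</a>'
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Calling `autolink("Every abelian group is a group.", CONCEPT_INDEX)` links both phrases to their definition pages, which only works because every entry in the index has a URL to point at.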

kohlhase commented 10 years ago

I am afraid that mining the SMGLoM is not that simple, since the "words" are quite deeply encoded. We should probably generate a word list (with URIs) from MMT. What format would you like it in? I will talk with Mihnea about this on Monday.

kohlhase commented 10 years ago

But if you really want to mine, there are two formats: sTeX (there the words are in \defi, \defii, \adefi, ...; see the description in the sTeX manual for statements.dtx).

dginev commented 10 years ago

NNexus auto-linking is all about the "links": if there are no HTML pages that present the concepts in SMGLO, then there is no point in indexing it with NNexus in the first place.

If you can generate HTML for every entry and either have it consistently organized (the way Wikipedia and DLMF do) or alternatively RDFa-enriched (like PlanetMath and MathWorld are), then there is added value in indexing it with NNexus. Linking to TeX files sounds underwhelming.
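As an illustration of what "RDFa-enriched" pages would offer an indexer, the sketch below pulls the text of elements carrying a given RDFa `property` attribute out of an HTML page, using only Python's standard library. The property name `dct:title`, the `ConceptExtractor` class and the `extract_concepts` helper are assumptions made for this example; they are not the vocabulary that PlanetMath or MathWorld actually use, nor the NNexus indexer.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class ConceptExtractor(HTMLParser):
    """Collect the text of elements whose RDFa `property` equals `prop`."""

    def __init__(self, prop):
        super().__init__()
        self.prop = prop
        self.concepts = []
        self._depth = 0  # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("property") == self.prop:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace in the collected text.
            self.concepts.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_concepts(page_html, page_url, prop="dct:title"):
    """Return (phrase, URL) pairs for one page; `prop` is a hypothetical choice."""
    parser = ConceptExtractor(prop)
    parser.feed(page_html)
    parser.close()
    return [(phrase, page_url) for phrase in parser.concepts if phrase]
```

The sketch assumes well-nested markup and a single-valued `property` attribute. The point is only that annotated HTML lets a crawler recover the phrase together with the URL it should link to.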

kohlhase commented 10 years ago

Deyan, I have the feeling that you misunderstand the way Constantin wants to use NNexus. He wants to use it in editing for generating termrefs, and that is in sTeX sources, so in essence he wants to reference sTeX sources.

But we (Mihnea) can indeed generate HTML consistently, and eventually this should be used with NNexus, but I think we should wait a bit longer before we do that.

dginev commented 10 years ago

Well, do we agree that parsing TeX is a bad idea in principle? Especially when you have an OMDoc and HTML export?

If so, what you need is to preserve the information you want to index in the format of choice. So far NNexus can only index HTML files, but it should be possible to index OMDoc just as easily.

Another thing is that the database schema for NNexus concepts has already been fixed. As I mentioned, NNexus sees concepts internally as a tuple (natural language word/phrase, URL), possibly with additional category information (MSC class, etc.).

If you think you can work with that setup, I'd be happy to help with the indexing. Otherwise, you probably want your own tool rather than NNexus.
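For readers who want to check whether their data fits that setup, here is a minimal sketch of the tuple described above as a Python record. The `Concept` class, its field names and the `make_concept` helper are hypothetical and do not reflect the actual NNexus database schema; the MSC class in the usage note is only illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Concept:
    phrase: str  # natural language word or phrase
    url: str  # web-accessible page that defines the concept
    categories: Tuple[str, ...] = ()  # optional category info, e.g. MSC classes


def make_concept(phrase, url, categories=()) -> Optional[Concept]:
    """Return a Concept, or None if the entry has no phrase or no web URL."""
    phrase = " ".join(phrase.split())  # normalize whitespace
    if not phrase or not url.startswith(("http://", "https://")):
        return None
    return Concept(phrase, url, tuple(categories))
```

Under these assumptions, `make_concept("abelian group", "https://example.org/abelian-group", ["20Kxx"])` yields a usable record, while an entry that only points at a local TeX file yields `None`, which mirrors the requirement that every concept have a linkable URL.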

jukowski commented 10 years ago

How can we make progress on this? The Local Math Hub (lmh) tool can generate OMDoc and HTML. I will send the OMDoc files by email. If there are easy steps I need to perform, I could also do them myself...

dginev commented 10 years ago

It needs to be served on the web, so that a crawler can index it, as I mentioned before. If you can get to that stage, I can write a small indexer (given there is indexable content on your pages).

dginev commented 5 years ago

Maybe in its next incarnation...