brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Generate RDFa instead of idiosyncratic CSS classes for metadata like authorship #102

Closed clange closed 14 years ago

clange commented 14 years ago

[Originally Ticket 1426]

In an e-mail thread on combining a FOAF-like semantic web tagging machinery (developed by Alexander García Castro, Uni Bremen) with LaTeXML/LaMaPUn/arXMLiv, Deyan wrote:

> The information (about author metadata) is already present in our intermediate XML representations, by the time it reaches XHTML it is transformed into a non-standard division-based scheme. E.g. you would typically find an author into a
>
> element, etc. We have our own branch of LaTeXML and in principle it should be a straightforward change to replace this division-based scheme with an RDFa one, if you find it necessary. Bruce could also be interested in this, given he endorses ubiquitous RDFa.

I do not recommend using classes as default, actually I wouldn't use them at all. This microformat-like syntax would only be required if we strictly depended on HTML 4 compatibility. But our output is XHTML, so we can technically afford RDFa. Semantically, RDFa would be a great benefit anyway, as it allows us to refer to standard vocabularies for modeling e.g. authorship, such as Dublin Core.

OK, if our target is to be HTML5 we should not forget Microdata, if they "win" against RDFa in the ongoing conflict; see http://kwarc.info/blog/2009/10/28/microdata-vs-rdfa/. But it would be easy to switch, and maybe there should even be both output options, Microdata for modern HTML browsers, and RDFa for semantic web architectures.
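To make the Dublin Core idea concrete, here is a minimal sketch in Python of what an RDFa author annotation could look like. It is not LaTeXML output: the function name `author_as_rdfa`, the wrapping `div`/`span` elements and the example document URI are assumptions chosen for illustration; only the `dc:creator` property comes from the Dublin Core vocabulary mentioned above.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"


def author_as_rdfa(name, document_uri):
    """Return an XHTML fragment saying that `document_uri` has creator `name`.

    Hypothetical helper: the element names are illustrative, not what
    LaTeXML's stylesheets emit.
    """
    div = ET.Element("div", {
        "xmlns:dc": DC_NS,      # declares the dc: prefix used below
        "about": document_uri,  # subject of the RDFa statement
    })
    span = ET.SubElement(div, "span", {"property": "dc:creator"})
    span.text = name            # the literal object of the statement
    return ET.tostring(div, encoding="unicode")


print(author_as_rdfa("Paul Erdős", "http://example.org/paper"))
```

An existing class attribute could stay on the same element next to `property`, so the two schemes would not have to exclude each other.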

@Bruce, is there anything that holds against RDFa output of that information?

brucemiller commented 14 years ago

I'm not sure we're all on the same page here: a notion of "class" is used in LaTeXML at roughly 2 levels. At the first level, within latexml proper, it is used as somewhat of a microformat, particularly (1) for distinguishing different instances of a markup element, where introducing a new element would be too "heavy". For example, there's a theorem element, but class is used to distinguish all the variations (lemma, etc). But note that the word "lemma" is coming from a deftheorem declaration that defines lemma as a new environment.

Then there are the classes used in xhtml, which are also microformat oriented, being the combination of any existing classes (see above), and also classes added by the XSLT that (2) record which latexml element generated the html element (e.g. a theorem becomes an html div, so a lemma environment will have class='theorem lemma').
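The combination described here can be sketched in a few lines of Python. This only illustrates the rule as stated in this comment (element name first, then any classes already present); `xhtml_classes` is a hypothetical name, and the real logic lives in LaTeXML's XSLT, which is not reproduced here.

```python
def xhtml_classes(element_name, source_classes=()):
    """Join the LaTeXML element name with the classes already on the element."""
    ordered = [element_name, *source_classes]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return " ".join(dict.fromkeys(ordered))


print(xhtml_classes("theorem", ["lemma"]))  # theorem lemma
```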

I don't want to change (2) since there's a direct correspondence to the element names, and also for (1) the names used directly correspond to source markup that it seems to me should be preserved.

Yet, conceivably the vocabulary used in some places could be adapted to be more rdf friendly, and i can see that as being potentially useful. Do you have concrete suggestions of places where better choices could be made?

clange commented 14 years ago

Oops, I forgot to reply to this ticket in time. Please allow me to do it nevertheless. Replying to comment 1 @brucemiller:

> I'm not sure we're all on the same page here: a notion of "class" is used in LaTeXML at roughly 2 levels. At the first level, within latexml proper, it is used as somewhat of a microformat, particularly (1) for distinguishing different instances of a markup element, where introducing a new element would be too "heavy".

Anything that is a microformat or any other proprietary way of lightweight semantic markup already can more adequately be replaced by RDFa. The only disadvantage of RDFa compared to microformats is that it is a bit harder to write, but we don't have that problem here, as we generate the XHTML automatically.

> For example, there's a theorem element, but class is used to distinguish all the variations (lemma, etc). But note that the word "lemma" is coming from a deftheorem declaration that defines lemma as a new environment.

I see. Using RDFa with the OMDoc ontology would still work, but require some heuristics. Regardless of where they are applied (I don't know the LaTeXML architecture well enough to decide on that – @Deyan, @Michael, can you help me to translate my language into Bruce's?), be it in LaTeXML or in a postprocessing XSLT, it could be something like:

If the environment has been declared via deftheorem, then

If everything stays as it is now, it will also be feasible to implement an XSLT that post-processes class="theorem lemma" into class="theorem lemma" typeof="http://omdoc.org/ontology#Lemma".
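The post-processing step could look roughly like the sketch below. It is written in Python rather than XSLT so that it is self-contained, and it is only an illustration under assumptions: the lookup table `CLASS_TO_CONCEPT` and the function `add_typeof` are hypothetical, and the only concept URI taken from this thread is `http://omdoc.org/ontology#Lemma`.

```python
import xml.etree.ElementTree as ET

OMDOC = "http://omdoc.org/ontology#"

# Hypothetical lookup table: only class names listed here get annotated,
# so an environment with an unknown or merely presentational name is left alone.
CLASS_TO_CONCEPT = {"lemma": OMDOC + "Lemma"}


def add_typeof(root):
    """Add typeof to elements whose class has 'theorem' plus a known variant.

    Existing class attributes are kept unchanged. Returns the number of
    elements that were annotated.
    """
    changed = 0
    for elem in root.iter():
        classes = elem.get("class", "").split()
        if "theorem" not in classes or elem.get("typeof") is not None:
            continue
        for name in classes:
            concept = CLASS_TO_CONCEPT.get(name)
            if concept:
                elem.set("typeof", concept)
                changed += 1
                break
    return changed


doc = ET.fromstring(
    '<body><div class="theorem lemma">...</div>'
    '<div class="theorem foo">...</div></body>'
)
count = add_typeof(doc)
print(count, ET.tostring(doc, encoding="unicode"))
```

Keeping the mapping in an explicit table makes the heuristic easy to audit: adding a new environment name is a deliberate decision, not a side effect of whatever name an author happened to choose.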

brucemiller commented 14 years ago

Didn't mean to cut off discussion; just shorten the Nag list I get from trac! :>

I'm certainly for semantically enriched LaTeX markup, but there's only so much we can do from within the standard classes. Again with the theorem example, there's a macro to define a theorem environment; you can define a theorem-like environment with any name you want. It need not be as suggestive as "lemma", and in fact we don't even know if the author had primarily presentation or semantics in mind when he/she defined the environment! let alone whether the name corresponds to something defined in rdf.

So, I think we're together on the point that a (mostly) presentational class should be derived much like it is now, but that allowance for rdf markup can be made as well. sTeX would likely do that from within the markup, while other heuristics could be applied, as you suggest, in postprocessing or analysis stages.

Is any particular support for this needed? What would it look like?

clange commented 14 years ago

Replying to comment 4 @brucemiller:

> Didn't mean to cut off discussion; just shorten the Nag list I get from trac! :>

I fully understand that. It was me after all who initiated those nag lists ;-)

> I'm certainly for semantically enriched LaTeX markup, but there's only so much we can do from within the standard classes.

Of course – my motivation is to expose all semantic information that we can get from the LaTeX input (but not more than that!) in the resulting XHTML+MathML. The main corpus I'm targeting is arXMLiv. In particular, I'm not talking about sTeX here. We can already generate XHTML+MathML+RDFa from sTeX (via OMDoc).

> Again with the theorem example, there's a macro to define a theorem environment; you can define a theorem-like environment with any name you want.

Sure, I know that.

> It need not be as suggestive as "lemma", and in fact we don't even know if the author had primarily presentation or semantics in mind when he/she defined the environment!

Good point. Maybe we should involve Deyan here: @Deyan, let me compare the question whether we should assume that a theorem-like environment named "lemma" really is a lemma (in the sense of http://omdoc.org/ontology#Lemma) to LaTeXML's heuristic generation of content math markup. For LaMaPUn you disabled the latter heuristics in favor of a more thorough linguistic analysis, in order to figure out, e.g., whether an element of a formula is a variable. What would you think about a "lemma" environment defined by an author using \newtheorem? Judging from the arXiv documents, do authors often define theorem-like environments rather for presentational purposes, or do they have semantics in mind? Can we use "duck typing" here? (If it walks like a lemma …) Or should we rather be more cautious and leave the identification of lemmas to linguistic analysis? (Note that "lemma" is just an example, actually I'm talking about anything that can be written in standard LaTeX and that could be mapped to some concept from the OMDoc ontology.)

> let alone whether the name corresponds to something defined in rdf.
>
> So, I think we're together on the point that a (mostly) presentational class should be derived much like it is now, but that allowance for rdf markup can be made as well. sTeX would likely do that from within the markup, while other heuristics could be applied, as you suggest, in postprocessing or analysis stages.

@Deyan, I may need your help here. As I said before, I have no idea about what would be the best place for implementing this.

> Is any particular support for this needed? What would it look like?

I have no idea. For a better understanding, let me outline the application I have in mind:

I would like the arXMLiv documents to be automatically annotated with lightweight RDFa, so that we, or existing engines, can crawl them. To show you that there is already some existing infrastructure for that, I have added a bit of RDFa to my homepage:

I would like such basic semantic annotations to be available in the XHTML documents generated from arXiv. Then we would be able to answer queries like "how many lemmas do the documents authored by a person named 'Paul Erdős' contain?" In SPARQL:

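PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX oo: <http://omdoc.org/ontology#>
# ?lemma must be bound to a named variable, otherwise there is nothing to count.
SELECT (COUNT(?lemma) AS ?lemmaCount) WHERE {
  ?document dc:creator "Paul Erdős" ;
    oo:hasPart ?lemma .
  ?lemma a oo:Lemma .
}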
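To show what the query is meant to compute, here is a small Python sketch that evaluates the same pattern over a hand-written list of triples. It is not a SPARQL engine, and the data is made up for illustration: the document and fragment identifiers and the `Theorem` concept are assumptions, while `dc:creator`, `hasPart` and `Lemma` are the terms used in the query.

```python
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OO = "http://omdoc.org/ontology#"

# Made-up (subject, predicate, object) triples, standing in for what a
# crawler might extract from RDFa-annotated documents.
TRIPLES = [
    ("doc1", DC_CREATOR, "Paul Erdős"),
    ("doc1", OO + "hasPart", "doc1#lem1"),
    ("doc1", OO + "hasPart", "doc1#lem2"),
    ("doc1", OO + "hasPart", "doc1#thm1"),
    ("doc1#lem1", RDF_TYPE, OO + "Lemma"),
    ("doc1#lem2", RDF_TYPE, OO + "Lemma"),
    ("doc1#thm1", RDF_TYPE, OO + "Theorem"),
    ("doc2", DC_CREATOR, "Someone Else"),
    ("doc2", OO + "hasPart", "doc2#lem1"),
    ("doc2#lem1", RDF_TYPE, OO + "Lemma"),
]


def count_lemmas(triples, author):
    """Count lemma parts of all documents whose dc:creator equals `author`."""
    docs = {s for s, p, o in triples if p == DC_CREATOR and o == author}
    lemmas = {s for s, p, o in triples if p == RDF_TYPE and o == OO + "Lemma"}
    return sum(
        1
        for s, p, o in triples
        if s in docs and p == OO + "hasPart" and o in lemmas
    )


print(count_lemmas(TRIPLES, "Paul Erdős"))  # 2
```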

With more semantic annotations, which we can't get from plain LaTeX, but which we might eventually get from LaMaPUn, and which we can already get from sTeX, we will be able to answer more complex queries and enable more services. See https://svn.omdoc.org/repos/jomdoc/doc/pubs/eswc-demo10/gencs-lod.pdf, http://kwarc.info/kohlhase/submit/mkm10-multiform.pdf and http://kwarc.info/kohlhase/submit/iSemantics10.pdf.

But there is a growing community around publishing RDF on the web (see http://linkeddata.org). Even if the arXMLiv documents would only allow for a shallow annotation, I'm sure that their huge number would attract people figuring out further applications.

kohlhase commented 14 years ago

As the markup that latexml creates is very systematic, we should just (at least for the moment) convert it to the RDFa form in the post-processing step (for the arXMLiv corpora for XHTML). Then we can see what the result is worth, and if it is, then Bruce can still decide whether he wants to generate it directly.

I guess if we make the style sheet extension sufficiently modular, then we can also use it to upgrade the NOPARSE versions (if we want the RDFa there).

kohlhase commented 14 years ago

If this suggestion has your support, then we should probably re-open the issue and assign it to Deyan and/or Christoph to make the respective extensions in the style sheets in the arXMLiv branch.

brucemiller commented 14 years ago

If you're asking me, I have no objection. If there's a way of enhancing the output, I'm all for it!