brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

recent LaTeXML conversions cause PDOException under Drupal 7.15 #333

Closed holtzermann17 closed 11 years ago

holtzermann17 commented 11 years ago

[Originally Ticket 1657]

In short, it seems that any non-ASCII characters in the returned expression are likely to cause problems. It is possible to "work around" this by calling utf8_encode(...) on the returned XHTML expression before saving it to the database, but so far the workaround just ends up causing more problems.
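A plausible reason the utf8_encode(...) workaround backfires, sketched in Python: PHP's utf8_encode() treats its input as ISO-8859-1, so applying it to text that is already UTF-8 encodes every byte a second time. The helper name `php_utf8_encode` is made up for this illustration, and the sketch assumes the daemon's reply is already valid UTF-8, which the thread only establishes later.

```python
def php_utf8_encode(data: bytes) -> bytes:
    """Mimic PHP's utf8_encode(): read each byte as ISO-8859-1, write UTF-8."""
    return data.decode("latin-1").encode("utf-8")


# U+1D505 (MATHEMATICAL FRAKTUR CAPITAL B), already encoded as UTF-8.
fraktur_b = "\U0001D505".encode("utf-8")
double = php_utf8_encode(fraktur_b)

print(fraktur_b.hex(" "))  # the original 4-byte sequence
print(double.hex(" "))     # 8 bytes: four 2-byte sequences
print(double.decode("utf-8") == "\U0001D505")  # False: the character is lost
```

The double-encoded string contains only 2-byte sequences, so a database that rejects the original would accept it, but what gets stored is mojibake rather than the Fraktur letter.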

Here's a detailed example. It might be hard to discern just what the problem is from outside of PHP. For the moment, I'm pasting the full error message here in case it offers some clue: http://pastebin.com/4h1FjZeL.

One way to reliably produce a similar error message on Drupal without actually running LaTeXML is to create a "Basic Page", and paste in the "result" that you see in that pastebin. If I can help with debugging at that level just let me know!

Here's a command-line version of exactly what Drupal posts to the LaTeXML daemon (the non-URL-encoded versions of the content follow):

curl 'http://latexml.planetmath.org/convert' -d \
'profile=planetmath&tex=The+%5Cemph%7BH%5C%22older+inequality%7D+concerns+%5Cemph%7Bvector+p-norms%7D%3A+given+%241+%5Cleq+p%24%2C+%24q+%5Cleq+%5Cinfty%24%2C%0A%0A%5Cbegin%7Bdisplaymath%7D%0A++++%5Cmbox%7BIf+%7D%5Cfrac%7B1%7D%7Bp%7D%2B%5Cfrac%7B1%7D%7Bq%7D%3D1%5Cmbox%7B+then+%7D%7Cx%5ETy%7C+%5Cleq+%7C%7C%5C%2Cx%5C%2C%7C%7C_p%7C%7C%5C%2Cy%5C%2C%7C%7C_q%0A%5Cend%7Bdisplaymath%7D%0A%0AAn+important+instance+of+a+H%5C%22older+inequality+is+the+%5Cemph%7BCauchy-Schwarz+inequality%7D.%0A%0AThere+is+a+version+of+this+result+for+the+%5CPMlinkname%7B%24L%5Ep%24+spaces%7D%7BLpSpace%7D.%0AIf+a+function+%24f%24+is+in+%24L%5Ep%28X%29%24%2C+then+the+%24L%5Ep%24-norm+of+%24f%24+is+denoted%0A%24%7C%7C%5C%2Cf%5C%2C%7C%7C_p%24.%0AGiven+a+measure+space+%24%28X%2C%5Cmathfrak%7BB%7D%2C%5Cmu%29%24%2C+if+%24f%24+is+in+%24L%5Ep%28X%29%24+and+%24g%24+is+in+%24L%5Eq%28X%29%24+%28with+%241%2Fp+%2B+1%2Fq+%3D+1%24%29%2C+then%0Athe+H%5C%22older+inequality+becomes%0A%0A%5Cbegin%7Beqnarray%2A%7D%0A%5CVert+fg%5CVert_1+%3D+%5Cint_X+%5Cvert+fg%5Cvert+%5Cmathrm%7Bd%7D%5Cmu+%0A++++++++++++++++++++++%26+%5Cle+%26+%0A%5Cleft%28%5Cint_X%7Cf%7C%5Ep%5Cmathrm%7Bd%7D%5Cmu%5Cright%29%5E%7B%5Cfrac%7B1%7D%7Bp%7D%7D%0A%5Cleft%28%5Cint_X%7Cg%7C%5Eq%5Cmathrm%7Bd%7D%5Cmu%5Cright%29%5E%7B%5Cfrac%7B1%7D%7Bq%7D%7D%5C%5C%0A%26+%3D+%26+%5CVert+f%5CVert_p%5C%2C%5CVert+g+%5CVert_q+%0A%5Cend%7Beqnarray%2A%7D%25%0A%25%25%25%25%25%0A%25%25%25%25%25whywhywhy&preamble=literal:%5Cusepackage%7Bamssymb%7D%0D%0A%5Cusepackage%7Bamsmath%7D%0D%0A%5Cusepackage%7Bamsfonts%7D%0A%5Cpmcanonicalname%7BHolderInequality%7D%0A%5Cpmcreated%7B2012-11-19+22%3A51%3A38%7D%0A%5Cpmmodified%7B2012-11-19+22%3A51%3A38%7D%0A%5Cpmowner%7BPrimeFan%7D%7B1%7D%0A%5Cpmmodifier%7BPrimeFan%7D%7B1%7D%0A%5Cpmtitle%7BH%5C%22older+inequality%7D%0A%5Cpmrecord%7B1%7D%7B30094%7D%0A%5Cpmauthor%7BPrimeFan%7D%7B1%7D%0A%5Cpmtype%7BTheorem%7D%0A%5Cpmcomment%7BTrigger+PyRDFa+anew%7D%0A%5Cpmclassification%7Bmsc%7D%7B15A60%7D%0A%5Cpmclassification%7Bmsc%7D%7B55-XX%7D%0A%5Cpmclassification%7Bmsc%7D%7B46E30%7D%0A%5Cpmclassification%7Bmsc%7D%7B42B10%7D%0A%5Cpmclassification%7Bmsc%7D%7B42B05%7D%0A%5Cpmsynonym%7BHolder+inequality%7D%7BHolderInequality%7D%0A%5Cpmsynonym%7BHoelder+inequality%7D%7BHolderInequality%7D%0A%25%5Cpmkeywords%7Bvector%7D%0A%25%5Cpmkeywords%7Bnorm%7D%0A%5Cpmrelated%7BVectorPnorm%7D%0A%5Cpmrelated%7BCauchySchwartzInequality%7D%0A%5Cpmrelated%7BCauchySchwarzInequality%7D%0A%5Cpmrelated%7BProofOfMinkowskiInequality%7D%0A%5Cpmrelated%7BConjugateIndex%7D%0A%5Cpmrelated%7BBoundedLinearFunctionalsOnLpmu%7D%0A%5Cpmrelated%7BConvolutionsOfComplexFunctionsOnLocallyCompactGroups%7D%0A%5Cpmrelated%7BLpNormIsDualToLq%7D%0A%0A%5Cbegin%7Bdocument%7D'
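For anyone who wants to rebuild a request like this from smaller inputs, here is a minimal Python sketch of how such a form body can be assembled. The field names (profile, tex, preamble) and the "literal:" prefix are taken from the command above. The function name `build_body` is hypothetical, and nothing is actually sent to the server.

```python
from urllib.parse import urlencode


def build_body(tex: str, preamble: str, profile: str = "planetmath") -> str:
    """URL-encode the three form fields used in the curl command above."""
    return urlencode({
        "profile": profile,
        "tex": tex,
        "preamble": "literal:" + preamble,
    })


# First few words of the document, as raw LaTeX.
body = build_body(r'The \emph{H\"older inequality}', r"\usepackage{amssymb}")
print(body)
```

Note that urlencode() also escapes the colon after "literal" as %3A, which the original request leaves unescaped. A form decoder reads both spellings the same way.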

preamble:

\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}

document:

The \emph{H\"older inequality} concerns \emph{vector p-norms}: given
$1 \leq p$, $q \leq \infty$,

\begin{displaymath}
    \mbox{If }\frac{1}{p}+\frac{1}{q}=1\mbox{ then }|x^Ty| \leq ||\,x\,||_p||\,y\,||_q
\end{displaymath}

An important instance of a H\"older inequality is the
\emph{Cauchy-Schwarz inequality}.

There is a version of this result for the \PMlinkname{$L^p$ spaces}{LpSpace}.
If a function $f$ is in $L^p(X)$, then the $L^p$-norm of $f$ is denoted
$||\,f\,||_p$.
Given a measure space $(X,\mathfrak{B},\mu)$, if $f$ is in $L^p(X)$
and $g$ is in $L^q(X)$ (with $1/p + 1/q = 1$), then
the H\"older inequality becomes

\begin{eqnarray*}
\Vert fg\Vert_1 = \int_X \vert fg\vert \mathrm{d}\mu
                      & \le &
\left(\int_X|f|^p\mathrm{d}\mu\right)^{\frac{1}{p}}
\left(\int_X|g|^q\mathrm{d}\mu\right)^{\frac{1}{q}}\\
& = & \Vert f\Vert_p\,\Vert g \Vert_q
\end{eqnarray*}

holtzermann17 commented 11 years ago

Oh, of course, if you do try to reproduce this by creating a Basic Page, you should use the Full HTML text format.

brucemiller commented 11 years ago

I certainly want to help solve the problem, but it is hard to see it as a LaTeXML issue, as such, unless it is producing somehow invalid UTF8, or the output doesn't have the right declarations of utf-ness in it, or something similar.

holtzermann17 commented 11 years ago

Using the above procedure with Basic Pages, I can narrow the Minimal (not) Working Example down to this:

a measure space $(X,\mathfrak{B},\mu)$,..

or indeed, this:

$\mathfrak{B}$

With an old LaTeXML version, I get back something like:

<math alttext="\mathfrak{B}" display="inline"><semantics>
<mi mathvariant="fraktur">B</mi>
<annotation-xml encoding="MathML-Content">
<ci xmlns="http://www.w3.org/1998/Math/MathML">B</ci>
</annotation-xml><annotation encoding="application/x-tex">
\mathfrak{B}</annotation></semantics></math>

This renders, but without the Fraktur font: http://beta.planetmath.org/testencoding

Whereas with a newer LaTeXML, I get back some XHTML that includes the Unicode character U+1D505 (MATHEMATICAL FRAKTUR CAPITAL B).

This suggests that it is actually a certain subset of Unicode characters (like U+1D505) that is causing trouble for PHP/PDO. Presumably it should work; in other words, my database is still set up wrong, but at least I've traced it down to something minimal.
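To narrow a failing conversion down in the same way, it can help to list the characters in the returned markup that lie above U+FFFF. A minimal Python sketch follows; the function name `find_non_bmp` and the sample fragment are made up for illustration and are not actual LaTeXML output.

```python
import unicodedata


def find_non_bmp(text: str):
    """Return (index, code point, name) for every character above U+FFFF."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]


sample = "<mi>\U0001D505</mi>"  # hypothetical fragment
print(find_non_bmp(sample))
```

An empty result means the text contains no such characters, so the cause would have to be looked for elsewhere.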

dginev commented 11 years ago

So you're saying that LaTeXML produces valid UTF-8, but that chokes MySQL? That makes sense and won't be the first time we hit that problem.

brucemiller commented 11 years ago

AH! Now I see what's going on. Yes, LaTeXML changed to produce Plane 1 characters for styled math symbols by default, as opposed to optionally. You can turn that off using the --noplane1 option.

But I'd recommend you only turn it off for testing; you kinda want the new default behaviour, since otherwise browsers without enough of the right fonts will show a plain "B" instead of the fraktur B. Or test and see which way you prefer.

Plane 1 is beyond 16 bits (code points above U+FFFF), so it is more likely to stress some applications.
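To make the "beyond 16 bits" point concrete, here is a small Python sketch that reports the plane and encoded size of a few characters. The helper name `describe` is made up for this example. U+212D (BLACK-LETTER CAPITAL C) is included only as a Fraktur-style letter that still sits in Plane 0.

```python
def describe(ch: str) -> dict:
    """Report the plane and the encoded sizes of a single character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,  # Plane 0 is the Basic Multilingual Plane
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


for ch in ("B", "\u212D", "\U0001D505"):
    print(describe(ch))
```

Only U+1D505 lands in Plane 1. It needs four bytes in UTF-8 and a surrogate pair (two 16-bit units) in UTF-16, which is the kind of input that software written with Plane 0 in mind tends to mishandle.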

holtzermann17 commented 11 years ago

Yes... it seems like Drupal/PDO/MySQL is having trouble with this new Plane 1 UTF-8.

I posted a question on the Drupal StackExchange site, which takes LaTeXML out of the loop for now, because we're down to a one-character MWE.

http://drupal.stackexchange.com/questions/50868/configuring-drupal-to-use-unicode-characters

Maybe someone there will know what to do next!

But, actually it seems that now that I know the right search terms, the answer might be here:

http://stackoverflow.com/questions/11936950/inserting-utf-8-encoded-string-into-utf-8-encoded-mysql-table-fails-with-incorr

"MySQL charset utf8 only accepts UTF-8 characters if they can be represented in 3 bytes. If you need to store this in MySQL, you'll need to use MySQL charset utf8mb4."

... In which case, this issue could potentially end up affecting a lot of LaTeXML users, so I think it's a good thing I've been discussing with you.
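The fix named in the quoted answer is to move the database to utf8mb4. Where that is not possible right away, one stopgap is to replace every character above U+FFFF with an XML numeric character reference before saving, so that the stored markup contains only characters a 3-byte utf8 column accepts. The following is a minimal Python sketch under that assumption; the function names are hypothetical, and this is not something LaTeXML or Drupal does for you.

```python
import re

# Matches any single character outside the Basic Multilingual Plane.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def escape_non_bmp(markup: str) -> str:
    """Replace characters above U+FFFF with references such as &#x1D505;."""
    return _NON_BMP.sub(lambda m: f"&#x{ord(m.group()):X};", markup)


def fits_three_byte_utf8(text: str) -> bool:
    """True if no character needs more than 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


fragment = "<mi>\U0001D505</mi>"  # hypothetical fragment
escaped = escape_non_bmp(fragment)
print(escaped)
print(fits_three_byte_utf8(fragment), fits_three_byte_utf8(escaped))
```

A browser parsing the page as XHTML or HTML resolves the reference back to the same character, so rendering should be unaffected. The trade-off is that the stored text no longer contains the literal character, which matters for anything that searches or indexes the database directly.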

dginev commented 11 years ago

Joe, is this ready to be closed?

dginev commented 11 years ago

I think we've established the core issue regarding MySQL and UTF-8, and we'll keep in mind relaying that information to any other derivative applications around LaTeXML. Ideally we could add a footnote or so in the LaTeXML manual?

In any case, closing this issue.

brucemiller commented 11 years ago

Sure; although it's one of those obvious-once-you-realize-it items. Where in the manual would it be noticeable/findable?