commonmark / cmark

CommonMark parsing and rendering library and program in C

Add sourcepos to latex output #490

Open jeroen opened 7 months ago

jeroen commented 7 months ago

There has been a request from the R community to support some sort of sourcepos feature for latex output, such that we can map problems/typos in latex back to the input md (similar to HTML).

Is this something that we may want to support? If so, what would be a good format? Given that sourcepos is not part of the spec and is opt-in, hopefully we have some room to experiment.

This PR shows the simplest example that I could come up with. It is similar to the HTML version of render_sourcepos, but instead adds a LaTeX comment (%) with the sourcepos numbers before every CR (except within verbatim).
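
As a rough illustration of the comment format described above, here is a minimal sketch in C. It is not the code from this PR: `format_sourcepos_comment` is a hypothetical helper name, and the four position values are assumed to be supplied by the caller (in cmark they would presumably come from the node's start/end line and column).

```c
#include <stdio.h>

/* Hypothetical helper (not taken from the PR): writes a LaTeX comment of the
 * form " %sourcepos(L1:C1-L2:C2)" into buf, including the leading space.
 * Returns what snprintf returns: the number of characters that the full
 * comment needs, excluding the terminating NUL. */
int format_sourcepos_comment(char *buf, size_t size,
                             int start_line, int start_col,
                             int end_line, int end_col) {
  /* "%%" emits a literal '%', which starts the LaTeX comment. */
  return snprintf(buf, size, " %%sourcepos(%d:%d-%d:%d)",
                  start_line, start_col, end_line, end_col);
}
```

Calling it with the positions 1, 1, 1, 18 yields the suffix seen on the first heading of the example output.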

Below is example output for Test.md.

\section{Markdown: Syntax} %sourcepos(1:1-1:18)

\begin{itemize} %sourcepos(3:1-23:0)
\item \protect\hyperlink{overview}{Overview} %sourcepos(3:5-3:25)

\begin{itemize} %sourcepos(4:5-6:64)
\item \protect\hyperlink{philosophy}{Philosophy} %sourcepos(4:9-4:33)

\item \protect\hyperlink{html}{Inline HTML} %sourcepos(5:9-5:28)

\item \protect\hyperlink{autoescape}{Automatic Escaping for Special Characters} %sourcepos(6:9-6:64)

\end{itemize} %sourcepos(4:5-6:64)

\item \protect\hyperlink{block}{Block Elements} %sourcepos(7:5-7:28)

\begin{itemize} %sourcepos(8:5-13:31)
\item \protect\hyperlink{p}{Paragraphs and Line Breaks} %sourcepos(8:9-8:40)

\item \protect\hyperlink{header}{Headers} %sourcepos(9:9-9:26)

\item \protect\hyperlink{blockquote}{Blockquotes} %sourcepos(10:9-10:34)

\item \protect\hyperlink{list}{Lists} %sourcepos(11:9-11:22)

\item \protect\hyperlink{precode}{Code Blocks} %sourcepos(12:9-12:31)

\item \protect\hyperlink{hr}{Horizontal Rules} %sourcepos(13:9-13:31)

\end{itemize} %sourcepos(8:5-13:31)

\item \protect\hyperlink{span}{Span Elements} %sourcepos(14:5-14:26)

\begin{itemize} %sourcepos(15:5-18:22)
\item \protect\hyperlink{link}{Links} %sourcepos(15:9-15:22)

\item \protect\hyperlink{em}{Emphasis} %sourcepos(16:9-16:23)

\item \protect\hyperlink{code}{Code} %sourcepos(17:9-17:21)

\item \protect\hyperlink{img}{Images} %sourcepos(18:9-18:22)

\end{itemize} %sourcepos(15:5-18:22)

\item \protect\hyperlink{misc}{Miscellaneous} %sourcepos(19:5-19:26)

\begin{itemize} %sourcepos(20:5-23:0)
\item \protect\hyperlink{backslash}{Backslash Escapes} %sourcepos(20:9-20:39)

\item \protect\hyperlink{autolink}{Automatic Links} %sourcepos(21:9-21:36)

\end{itemize} %sourcepos(20:5-23:0)

\end{itemize} %sourcepos(3:1-23:0)

\textbf{Note:} This document is itself written using Markdown; you
can {see the source for it by adding \textquotesingle{}.text\textquotesingle{} to the URL}. %sourcepos(24:1-25:89)

\begin{center}\rule{0.5\linewidth}{\linethickness}\end{center} %sourcepos(27:1-28:0)

\subsection{Overview} %sourcepos(29:1-29:11)

\subsubsection{Philosophy} %sourcepos(31:1-31:14)

Markdown is intended to be as easy-to-read and easy-to-write as is feasible. %sourcepos(33:1-33:76)

Readability, however, is emphasized above all else. A Markdown-formatted
document should be publishable as-is, as plain text, without looking
like it\textquotesingle{}s been marked up with tags or formatting instructions. While
Markdown\textquotesingle{}s syntax has been influenced by several existing text-to-HTML
filters -{}- including \href{http://docutils.sourceforge.net/mirror/setext.html}{Setext}, \href{http://www.aaronsw.com/2002/atx/}{atx}, \href{http://textism.com/tools/textile/}{Textile}, \href{http://docutils.sourceforge.net/rst.html}{reStructuredText},
\href{http://www.triptico.com/software/grutatxt.html}{Grutatext}, and \href{http://ettext.taint.org/doc/}{EtText} -{}- the single biggest source of
inspiration for Markdown\textquotesingle{}s syntax is the format of plain text email. %sourcepos(35:1-41:68)

\subsection{Block Elements} %sourcepos(43:1-43:17)

\subsubsection{Paragraphs and Line Breaks} %sourcepos(45:1-45:30)

A paragraph is simply one or more consecutive lines of text, separated
by one or more blank lines. (A blank line is any line that looks like a
blank line -{}- a line containing nothing but spaces or tabs is considered
blank.) Normal paragraphs should not be indented with spaces or tabs. %sourcepos(47:1-50:69)

The implication of the \textquotedbl{}one or more consecutive lines of text\textquotedbl{} rule is
that Markdown supports \textquotedbl{}hard-wrapped\textquotedbl{} text paragraphs. This differs
significantly from most other text-to-HTML formatters (including Movable
Type\textquotesingle{}s \textquotedbl{}Convert Line Breaks\textquotedbl{} option) which translate every line break
character in a paragraph into a \texttt{\textless{}br /\textgreater{}} tag. %sourcepos(52:1-56:45)

When you \emph{do} want to insert a \texttt{\textless{}br /\textgreater{}} break tag using Markdown, you
end a line with two or more spaces, then type return. %sourcepos(58:1-59:53)

\subsubsection{Headers} %sourcepos(61:1-61:11)

Markdown supports two styles of headers, {[}Setext{]} {[}1{]} and {[}atx{]} {[}2{]}. %sourcepos(63:1-63:68)

Optionally, you may \textquotedbl{}close\textquotedbl{} atx-style headers. This is purely
cosmetic -{}- you can use this if you think it looks better. The
closing hashes don\textquotesingle{}t even need to match the number of hashes
used to open the header. (The number of opening hashes
determines the header level.) %sourcepos(65:1-69:29)

\subsubsection{Blockquotes} %sourcepos(72:1-72:15)

Markdown uses email-style \texttt{\textgreater{}} characters for blockquoting. If you\textquotesingle{}re
familiar with quoting passages of text in an email message, then you
know how to create a blockquote in Markdown. It looks best if you hard
wrap the text and put a \texttt{\textgreater{}} before every line: %sourcepos(74:1-77:46)

\begin{quote} %sourcepos(79:1-84:47)
This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet,
consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus.
Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus. %sourcepos(79:3-81:72)

Donec sit amet nisl. Aliquam semper ipsum sit amet velit. Suspendisse
id sem consectetuer libero luctus adipiscing. %sourcepos(83:3-84:47)

\end{quote} %sourcepos(79:1-84:47)

Markdown allows you to be lazy and only put the \texttt{\textgreater{}} before the first
line of a hard-wrapped paragraph: %sourcepos(86:1-87:33)

\begin{quote} %sourcepos(89:1-91:70)
This is a blockquote with two paragraphs. Lorem ipsum dolor sit amet,
consectetuer adipiscing elit. Aliquam hendrerit mi posuere lectus.
Vestibulum enim wisi, viverra nec, fringilla in, laoreet vitae, risus. %sourcepos(89:3-91:70)

\end{quote} %sourcepos(89:1-91:70)

\begin{quote} %sourcepos(93:1-94:45)
Donec sit amet nisl. Aliquam semper ipsum sit amet velit. Suspendisse
id sem consectetuer libero luctus adipiscing. %sourcepos(93:3-94:45)

\end{quote} %sourcepos(93:1-94:45)

Blockquotes can be nested (i.e. a blockquote-in-a-blockquote) by
adding additional levels of \texttt{\textgreater{}}: %sourcepos(96:1-97:32)

\begin{quote} %sourcepos(99:1-103:26)
This is the first level of quoting. %sourcepos(99:3-99:37)

\begin{quote} %sourcepos(101:3-101:30)
This is nested blockquote. %sourcepos(101:5-101:30)

\end{quote} %sourcepos(101:3-101:30)

Back to the first level. %sourcepos(103:3-103:26)

\end{quote} %sourcepos(99:1-103:26)

Blockquotes can contain other Markdown elements, including headers, lists,
and code blocks: %sourcepos(105:1-106:16)

\begin{quote} %sourcepos(108:1-115:58)
\subsection{This is a header.} %sourcepos(108:3-108:22)

\begin{enumerate} %sourcepos(110:3-112:2)
\item This is the first list item. %sourcepos(110:8-110:35)

\item This is the second list item. %sourcepos(111:8-111:36)

\end{enumerate} %sourcepos(110:3-112:2)

Here\textquotesingle{}s some example code: %sourcepos(113:3-113:27)

\begin{verbatim}
return shell_exec("echo $input | $markdown_script");
\end{verbatim} %sourcepos(115:7-115:58)

\end{quote} %sourcepos(108:1-115:58)

Any decent text editor should make email-style quoting easy. For
example, with BBEdit, you can make a selection and choose Increase
Quote Level from the Text menu. %sourcepos(117:1-119:31)

\subsubsection{Lists} %sourcepos(122:1-122:9)

Markdown supports ordered (numbered) and unordered (bulleted) lists. %sourcepos(124:1-124:68)

Unordered lists use asterisks, pluses, and hyphens -{}- interchangably
-{}- as list markers: %sourcepos(126:1-127:19)

\begin{itemize} %sourcepos(129:1-132:0)
\item Red %sourcepos(129:5-129:7)

\item Green %sourcepos(130:5-130:9)

\item Blue %sourcepos(131:5-131:8)

\end{itemize} %sourcepos(129:1-132:0)

is equivalent to: %sourcepos(133:1-133:17)

\begin{itemize} %sourcepos(135:1-138:0)
\item Red %sourcepos(135:5-135:7)

\item Green %sourcepos(136:5-136:9)

\item Blue %sourcepos(137:5-137:8)

\end{itemize} %sourcepos(135:1-138:0)

and: %sourcepos(139:1-139:4)

\begin{itemize} %sourcepos(141:1-144:0)
\item Red %sourcepos(141:5-141:7)

\item Green %sourcepos(142:5-142:9)

\item Blue %sourcepos(143:5-143:8)

\end{itemize} %sourcepos(141:1-144:0)

Ordered lists use numbers followed by periods: %sourcepos(145:1-145:46)

\begin{enumerate} %sourcepos(147:1-150:0)
\item Bird %sourcepos(147:5-147:8)

\item McHale %sourcepos(148:5-148:10)

\item Parish %sourcepos(149:5-149:10)

\end{enumerate} %sourcepos(147:1-150:0)

It\textquotesingle{}s important to note that the actual numbers you use to mark the
list have no effect on the HTML output Markdown produces. The HTML
Markdown produces from the above list is: %sourcepos(151:1-153:41)

If you instead wrote the list in Markdown like this: %sourcepos(155:1-155:52)

\begin{enumerate} %sourcepos(157:1-160:0)
\item Bird %sourcepos(157:5-157:8)

\item McHale %sourcepos(158:5-158:10)

\item Parish %sourcepos(159:5-159:10)

\end{enumerate} %sourcepos(157:1-160:0)

or even: %sourcepos(161:1-161:8)

\begin{enumerate} %sourcepos(163:1-166:0)
\setcounter{enumi}{3} %sourcepos(163:1-166:0)
\item Bird %sourcepos(163:4-163:7)

\item McHale %sourcepos(164:4-164:9)

\item Parish %sourcepos(165:4-165:9)

\end{enumerate} %sourcepos(163:1-166:0)

you\textquotesingle{}d get the exact same HTML output. The point is, if you want to,
you can use ordinal numbers in your ordered Markdown lists, so that
the numbers in your source match the numbers in your published HTML.
But if you want to be lazy, you don\textquotesingle{}t have to. %sourcepos(167:1-170:46)

To make lists look nice, you can wrap items with hanging indents: %sourcepos(172:1-172:65)

\begin{itemize} %sourcepos(174:1-179:0)
\item Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Aliquam hendrerit mi posuere lectus. Vestibulum enim wisi,
viverra nec, fringilla in, laoreet vitae, risus. %sourcepos(174:5-176:52)

\item Donec sit amet nisl. Aliquam semper ipsum sit amet velit.
Suspendisse id sem consectetuer libero luctus adipiscing. %sourcepos(177:5-178:61)

\end{itemize} %sourcepos(174:1-179:0)

But if you want to be lazy, you don\textquotesingle{}t have to: %sourcepos(180:1-180:46)

\begin{itemize} %sourcepos(182:1-187:0)
\item Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Aliquam hendrerit mi posuere lectus. Vestibulum enim wisi,
viverra nec, fringilla in, laoreet vitae, risus. %sourcepos(182:5-184:48)

\item Donec sit amet nisl. Aliquam semper ipsum sit amet velit.
Suspendisse id sem consectetuer libero luctus adipiscing. %sourcepos(185:5-186:57)

\end{itemize} %sourcepos(182:1-187:0)

List items may consist of multiple paragraphs. Each subsequent
paragraph in a list item must be indented by either 4 spaces
or one tab: %sourcepos(188:1-190:11)

\begin{enumerate} %sourcepos(192:1-201:0)
\item This is a list item with two paragraphs. Lorem ipsum dolor
sit amet, consectetuer adipiscing elit. Aliquam hendrerit
mi posuere lectus. %sourcepos(192:5-194:22)

Vestibulum enim wisi, viverra nec, fringilla in, laoreet
vitae, risus. Donec sit amet nisl. Aliquam semper ipsum
sit amet velit. %sourcepos(196:5-198:19)

\item Suspendisse id sem consectetuer libero luctus adipiscing. %sourcepos(200:5-200:61)

\end{enumerate} %sourcepos(192:1-201:0)

It looks nice if you indent every line of the subsequent
paragraphs, but here again, Markdown will allow you to be
lazy: %sourcepos(202:1-204:5)

\begin{itemize} %sourcepos(206:1-213:0)
\item This is a list item with two paragraphs. %sourcepos(206:5-206:44)

This is the second paragraph in the list item. You\textquotesingle{}re
only required to indent the first line. Lorem ipsum dolor
sit amet, consectetuer adipiscing elit. %sourcepos(208:5-210:39)

\item Another item in the same list. %sourcepos(212:5-212:34)

\end{itemize} %sourcepos(206:1-213:0)

To put a blockquote within a list item, the blockquote\textquotesingle{}s \texttt{\textgreater{}}
delimiters need to be indented: %sourcepos(214:1-215:31)

\begin{itemize} %sourcepos(217:1-221:0)
\item A list item with a blockquote: %sourcepos(217:5-217:34)

\begin{quote} %sourcepos(219:5-220:25)
This is a blockquote
inside a list item. %sourcepos(219:7-220:25)

\end{quote} %sourcepos(219:5-220:25)

\end{itemize} %sourcepos(217:1-221:0)

To put a code block within a list item, the code block needs
to be indented \emph{twice} -{}- 8 spaces or two tabs: %sourcepos(222:1-223:47)

\begin{itemize} %sourcepos(225:1-228:0)
\item A list item with a code block: %sourcepos(225:5-225:34)

\begin{verbatim}
<code goes here>
\end{verbatim} %sourcepos(227:9-228:0)

\end{itemize} %sourcepos(225:1-228:0)

\subsubsection{Code Blocks} %sourcepos(229:1-229:15)

Pre-formatted code blocks are used for writing about programming or
markup source code. Rather than forming normal paragraphs, the lines
of a code block are interpreted literally. Markdown wraps a code block
in both \texttt{\textless{}pre\textgreater{}} and \texttt{\textless{}code\textgreater{}} tags. %sourcepos(231:1-234:34)

To produce a code block in Markdown, simply indent every line of the
block by at least 4 spaces or 1 tab. %sourcepos(236:1-237:36)

This is a normal paragraph: %sourcepos(239:1-239:27)

\begin{verbatim}
This is a code block.
\end{verbatim} %sourcepos(241:5-242:0)

Here is an example of AppleScript: %sourcepos(243:1-243:34)

\begin{verbatim}
tell application "Foo"
    beep
end tell
\end{verbatim} %sourcepos(245:5-248:0)

A code block continues until it reaches a line that is not indented
(or the end of the article). %sourcepos(249:1-250:28)

Within a code block, ampersands (\texttt{\&}) and angle brackets (\texttt{\textless{}} and \texttt{\textgreater{}})
are automatically converted into HTML entities. This makes it very
easy to include example HTML source code using Markdown -{}- just paste
it and indent it, and Markdown will handle the hassle of encoding the
ampersands and angle brackets. For example, this: %sourcepos(252:1-256:49)

\begin{verbatim}
<div class="footer">
    &copy; 2004 Foo Corporation
</div>
\end{verbatim} %sourcepos(258:5-261:0)

Regular Markdown syntax is not processed within code blocks. E.g.,
asterisks are just literal asterisks within a code block. This means
it\textquotesingle{}s also easy to use Markdown to write about Markdown\textquotesingle{}s own syntax. %sourcepos(262:1-264:68)

\begin{verbatim}
tell application "Foo"
    beep
end tell
\end{verbatim} %sourcepos(266:1-270:3)

\subsection{Span Elements} %sourcepos(272:1-272:16)

\subsubsection{Links} %sourcepos(274:1-274:9)

Markdown supports two style of links: \emph{inline} and \emph{reference}. %sourcepos(276:1-276:63)

In both styles, the link text is delimited by {[}square brackets{]}. %sourcepos(278:1-278:64)

To create an inline link, use a set of regular parentheses immediately
after the link text\textquotesingle{}s closing square bracket. Inside the parentheses,
put the URL where you want the link to point, along with an \emph{optional}
title for the link, surrounded in quotes. For example: %sourcepos(280:1-283:54)

This is \href{http://example.com/}{an example} inline link. %sourcepos(285:1-285:54)

\href{http://example.net/}{This link} has no title attribute. %sourcepos(287:1-287:56)

\subsubsection{Emphasis} %sourcepos(289:1-289:12)

Markdown treats asterisks (\texttt{*}) and underscores (\texttt{\_}) as indicators of
emphasis. Text wrapped with one \texttt{*} or \texttt{\_} will be wrapped with an
HTML \texttt{\textless{}em\textgreater{}} tag; double \texttt{*}\textquotesingle{}s or \texttt{\_}\textquotesingle{}s will be wrapped with an HTML
\texttt{\textless{}strong\textgreater{}} tag. E.g., this input: %sourcepos(291:1-294:33)

\emph{single asterisks} %sourcepos(296:1-296:18)

\emph{single underscores} %sourcepos(298:1-298:20)

\textbf{double asterisks} %sourcepos(300:1-300:20)

\textbf{double underscores} %sourcepos(302:1-302:22)

\subsubsection{Code} %sourcepos(304:1-304:8)

To indicate a span of code, wrap it with backtick quotes (\texttt{`}).
Unlike a pre-formatted code block, a code span indicates code within a
normal paragraph. For example: %sourcepos(306:1-308:30)

Use the \texttt{printf()} function. %sourcepos(310:1-310:28)
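
For the use case of mapping a LaTeX problem back to the input md, a consumer could recover the positions from any output line that carries such a comment. The following is a minimal sketch of that step in C, under the assumption that the comment format stays as shown above; `struct sourcepos` and `parse_sourcepos_comment` are hypothetical names and not part of cmark.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical consumer-side type, not part of cmark. */
struct sourcepos {
  int start_line, start_col, end_line, end_col;
};

/* Finds the last "%sourcepos(" marker on a line of LaTeX output and parses
 * "L1:C1-L2:C2)" after it. Returns 1 and fills *out on success; returns 0
 * and leaves *out untouched if the marker is absent or malformed. */
int parse_sourcepos_comment(const char *line, struct sourcepos *out) {
  const char *marker = "%sourcepos(";
  const char *found = NULL;
  const char *scan = line;
  struct sourcepos tmp;
  char closing = 0;
  int fields;

  /* Use the last occurrence, because the comment is appended at the end. */
  while ((scan = strstr(scan, marker)) != NULL) {
    found = scan;
    scan += strlen(marker);
  }
  if (found == NULL)
    return 0;

  fields = sscanf(found + strlen(marker), "%d:%d-%d:%d%c",
                  &tmp.start_line, &tmp.start_col,
                  &tmp.end_line, &tmp.end_col, &closing);
  if (fields != 5 || closing != ')')
    return 0;

  *out = tmp;
  return 1;
}
```

Lines without a comment, such as the contents of a verbatim block, simply return 0, so a caller can fall back to the nearest preceding or following line that does carry a position.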

jgm commented 7 months ago

Seems like a reasonable syntax. Might there be issues in some cases with the space before the comment?

jeroen commented 7 months ago

> Seems like a reasonable syntax. Might there be issues in some cases with the space before the comment?

I can't think of any. But we could remove the space if that is safer.

jgm commented 7 months ago

I guess it should be okay to leave the space. Have you tested a range of documents with this, to make sure there are no adverse effects?