demydd / pandoc

Automatically exported from code.google.com/p/pandoc

citation support #55

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Citation support would be valuable. 

The typical approach in existing systems is to use a natural language token
key as a reference id. In LaTeX, for example, one would write \cite{doe99} in
the document, which would match an equivalent BibTeX key. The equivalent in
reStructuredText would be [doe99]_.

There are a few wrinkles, though. 

1) citations may have more than one reference; e.g. (Doe, 1999; Smith,
2000). As such, the markup would probably need to allow multiple keys.

2) one must in some cases be able to specify the locational information
within a document, typically a page number; e.g. (Doe, 1999:23) where "23"
is the page number.

3) citation references may have local styling modifiers (like suppress
author) and/or prefix and suffix information (see Doe, 1999).

4) cited content is increasingly online these days, a trend that will only
continue. I wonder, then, whether it would make sense to allow a URI as a
key? E.g. [http://ex.net/docs/1]_.

All of which suggests that a comprehensive and flexible system may mean
something like:

     [doe99:page=23;http://ex.net/docs/1]_

... where there is a delimiter (in this case a semi-colon) for multiple
references, a way to indicate page numbers and such, and URIs can be used.
Not sure what to do about local styling switches, though.

Finally, I agree with a mailing list comment that it would be nice to allow
use of BibTeX data files, but I'd also like to see more flexibility as
well. For background, I am the author of CSL, the XML citation styling
language.

Original issue reported on code.google.com by bdar...@gmail.com on 10 Feb 2008 at 3:22

GoogleCodeExporter commented 8 years ago
Hmm ... I guess that colon delimiter would be a problem with URIs; maybe then ...

     [doe99@page=23;see also \http://ex.net/docs/1]_

...?

Original comment by bdar...@gmail.com on 10 Feb 2008 at 3:26

GoogleCodeExporter commented 8 years ago

Original comment by fiddloso...@gmail.com on 18 Feb 2008 at 4:23

GoogleCodeExporter commented 8 years ago
Hi Bruce,

it's a pleasure to finally get to meet you personally, so to speak.
I've been following the bibliographic debate you have been animating
for quite a few years. I was developing a wiki, UniWakka, with
bibliographic support and I studied both the WIKINDX approach (I
should still be a SourceForge member of that project) and your xbiblio
project - I'll say something about that below.

To make a long story long, and starting from the end - just to add
some confusion - let me describe briefly the trivial patch I attached
here. It adds citation support in the native pandoc type system and a
parser for markdown.

The syntax is:
[citationLabel:optionalLocation;otherOptionalLabels]_

so the following input:

Citation test: [rossato2005: p. 26; caso1999; pascuzzi2000: p, 28]_.

will produce:

Pandoc (Meta [] [] "")
[ Para [Str "Citation",Space,Str "test:",Space,Citation
[("rossato2005","p. 26"),("caso1999",""),("pascuzzi2000","p, 28")],Str "."] ]

If John is going to accept something like this, coding the export
writers would take a second, obviously, unless we start discussing
using an external library for rendering those citations... which is
actually my final aim...;)

BTW, if John is going to accept something like this, we should discuss
a bit about the markdown syntax. I don't think URIs are feasible:
Bruce's second suggestion interferes with mailto URIs, (and even with
some URLs). I don't like the solution of hard-coding protocols in the
parser, but this is a possible solution.

Still I prefer the [label:location]_ solution.

But the real stuff I'd like to start coding is the separate library
for rendering the citations, obviously, which brings me to xbiblio,
CSL and the citeproc implementation.

I must confess that the CSL documentation is not quite clear to me:
I'm not really proficient with XSLT, and it is not really clear how
easy it would be to implement CSL completely enough to handle
footnote citations as needed by a European legal scholar like me.

I had a look at citeproc (with my limited knowledge, especially for
the 2.0 part of xslt), I also had a look at the ruby implementation
(but I don't know ruby very well), but I was not able to imagine a
clean Haskell implementation.

I've been struggling with the need for some bibliographic tools for a
long time. A solution I adopted was to fork the dying Wakka wiki
engine, so I started developing UniWakka, by adding a bibliographic
engine based on LaTeX.

I was following Bruce's work and WIKINDX (we had some plan of
integrating UniWakka with WIKINDX), but the PHP support was totally
lacking and I had not enough programming knowledge to implement a
citation style language myself. But I've been thinking about it since
then.

Now I think that some of the Haskell features could make the
implementation of such a citation style quite easy. I'm thinking about
using type classes to define a class of bibliographic entry types
(modelled on MODS) and a class of citation style types (which could be
modelled on CSL). And then a rendering engine that takes a bibliographic
object (a type with all the needed methods' implementations) and a style
object (a type with the stylistic methods used in the rendering
process - author sorting and rendering functions, year disambiguation,
location formatting, etc.) to render the citation where needed: in the
text, in a footnote or in a bibliographic list.

Such a back-end could be used to implement a specific citation style
or - if I get it right - ... a citation style language like CSL, I
believe.
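
To make the type-class idea concrete, here is a minimal sketch of what it
might look like. Every name in it (BibEntry, CiteStyle, Book, AuthorDate,
renderInText) is hypothetical and invented for illustration; none of it comes
from the attached patch, from MODS, or from CSL.

```haskell
-- Hypothetical class of bibliographic entry types.
class BibEntry e where
  entryAuthors :: e -> [String]
  entryYear    :: e -> String
  entryTitle   :: e -> String

-- Hypothetical class of citation style types.
class CiteStyle s where
  formatAuthors :: s -> [String] -> String
  formatLocator :: s -> String -> String

-- One concrete entry type (illustrative only).
data Book = Book
  { bookAuthors :: [String]
  , bookYear    :: String
  , bookTitle   :: String
  }

instance BibEntry Book where
  entryAuthors = bookAuthors
  entryYear    = bookYear
  entryTitle   = bookTitle

-- One concrete style: a simple author-date style (illustrative only).
data AuthorDate = AuthorDate

instance CiteStyle AuthorDate where
  formatAuthors _ []      = "Anon."
  formatAuthors _ [a]     = a
  formatAuthors _ [a, b]  = a ++ " and " ++ b
  formatAuthors _ (a : _) = a ++ " et al."
  formatLocator _ ""  = ""
  formatLocator _ loc = ":" ++ loc

-- The rendering engine only talks to the two classes, so any entry type
-- can be combined with any style type.
renderInText :: (CiteStyle s, BibEntry e) => s -> e -> String -> String
renderInText style entry loc =
  "(" ++ formatAuthors style (entryAuthors entry)
      ++ ", " ++ entryYear entry
      ++ formatLocator style loc ++ ")"
```

With these definitions, `renderInText AuthorDate (Book ["Doe"] "1999" "Some
article") "23"` yields `(Doe, 1999:23)`. Footnote and bibliography-list
renderers would be further functions over the same two classes.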

That's the idea, but this would just be a library pandoc could be
using.

This is it. Sorry for such a long post, but I hope this is the right
place to gather ideas on this kind of project (I've been dreaming
about it since I gained a deeper knowledge of the Haskell
mysteries...;-).

Andrea Rossato

Original comment by andrea.rossato@gmail.com on 11 Mar 2008 at 12:13

Attachments:

GoogleCodeExporter commented 8 years ago
Hi Andrea. On citeproc/csl, I'm not much of a programmer, and don't really do
Haskell at all, but I could totally see an implementation. Haskell's a
functional language, yes? So is XSLT. In the (now outdated) XSLT version of
citeproc, you basically are iterating through three (sometimes complex) lists:
the instructions for citation formatting from the CSL files, the list of
citations, and the data from the bibliographic source descriptions.

So the process would probably go something like:

1. load and parse CSL (basically just map XML to nested Haskell lists)
2. load citations (another list)
3. iterate through 2; load source data, map to internal model
4. run formatting process, which iterates through 1, passing through formatting
parameters to another list, which includes the formatted data
5. run generic output through output filter (to TeX, HTML, etc.)
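
The steps above could be sketched in Haskell roughly as follows. This is only
an illustration under loud assumptions: the style is taken as an already
parsed flat list (step 1, the XML parsing, is left out), the source data is an
in-memory association list, and all names (Instruction, Target, resolve,
format, output, process) are made up here rather than taken from citeproc.

```haskell
import Data.Maybe (fromMaybe, mapMaybe)

type Key   = String
type Entry = [(String, String)]  -- field name / value pairs for one source

-- Stand-in for parsed CSL instructions (assumption: already loaded).
data Instruction = Field String | Literal String

-- Output formats handled by the final filter.
data Target = HTML | TeX

-- Step 3: look up each cited key; unknown keys are silently dropped.
resolve :: [(Key, Entry)] -> [Key] -> [Entry]
resolve db = mapMaybe (`lookup` db)

-- Step 4: walk the instructions, producing format-neutral text.
format :: [Instruction] -> Entry -> String
format style entry = concatMap step style
  where
    step (Literal s) = s
    step (Field f)   = fromMaybe "" (lookup f entry)

-- Step 5: wrap the neutral text for a target format (no escaping here).
output :: Target -> String -> String
output HTML s = "<span class=\"citation\">" ++ s ++ "</span>"
output TeX  s = "{" ++ s ++ "}"

-- Steps 2 to 5 chained together; the citation list is the last argument.
process :: Target -> [Instruction] -> [(Key, Entry)] -> [Key] -> [String]
process target style db = map (output target . format style) . resolve db
```

Because each stage is an ordinary function, the whole run is just function
composition, which is the sense in which the XSLT approach carries over.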

The fullest implementation of CSL code is the Zotero implementation, which is
written in JavaScript.

On your syntax, I've not looked closely, but one thing that's a little
inflexible is the "p. 34" bit.

Original comment by bdar...@gmail.com on 11 Mar 2008 at 1:31

GoogleCodeExporter commented 8 years ago
> On your syntax, I've not looked closely, but one thing that's a little
> inflexible is the "p. 34" bit.

That's just an example (the "p." is not required): the parser will just return
a list of tuples made up of [(label,location)]. That is to say:

Citation test: [rossato2005: 26; caso1999; a multi word label: with quite a long
description of the place actually cited]_.

will produce:
Pandoc (Meta [] [] "")
[ Para [Str "Citation",Space,Str "test:",Space,Citation
[("rossato2005","26"),("caso1999",""),("a multi word label","with quite a long
description of the place actually cited")],Str "."] ]

So the Citation type (constructor) will be:
Citation [(String,String)]

That is, a list of tuples, each made up of a string (the bibliographic entry
id) and a string holding whatever comes next: "page=24" or "p. 25" or whatever
(I'd leave that to the bibliographic facility).
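
For readers who want to see how such a list of tuples could be obtained, here
is a small standalone sketch. It is not the parser from the attached patch
(which is not shown in this thread); the function names are hypothetical and
it uses plain string splitting, assuming ';' separates items and the first ':'
separates the label from the location.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd, stripPrefix)

-- Payload of the proposed Citation constructor: [(label, location)].
type CitationItems = [(String, String)]

-- Remove leading and trailing whitespace.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Split a string on every occurrence of a delimiter character.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn d rest

-- Parse the text between "[" and "]_" into (label, location) pairs.
parseItems :: String -> CitationItems
parseItems = map item . splitOn ';'
  where
    item raw = case break (== ':') raw of
      (label, [])      -> (trim label, "")
      (label, _ : loc) -> (trim label, trim loc)

-- Accept only text of the form "[...]_"; anything else is not a citation.
parseCitation :: String -> Maybe CitationItems
parseCitation s = do
  body  <- stripPrefix "[" s
  inner <- stripSuffix "]_" body
  pure (parseItems inner)
  where
    stripSuffix suf = fmap reverse . stripPrefix (reverse suf) . reverse
```

Applied to `[rossato2005: 26; caso1999]_`, `parseCitation` gives
`Just [("rossato2005","26"),("caso1999","")]`, matching the shape of the
native output above. Everything after the first colon is kept verbatim as the
location.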

> The fullest implementation of CSL code is the Zotero implementation

I'll take a look at Zotero. And yes, the one you are indicating is the general
idea of the algorithm. And yes, Haskell and XSLT are both functional
languages.... still Haskell has type classes and other goodies which could be
exploited to have a bit more abstraction. I hope to come up with a working
example soon just to clarify my idea a bit.

Original comment by andrea.rossato@gmail.com on 11 Mar 2008 at 3:45

GoogleCodeExporter commented 8 years ago
Citation styling is a PITA, and the work is really in translating from human
texts into something a computer can understand. The idea behind CSL is really
that one should only do this once, in a language-neutral format, and that
implementing support for it is really just a question of writing the generic
logic.

Also, FWIW, the abstractions I used to design CSL were: a) classes of styles
(author-date, note, etc.) and b) classes of data sources (monographs (aka
books), parts-in-monographs (aka chapters) and parts-in-serials (aka
articles)).

Those abstractions are now mostly buried, but they do show up in, for example,
the fallback behavior (e.g. how to handle typed resources that have no defined
templates).

If you have any questions about CSL, feel free to join the xbib dev list and
discuss them there. We've got versions for PHP, Ruby and Python in various
states of development, so I'm sure there'd be interest in different design
approaches.

Original comment by bdar...@gmail.com on 11 Mar 2008 at 3:59

GoogleCodeExporter commented 8 years ago
I think the proposed syntax looks pretty good.

It seems to me that we don't need the URLs, since pandoc already has
ways to link to URLs.  Also, I'd recommend using @ instead of : to
separate the citation reference from the location information, since
people might want to use : in their citation keys (I do).
I like the idea of keeping the location information freeform.  That's
how it is done in LaTeX.

So, to summarize, I'm in favor of:

[citekey1@citeloc1; citekey2@citeloc2]_

I don't know how to include styling options; I'd suggest that we start without
them and think about adding them later.

The hard part is going to be the backend library.  The nice thing is that
it would only have to produce pandoc data structures, and we
could get output in all the formats pandoc supports.  (bibtex for man
pages, anyone?)

Original comment by fiddloso...@gmail.com on 12 Mar 2008 at 10:15

GoogleCodeExporter commented 8 years ago
> The hard part is going to be the backend library.  The nice thing is that
> it would only have to produce pandoc data structures, and we
> could get output in all the formats pandoc supports.  (bibtex for man
> pages, anyone?)

Well, yes, this is going to be difficult, but I'm not going to start
implementing CSL right away (that would be too difficult). Instead I will try
to design an extensible library that can be used to implement a specific style
and/or a citation style. I have a general idea but I need to start coding
before sharing it.

Still, I don't think the external library should block this new feature. For
the time being we could use the new Citation type in the pandoc native format
to produce output for LaTeX and DocBook - formats that can then use their own
systems to generate the citations and the bibliography. Right?

If you agree on the syntax:

[citekey1@citeloc1; citekey2@citeloc2]_

I'm going to update the patch.

While I know what I must produce with the LaTeX writer, what is the expected
DocBook output?

For OpenDocument: I've had a look at Zotero. I have a different idea for my
library though, but the Zotero code will be something to carefully study!!

I've seen they have an OpenOffice extension and they seem to be able to insert
citations and a bibliography - and update them. The OpenDocument output of that
pandoc syntax should be compatible with Zotero, as far as possible (I don't
know if it is possible, yet). If Bruce - or someone else - can give me
instructions, great; otherwise I'll have a look at the Zotero OpenOffice
extension's source code.

Cheers

Original comment by andrea.rossato@gmail.com on 13 Mar 2008 at 9:46

GoogleCodeExporter commented 8 years ago
fiddlosopher: +1 on the "@" delimiter. I'd just leave room for the notion that
the value that follows it is not always going to be a page number. Maybe assume
page by default, but allow for some key-value notation?

andrea: DocBook 5 (and maybe even pre-5) has the new biblioref element for
linking.

http://www.docbook.org/tdg/en/html/biblioref.html

ODF: I would study what Zotero has done, but note that there are problems with
the implementation. In particular, they link between citation and source using
a local DB id, which means the citations can't be updated without a) Zotero,
and b) a specific database instance (!). This is not good.

But note: ODF 1.2 is getting an advanced new extensible metadata system based
on RDF. I expect it to be implemented for OOo 3.0. I'd like to move citation
implementations (including Zotero) to using that instead. There, you'd use a
new text:meta-field and encode both the citation and the source data in RDF,
and embed it all in the file package. That, it seems to me, would solve some of
the current problems with Zotero.

This shouldn't stop you now, of course; just warning you.

Finally, I don't think CSL is that hard to implement, but I can see you might
want to just work with, say, native Haskell structures at first. I'd guess the
easy way to incrementally do that is to just start with a simple flat hash of
bib data, and a nested list of citation formatting instructions. Once you can
do the generic formatting of that, you can then add in other kinds of
structures (that can then map to CSL): conditionals, substitution, groups, etc.
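
As a rough illustration of that starting point, the sketch below pairs a flat
association list of bib data with a nested instruction tree. The Instr type
and the render function are assumptions made up for this example; they are not
CSL constructs, and the rule that a group disappears when all of its children
are empty is just one plausible choice.

```haskell
import Data.List (intercalate)
import Data.Maybe (mapMaybe)

-- Flat "hash" of bib data for a single source.
type BibData = [(String, String)]

-- Nested formatting instructions (hypothetical, not actual CSL elements).
data Instr
  = Var String                          -- emit a field if it is present
  | Text String                         -- emit literal text
  | Group String String String [Instr]  -- prefix, delimiter, suffix, children

-- Render an instruction; Nothing means "produced no output".
render :: BibData -> Instr -> Maybe String
render bib (Var name) = lookup name bib
render _   (Text s)   = Just s
render bib (Group pre delim suf kids) =
  case mapMaybe (render bib) kids of
    []    -> Nothing
    parts -> Just (pre ++ intercalate delim parts ++ suf)
```

For example, with `[("author","Doe"),("year","1999")]` and the instruction
`Group "(" ", " ")" [Var "author", Var "year", Var "page"]`, `render` returns
`Just "(Doe, 1999)"`: the missing page is skipped without leaving a dangling
delimiter. Conditionals or substitution could later be added as further
constructors of the same type.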

Original comment by bdar...@gmail.com on 13 Mar 2008 at 1:18

GoogleCodeExporter commented 8 years ago
Here's a proposal (with some options) based on the above and further
discussions with bdarcus:

Syntax
======

Citations like [doe99@page 9; doe04]_, with keys like:

    _[doe99]: Doe, J. (1999). Some article...

Keys could include an optional "alternate label" like this:

    _[doe04][Doe 2004]: Doe, J. (2004). Other article...

Semantics
=========

Two cases:

  1.  Citation has a corresponding key (or keys) in the document -> simple citations
  2.  Citation does not have a corresponding key (or keys) -> flexible citations

1. Simple citations:  the citation is replaced with a bracketed label that is
hyperlinked to the corresponding bibliography entry.  The bibliography is
constructed from the citation keys and appears at the end of the document,
but before any endnotes.  If a label is desired for the bibliography, the
user can simply end the document with an appropriate section heading, e.g.

     # References

So, the citation example above would appear in HTML as:

    [<a href="#refs:doe99">doe99</a>, page 9]; [<a
    href="#refs:doe04">Doe 2004</a>]

    (and, at the end of the document)

    <ol id="refs">
        <li><a id="refs:doe99">[doe99]</a> Doe, J. (1999). Some article...
        <li><a id="refs:doe04">[Doe 2004]</a> Doe, J. (2004). Some other article...
    </ol>

Or perhaps this should be a <div id="refs"> with the <ol> inside it.
And correspondingly in other output formats.

Note the effect of the optional label on "doe04".

Output in latex and other formats would be similar. Note: This would
be a lot easier if pandoc had a uniform way of creating anchors inside
documents and links to them. Then this kind of citation could be created
in a way that was independent of output formats.

2. Flexible citations: no bibliography entry is generated. The citation is
transformed in a way that varies with the output format.  The idea is to
make it into something that can easily be extracted and processed by an
external tool.  In the case of LaTeX, the external tool would be bibtex,
so we'd transform our link into

    \cite[page 9]{doe99}, \cite{doe04}.

No "simple bibliography" would be generated, since that would be up to
the external tool.  In the case of HTML, we'd transform our link into
something like

    <citation key="doe99" pages="page 9" optlabel="Doe 1999" />

The document would then be postprocessed by XSLT or whatever, which could
strip out the citations, replace them with formatted citations with links,
and add an appropriately formatted bibliography.  The user could use any
external tool for this, so the choice of bibliography database, styling,
etc., would be all up to the user.  Something like
<http://www.johankool.nl/software/citeproc/> could be used, for example.

Other output formats would have other ways of specifying citations;
the aim in each case would be to have a syntax that is easily
recognizable, so an external tool can quickly identify the citations
and handle them.
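
A minimal sketch of what the "flexible" writers might do, assuming the parsed
citation is a list of (key, location) pairs as discussed earlier in this
thread. The function names are hypothetical, the optional label attribute is
left out, and LaTeX-special characters in keys or locations are not escaped.

```haskell
import Data.List (intercalate)

type CitationItem = (String, String)  -- (key, location); location may be empty

-- LaTeX: one \cite per item, with the location as the optional argument.
toLaTeX :: [CitationItem] -> String
toLaTeX = intercalate ", " . map one
  where
    one (key, "")  = "\\cite{" ++ key ++ "}"
    one (key, loc) = "\\cite[" ++ loc ++ "]{" ++ key ++ "}"

-- HTML: placeholder elements for an external post-processor to replace.
toHtmlPlaceholder :: [CitationItem] -> String
toHtmlPlaceholder = concatMap one
  where
    one (key, loc) =
      "<citation key=\"" ++ escape key ++ "\"" ++ pagesAttr loc ++ " />"
    pagesAttr ""  = ""
    pagesAttr loc = " pages=\"" ++ escape loc ++ "\""
    escape = concatMap esc
    esc '"' = "&quot;"
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc c   = [c]
```

Here `toLaTeX [("doe99","page 9"),("doe04","")]` produces
`\cite[page 9]{doe99}, \cite{doe04}`, the same text as the LaTeX example in
this proposal.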

Presumably the external tool could add a namespace to the citation
keys; this could be a command-line option.  This would be useful for
avoiding collisions with other identifiers in the document.

This approach seems preferable to having pandoc do all the work.  Note
that postprocessing makes more sense than preprocessing here, because
it is much easier to find citations in HTML or LaTeX output than in
the markdown input.  (To find a citation in markdown, you need to make
sure it's not in a code block, and the only way to do that in general
is to parse the markdown.)

-----

This approach, with simple and flexible citations, allows an author to
create a simple and self-standing document with internal bibliography, and
later switch over to using an external database, just by deleting the
bibliography entries.

An alternative approach, suggested by bdarcus, is to always generate
something that can be processed by an external tool, even when a bibliography
is specified in the document.  Advantages:  you could keep the internal
bibliography in the document, even while using the external tools.
Disadvantages:  bibtex itself could no longer be the external tool for
latex, since it doesn't strip out the simple bibliography, and it looks
for \cite{} commands, which wouldn't be used for simple citations.
Possibly there are similar disadvantages for some of the other output
formats.

Original comment by fiddloso...@gmail.com on 18 May 2008 at 2:29

GoogleCodeExporter commented 8 years ago
The main distinction between John's suggestion and mine is that his creates two
different types of citations on output: one designed (only) for further
processing, and one designed (only) for finished rendering.

I, by contrast, worry that the distinction will create problems, and that we
ought to be able to have our cake (have something to display for output) and
eat it too (that it be amenable to further processing).

There's no doubt that John's right, however, that my suggestion would on the
face of it create problems for the more typical BibTeX workflow. Perhaps one
way to deal with that is to assume default output without the bibliography in
LaTeX, but allow a switch to print it anyway.

WRT other formats, the main requirement for a post-processor to be able to
easily update the bibliography is that it needs to be able to identify it.

I'd also like to note that in some sense the discussion about the relation
between general output (of, for example, the bibliography) and the specifics of
citation encoding in that output are somewhat orthogonal. E.g. I'm not clear
why we would need non-standard custom markup in that output format, even if we
went with the distinction John otherwise wants to make?

Original comment by bdar...@gmail.com on 18 May 2008 at 6:58

GoogleCodeExporter commented 8 years ago
Just a reminder: legitimate bibtex keys will require some transformation
(already handled by docutils).

Specifically: suppose, for example, you want to use your keys as
the ``name`` or ``id`` attribute in an HTML or XML document.
The colon is allowed in HTML but is reserved in XML for namespace specification.
The plus-sign is not allowed in either specification.
XML allows a whole bunch of non-ASCII characters but HTML does not.
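
One possible transformation, sketched in Haskell. To be clear, this is a
hypothetical escaping scheme invented for illustration, not the one docutils
uses: ASCII letters, digits, '-' and '.' pass through, every other character
(including '_' itself, so the mapping stays reversible) becomes its hex code
point between underscores, and a fixed "ref-" prefix guarantees the result
starts with a letter.

```haskell
import Data.Char (isAlphaNum, isAscii, ord)
import Numeric (showHex)

-- Turn an arbitrary citation key into a conservative id/name token.
-- The "ref-" prefix and the _hex_ escape are assumptions of this sketch.
toId :: String -> String
toId key = "ref-" ++ concatMap enc key
  where
    enc c
      | isAscii c && isAlphaNum c = [c]
      | c `elem` "-."             = [c]
      | otherwise                 = "_" ++ showHex (ord c) "_"
```

For instance, `toId "doe:99+a"` gives `ref-doe_3a_99_2b_a`, which avoids both
the colon and the plus sign.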

Original comment by alan.is...@gmail.com on 18 May 2008 at 9:04

GoogleCodeExporter commented 8 years ago
Does the approach of bibstuff_ shed any light on this discussion?

The primary approach: extract citation references from a reST document, look
these up in a specified BibTeX database, and produce a formatted (style based)
reST bibliography (which can then be included in the original document).

With this approach the document is not processed until after the bibliography is
generated, at which point the bibliography is part of the document.

.. _bibstuff: http://code.google.com/p/bibstuff/

Original comment by alan.is...@gmail.com on 18 May 2008 at 9:27

GoogleCodeExporter commented 8 years ago
@alan: bibstuff uses a pre-processing approach, while this proposal involves
post-processing. So rather than processing the markdown/rest input, you process
the HTML (or whatever) output. Think a simple piped process.

I was first assuming a pre-process, but John suggested this as an easier
approach, and I'm tending to agree.

Aside: it may have other indirect advantages in that if the focus is on
processing of output like HTML, that might easily be extended to WYSIWYG web
environments using JavaScript.

Original comment by bdar...@gmail.com on 18 May 2008 at 9:34

GoogleCodeExporter commented 8 years ago
alan.isaac:  The difficulty I see with a pre-processing approach is that
an occurrence of '[doe99]_' in a markdown file might be inside a verbatim
text context (either inside `backticks` or in an indented code block), in
which case it should not be treated as a citation.  So in order to find the
citations in markdown source, you'd have to parse the markdown.  There are
no easy shortcuts here:  for example, you can't just scan for indented blocks,
since an indented block *might* be a list continuation.  And parsing markdown
isn't easy.
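
To illustrate the problem, here is a deliberately naive scanner, written as a
sketch with hypothetical names. It skips text inside single backticks, but it
knows nothing about indented code blocks, list continuations, or
double-backtick spans, which is exactly the limitation described above.

```haskell
-- Collect keys of the form [key]_ that occur outside `inline code`.
-- Naive on purpose: indented code blocks are NOT recognised.
citationKeys :: String -> [String]
citationKeys = go False
  where
    go :: Bool -> String -> [String]
    go _ [] = []
    -- A backtick toggles the "inside inline code" flag.
    go inCode ('`' : rest) = go (not inCode) rest
    -- Outside code, try to read "[key]_"; fall through if it does not match.
    go False ('[' : rest)
      | (key, ']' : '_' : after) <- break (== ']') rest
      , validKey key
      = key : go False after
    go inCode (_ : rest) = go inCode rest

    validKey key = not (null key) && all (`notElem` "[`\n") key
```

On "See [doe99]_ but not `[doe04]_` here." this returns only `["doe99"]`. On
an indented line such as "    [doe99]_" it still reports `doe99`, even though
that line may be a code block, and telling the two cases apart requires a
real markdown parse.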

From a quick glance at the BibStuff code, it looks like you might have a
similar problem.  Is BibStuff smart enough to exclude apparent citation
keys in verbatim contexts?  If so, how does it do that?

Original comment by fiddloso...@gmail.com on 18 May 2008 at 11:39

GoogleCodeExporter commented 8 years ago
fiddlosopher asks:
"Is BibStuff smart enough to exclude apparent citation keys in verbatim
contexts?"

No.  And nobody has ever requested it ...

Fixing it for inline literals should be fairly easy.  Fixing it for literal
blocks should be possible.  Here's the relevant file:
http://code.google.com/p/bibstuff/source/browse/trunk/ebnf_sp.py

(I note that this particular file lists me as author, which is true, but please
be aware that the creator and visionary for bibstuff is Dylan Schwilk; bibstuff
would exist without me but not without him.)

The problem with post processing is that while apparently simpler, iiuc, it
requires a separate post processor for each format, including any new supported
formats.  In contrast, if you want to call bibstuff's LaTeX influenced approach
"preprocessing" (which while true seems potentially misleading), the converter
just proceeds as normal.

Or so it looks to me at first glance.

Alan

PS I like some of the citation syntax suggestions and hope that reST will
sometime soon support parameterized citations.

Original comment by alan.is...@gmail.com on 19 May 2008 at 1:17

GoogleCodeExporter commented 8 years ago
alan:

Maybe I've given up too easily on pre-processing. You're right that it has some
advantages.  I'll look carefully at bibstuff's approach before making any final
decisions about how to proceed on this.

Original comment by fiddloso...@gmail.com on 19 May 2008 at 4:57

GoogleCodeExporter commented 8 years ago
Hi, I'm attaching a patch that implements John's proposal.

It is designed in such a way that citeproc-hs can take over every element of
citation formatting (right now I'm writing the pandoc/citeproc bits).

Obviously when simple citations and flexible citations are mixed in the same
document or in the same citation group the result will not be perfect (but I'm
still working on it).

Now only the HTML writer supports the new elements (Citation and Biblio).

When a citation has no key and citeproc-hs is not run - and now it is *not*
run - flexible citations will not appear.

This is a sample file. Try with:
pandoc -s -f markdown -t html test.txt

[doe99]. [This is a screen shot] of my desktop with [xmonad] and [xmobar].

Citations like [doe99@page 9; doe04; rossato300], or [doe99], vedi 
[rossato2006@page
10; antoniolli2000].

[doe04][Doe 2004]: Doe, J. (2004). Other article...                             

[doe99]: Doe, J. (1999). Some article...
[This is a screen shot]: 
http://haskell.org/sitewiki/images/a/ae/Arossato-config.png
[xmobar]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/xmobar
[xmonad]: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/xmonad

Original comment by andrea.rossato@gmail.com on 22 Jul 2008 at 5:31

Attachments:

GoogleCodeExporter commented 8 years ago
So, now that basic citation stuff has been pushed I'd suggest marking this
issue as fixed and starting a new one to deal with local styling modifiers,
bibliographic options and all that is needed to improve the citation support.

Bruce, if you have a second, could you open the next one and share with us your
opinions and suggestions on the subject?

Original comment by andrea.rossato@gmail.com on 13 Aug 2008 at 3:27

GoogleCodeExporter commented 8 years ago
Done. Issue 83.

Original comment by bdar...@gmail.com on 13 Aug 2008 at 6:56

GoogleCodeExporter commented 8 years ago
> bibtex itself could no longer be the external tool for
> latex, since it doesn't strip out the simple bibliography, and it looks
> for \cite{} commands, which wouldn't be used for simple citations.
> Possibly there are similar disadvantages for some of the other output
> formats.

Note that there is a \nocite{} command that does not insert text but writes the
citation key(s) to the aux file.

Original comment by c-koe...@gmx.net on 9 Sep 2008 at 8:46