aaren / pandoc-reference-filter

internal referencing pandoc filter

multi document cross references #1

Open aaren opened 10 years ago

aaren commented 10 years ago

Use case: writing a thesis with multiple chapters.

When creating a PDF, LaTeX takes care of all internal referencing. Multiple source documents are concatenated into a single continuous markdown document and then injected into the thesis LaTeX template. By the time LaTeX resolves the internal references, it is dealing with a single (long) file.

HTML output presents a problem: either we concatenate all of the documents into a single HTML page, or internal referencing is confined to each page. Neither is what we want: on the web it is preferable to have shorter pages, each covering a single distinct topic, with references that still work across pages.

What is needed is a mechanism to allow cross referencing with multiple source documents. I am going to present two possible solutions here:

  1. Multiple passes with a persistent file containing the references.
  2. A single pass with a directory of input files contained in the metadata.

aaren commented 10 years ago

Multiple passes, persistent references

The idea here is fairly simple. We modify the reference manager such that it can

  1. Consume references without modifying the input
  2. Append references to a file
  3. Take references at init

refman = ReferenceManager()
refman.consume(input)
refman.write(ref_file)

We do this for each input file until the references file (probably JSON) is fully populated. We then apply the filter, passing in a metadata flag that tells it a references file exists.

refman = ReferenceManager(ref_file)
...
# create jsonfilter with this refman
...
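
A minimal sketch of the two phases above, under stated assumptions: this `ReferenceManager` is a stand-in rather than the filter's actual class, the `LABEL` pattern (pandoc-style `{#id}` attributes), the extra `source` argument to `consume`, and the JSON layout are all hypothetical.

```python
import json
import os
import re
import tempfile

# Assumption: reference definitions look like pandoc attribute ids, e.g. {#fig:waves}.
LABEL = re.compile(r"\{#([\w:.-]+)")


class ReferenceManager:
    """Illustrative stand-in that collects reference labels across passes."""

    def __init__(self, ref_file=None):
        # Take references at init if a references file already exists.
        self.references = {}
        if ref_file and os.path.exists(ref_file):
            with open(ref_file) as f:
                self.references = json.load(f)

    def consume(self, text, source):
        # Record each label with its source document and a per-document
        # number; the input text itself is not modified.
        count = sum(1 for r in self.references.values() if r["source"] == source)
        for label in LABEL.findall(text):
            if label not in self.references:
                count += 1
                self.references[label] = {"source": source, "number": count}

    def write(self, ref_file):
        # Merge with what is already on disk so that each pass appends.
        merged = ReferenceManager(ref_file).references
        merged.update(self.references)
        with open(ref_file, "w") as f:
            json.dump(merged, f, indent=2)


# Phase 1: one pass per input file, populating the references file.
ref_file = os.path.join(tempfile.mkdtemp(), "refs.json")
documents = [
    ("01-intro.md", "![Waves](w.png){#fig:waves}"),
    ("02-methods.md", "![Tank](t.png){#fig:tank}"),
]
for source, text in documents:
    refman = ReferenceManager()
    refman.consume(text, source)
    refman.write(ref_file)

# Phase 2: the filter starts from the fully populated file.
refman = ReferenceManager(ref_file)
```

Storing the source document alongside each label is what later lets the filter decide whether a link stays within the page or points at another file.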

In a multi-chapter LaTeX document we might prefix the figures with the chapter number, e.g. 'see Figure 2.3' to reference Figure 3 in chapter 2. We can likely do that here by using the section counter.
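
One way the chapter-prefixed numbering might look, as a sketch only: `number_figures` is a hypothetical helper, and it assumes the chapter number of each figure is already known when the figures are visited in document order.

```python
from collections import defaultdict


def number_figures(figures):
    """Map each label to 'chapter.index'.

    `figures` is an iterable of (chapter_number, label) pairs in document order.
    """
    counters = defaultdict(int)  # one running figure counter per chapter
    numbers = {}
    for chapter, label in figures:
        counters[chapter] += 1
        numbers[label] = f"{chapter}.{counters[chapter]}"
    return numbers


numbers = number_figures([
    (1, "fig:overview"),
    (2, "fig:tank"),
    (2, "fig:probe"),
    (2, "fig:waves"),
])
```

Here `numbers["fig:waves"]` is `"2.3"`, the third figure in chapter 2.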

The link conversion within the same file is

#ref -> [ref_no](#ref)

To link to another file we need

#ref -> [ref_no](path/to/doc#ref)

path/to/doc is a relative link. We can define this relative to either the file it is used in or the top-level of all of the documents.

These relative paths need to be determined at some point and this is the difficult part of this method.
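
If every document's path is recorded relative to the top level, the link relative to the referencing file can be derived on demand. The sketch below assumes exactly that; `cross_link` is a hypothetical helper, and the paths are extensionless as in the conversion rules above.

```python
import posixpath


def cross_link(ref, ref_no, target_doc, current_doc):
    """Build a markdown link to `ref`, defined in `target_doc`, from `current_doc`.

    Both document paths are given relative to the top level of the document tree.
    """
    if target_doc == current_doc:
        # Same file: a bare fragment is enough.
        return f"[{ref_no}](#{ref})"
    start = posixpath.dirname(current_doc) or "."
    relative = posixpath.relpath(target_doc, start)
    return f"[{ref_no}]({relative}#{ref})"


down = cross_link("fig:waves", "2.3", "chapters/methods/01-waves", "chapters/01-intro")
up = cross_link("fig:overview", "1.1", "chapters/01-intro", "chapters/methods/02-blah")
same = cross_link("fig:overview", "1.1", "chapters/01-intro", "chapters/01-intro")
```

Keeping top-level paths in the references file and converting at link time means the stored data does not depend on which document is currently being processed.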

aaren commented 10 years ago

Single pass, config file

This method is slightly more complex but potentially more robust and powerful. It assumes that we are processing the documents one by one (like we would with Jekyll).

In a YAML config file we define all of the files that are in our document. Something like

inputs:
    - chapters/01-intro.md
    - chapters/02-litreview.md
    - chapters/03-methods.md
    - chapters/methods/01-waves.md
    - chapters/methods/02-blah.md

Or using tags for the different sections:

chapters:
    introduction: 01-intro.md
    litreview: 02-litreview.md
    methods:
        - 03-methods.md
        - methods/01-waves.md
        - methods/02-blah.md

Processing would then look roughly like this:

for each document
    consume_all_references
    for each reference
        if has prefix
            find file path in yaml
            create reference with file path

Actually I think we will need to do a first pass to consume all references from all docs and link them to file paths.

Link conversion is something like this:

#ref -> [ref_no](#ref)
#methods/other_ref -> [ref_no](chapters/methods/01-waves#other_ref)

We would need to constrain how paths are created in Jekyll.
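
A sketch of how a prefixed reference might be resolved against the tagged config, assuming a first pass has already recorded which labels each file defines. `CONFIG`, `files_for`, `resolve`, and `labels_by_file` are hypothetical names, and the code assumes the output path is simply the source path under `chapters/` with the extension dropped, which is the kind of path constraint mentioned above.

```python
import posixpath

# The tagged config from above, as it would look once parsed from YAML.
CONFIG = {
    "chapters": {
        "introduction": "01-intro.md",
        "litreview": "02-litreview.md",
        "methods": ["03-methods.md", "methods/01-waves.md", "methods/02-blah.md"],
    }
}


def files_for(tag, config=CONFIG):
    """Return the list of files registered under a tag."""
    entry = config["chapters"][tag]
    return [entry] if isinstance(entry, str) else list(entry)


def resolve(ref, ref_no, labels_by_file, config=CONFIG, root="chapters"):
    """Turn 'label' or 'tag/label' into a markdown link."""
    if "/" not in ref:
        return f"[{ref_no}](#{ref})"  # no prefix: same-page link
    tag, label = ref.split("/", 1)
    for path in files_for(tag, config):
        if label in labels_by_file.get(path, ()):
            target = posixpath.splitext(posixpath.join(root, path))[0]
            return f"[{ref_no}]({target}#{label})"
    raise KeyError(f"{label!r} is not defined in any file tagged {tag!r}")


# Result of the (assumed) first pass over all documents.
labels_by_file = {"methods/01-waves.md": {"other_ref"}}

local = resolve("ref", "1", labels_by_file)
remote = resolve("methods/other_ref", "2.3", labels_by_file)
```

The prefix narrows the search to the files under one tag, but the per-file label index is still needed to pick the right file within that tag.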

It would make some sense to write this as a Jekyll plugin, but then we would be confined to Jekyll and we'd have to write some Ruby.

Needs more thought.


We still need to consume all of the references from all documents specified in the inputs. For any given #ref we won't know which file it has been defined in so we'll have to look at all of them anyway.

We could have some caching as well, with a JSON file being updated with the reference content. We can then compare the modification times of the source files against the cache to see whether it needs updating. In the first instance let's not do this.
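
Should caching be added later, the modification-time check could be as small as the sketch below. `cache_is_stale` is a hypothetical helper; the file names and timestamps are made up for the demonstration.

```python
import os
import tempfile


def cache_is_stale(cache_file, sources):
    """True if the cache is missing or older than any of the source files."""
    if not os.path.exists(cache_file):
        return True
    cache_mtime = os.path.getmtime(cache_file)
    return any(os.path.getmtime(src) > cache_mtime for src in sources)


# Demonstration with explicit timestamps so the outcome is deterministic.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "01-intro.md")
cache = os.path.join(tmp, "refs.json")

missing = cache_is_stale(cache, [])   # no cache file written yet

for path in (src, cache):
    open(path, "w").close()

os.utime(cache, (1000, 1000))
os.utime(src, (2000, 2000))           # source edited after the cache was written
stale = cache_is_stale(cache, [src])

os.utime(cache, (3000, 3000))         # cache rebuilt after the edit
fresh = not cache_is_stale(cache, [src])
```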

One thing that we'll have to resolve is how we get the metadata into pandoc. On the command line we would just pass it in along with the file:

pandoc meta.yaml source.md > output.html

but the jekyll-pandoc plugin doesn't offer a means to have multiple input files. One option is to declare all of the possible inputs in the YAML at the top of each file, a bit like includes for references. This would have the benefit of not needing to scan the entire inputs tree for each file: we only scan the inputs that are specified.
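
If the inputs are declared in each file's YAML header, the filter can read them from the document metadata it receives. The sketch below assumes the metadata key is called `inputs` and that the JSON AST uses the object layout of recent pandoc versions (a top-level `meta` mapping of `{"t": ..., "c": ...}` nodes); older pandoc releases used a different layout. `stringify` and `declared_inputs` are hypothetical helpers that handle only the node types needed here.

```python
def stringify(node):
    """Flatten a small subset of pandoc metadata nodes into plain text."""
    kind = node["t"]
    if kind in ("Str", "MetaString"):
        return node["c"]
    if kind in ("Space", "SoftBreak"):
        return " "
    if kind == "MetaInlines":
        return "".join(stringify(child) for child in node["c"])
    raise ValueError(f"unsupported metadata node: {kind}")


def declared_inputs(doc):
    """Return the list of input paths declared under the 'inputs' metadata key."""
    value = doc.get("meta", {}).get("inputs")
    if value is None:
        return []
    if value["t"] == "MetaList":
        return [stringify(item) for item in value["c"]]
    return [stringify(value)]  # a single path given as a scalar


# Hand-built AST fragment standing in for what pandoc would pass to the filter.
doc = {
    "meta": {
        "inputs": {
            "t": "MetaList",
            "c": [
                {"t": "MetaInlines", "c": [{"t": "Str", "c": "chapters/01-intro.md"}]},
                {"t": "MetaInlines", "c": [{"t": "Str", "c": "chapters/methods/01-waves.md"}]},
            ],
        }
    },
    "blocks": [],
}

inputs = declared_inputs(doc)
```

Only the files listed in `inputs` would then need to be scanned for reference definitions when processing this document.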

Changes needed: