bdarcus / csln

Reimagining CSL
Mozilla Public License 2.0

feat: add citation support #129

Closed bdarcus closed 9 months ago

bdarcus commented 1 year ago

It seems right to add this key piece of functionality next.

I will likely add only author-date initially, since that's all I really use myself. Even so, I will design it all along the same lines as 1.0.

Also, I will initially support only a more abstract import format, not actual documents. I'm still waiting on djot support for citations.

I thought I had this working, but it turns out not; process_citations is currently returning empty vectors.

  "citations": [
    [],
    [],
    []
  ]

Digging a bit more, I think I may need to rethink and refactor the rendering code to account for the citations.



I'm not sure how best to do this, but I probably need to look at https://github.com/jgm/citeproc and https://github.com/zotero/citeproc-rs, though I have a hard time understanding the code in many places.

This could be the citation definition, but it doesn't seem right.

https://github.com/zotero/citeproc-rs/blob/2ab195a1e6f84f0ff284813ece61dc62096abbfe/crates/pandoc-types/src/definition.rs#L222

See, though, the design document. It takes a parallel approach: "Pass 1" creates different representations of the intermediate output, which can then be resolved in "Pass 2."

Haskell citeproc

Here's the Haskell processor type, which makes more sense to me.

https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc/Types.hs#L310
https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc/Types.hs#L263

data Citation a =
  Citation { citationId         :: Maybe Text
           , citationNoteNumber :: Maybe Int
           , citationItems      :: [CitationItem a] }

data CitationItem a =
  CitationItem
  { citationItemId             :: ItemId
  , citationItemLabel          :: Maybe Text
  , citationItemLocator        :: Maybe Text
  , citationItemType           :: CitationItemType
  , citationItemPrefix         :: Maybe a
  , citationItemSuffix         :: Maybe a
  , citationItemData           :: Maybe (Reference a)
  }

data CitationItemType =
    AuthorOnly      -- ^ e.g., Smith
  | SuppressAuthor  -- ^ e.g., (2000, p. 30)
  | NormalCite      -- ^ e.g., (Smith 2000, p. 30)
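
For comparison, here is a rough Rust analogue of those Haskell types. It is only a sketch, not csln's actual model: the names `CitationMode`, `ref_id`, and `mode` are made up for illustration, prefix and suffix are plain strings rather than a polymorphic output type, and the embedded reference data (`citationItemData`) is left out.

```rust
/// How a single cited item should be rendered (mirrors `CitationItemType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CitationMode {
    /// e.g., Smith
    AuthorOnly,
    /// e.g., (2000, p. 30)
    SuppressAuthor,
    /// e.g., (Smith 2000, p. 30)
    #[default]
    Normal,
}

/// One reference within a citation, plus its local context.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct CitationItem {
    /// Id of the cited reference (hypothetical field name).
    pub ref_id: String,
    pub label: Option<String>,
    pub locator: Option<String>,
    pub mode: CitationMode,
    pub prefix: Option<String>,
    pub suffix: Option<String>,
}

/// One or more items cited together at a single point in a document.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Citation {
    pub id: Option<String>,
    pub note_number: Option<u32>,
    pub items: Vec<CitationItem>,
}
```

Deriving `Default` keeps construction short: only the fields that differ from a normal, unadorned cite need to be spelled out.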

Here's the high-level processing logic, which is basically what I am planning here.

https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc.hs#L20C23-L20C23

-- | Process a list of 'Citation's, producing formatted citations
-- and a bibliography according to the rules of a CSL 'Style'.
-- If a 'Lang' is specified, override the style's default locale.
-- To obtain a 'Style' from an XML stylesheet, use
-- 'parseStyle' from "Citeproc.Style".
citeproc :: CiteprocOutput a
         => CiteprocOptions    -- ^ Rendering options
         -> Style a            -- ^ Parsed CSL style
         -> Maybe Lang         -- ^ Overrides default locale for style
         -> [Reference a]      -- ^ List of references (bibliographic data)
         -> [Citation a]       -- ^ List of citations to process
         -> Result a

Question: how are rendered citations inserted in the document?

Disambiguation

... I also need to figure out where and how disambiguation fits in this.

https://github.com/jgm/citeproc/blob/6969ce218d0dfdee29d54cce674c7f9cef4b4f0a/src/Citeproc/Eval.hs#L408

I'm hoping other aspects of this design will make this part easier, but I haven't yet figured it out.

My initial thoughts:

The main aspects of disambiguation I need to focus on first are (author) names, and years.

The latter is easy because in practice it's global. So I've already implemented it.
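
To illustrate what "global" can mean here, the following is a minimal sketch of year-suffix assignment over the whole reference list. It is not the csln implementation; the `Reference` struct, the single `author` string used as a grouping key, and the function names are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Minimal reference data for this sketch (hypothetical, not csln's model).
pub struct Reference {
    pub id: String,
    pub author: String,
    pub year: i32,
}

/// Returns a map from reference id to a year suffix ("a", "b", ...) for
/// every reference that shares its author and year with another one.
/// Input order decides which reference gets which letter.
pub fn year_suffixes(refs: &[Reference]) -> HashMap<String, String> {
    // Group ids by (author, year), keeping input order inside each group.
    let mut groups: HashMap<(&str, i32), Vec<&str>> = HashMap::new();
    for r in refs {
        groups
            .entry((r.author.as_str(), r.year))
            .or_default()
            .push(r.id.as_str());
    }

    // Only groups with more than one member need a suffix.
    let mut suffixes = HashMap::new();
    for ids in groups.values().filter(|ids| ids.len() > 1) {
        for (i, id) in ids.iter().enumerate() {
            suffixes.insert(id.to_string(), suffix_for(i));
        }
    }
    suffixes
}

/// 0 -> "a", 1 -> "b", ..., 25 -> "z", 26 -> "aa", 27 -> "ab", ...
pub fn suffix_for(mut index: usize) -> String {
    let mut letters = Vec::new();
    loop {
        letters.push((b'a' + (index % 26) as u8) as char);
        if index < 26 {
            break;
        }
        index = index / 26 - 1;
    }
    letters.iter().rev().collect()
}
```

Because the grouping depends only on the reference list, the result can be computed once and reused for both citations and the bibliography.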

The former is the tricky piece, since typically it applies to citations, and not bibliographies (I guess unless a style requires a given name initial to be expanded?).

I suppose one option would be to follow the citeproc-rs approach: somehow generate alternate name representations on first pass, and disambiguate them separately.

Maybe I could create a hash-table for author names, something vaguely like:

pub struct Author {
    pub name: String,
    pub disambiguate_given: Vec<String>,
    pub role: ContributorRole,
    pub substitute: bool,
}

Regardless of the details, the idea would be to look up the right name, with its disambiguation string, in that hash map.
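
One way the lookup idea could work, as a sketch under simplifying assumptions: names are split into `family` and `given`, the table is keyed on family name, and disambiguation tries an initial before the full given name. `Name`, `NameIndex`, and `short_form` are hypothetical names, and this is a different shape from the `Author` struct above.

```rust
use std::collections::{BTreeSet, HashMap};

/// A personal name split into parts (hypothetical shape for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Name {
    pub family: String,
    pub given: String,
}

/// Lookup table from family name to every distinct given name seen with it.
pub struct NameIndex {
    by_family: HashMap<String, BTreeSet<String>>,
}

impl NameIndex {
    /// Build the table once from all names in the cited references.
    pub fn new(names: &[Name]) -> Self {
        let mut by_family: HashMap<String, BTreeSet<String>> = HashMap::new();
        for n in names {
            by_family
                .entry(n.family.clone())
                .or_default()
                .insert(n.given.clone());
        }
        NameIndex { by_family }
    }

    /// Shortest form that is unique among the indexed names:
    /// "Smith", then "J. Smith", then "John Smith".
    pub fn short_form(&self, name: &Name) -> String {
        let givens = match self.by_family.get(&name.family) {
            Some(g) if g.len() > 1 => g,
            // Family name alone is already unique (or unknown to the index).
            _ => return name.family.clone(),
        };
        let initial = name.given.chars().next();
        let same_initial = givens
            .iter()
            .filter(|g| g.chars().next() == initial)
            .count();
        match initial {
            // The initial is enough to tell this person apart.
            Some(i) if same_initial == 1 => format!("{}. {}", i, name.family),
            // Initials collide, so fall back to the full given name.
            Some(_) => format!("{} {}", name.given, name.family),
            // No given name available; nothing more can be added.
            None => name.family.clone(),
        }
    }
}
```

The table is built in a first pass over the references, and citation rendering then only performs lookups, which keeps name disambiguation out of the bibliography path unless a style asks for it.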

bdarcus commented 11 months ago

@jgm - can I ask you a high-level question about citeproc and pandoc integration for citation rendering?

You render citations independently of the document, but how and when do you insert them into it?

jgm commented 11 months ago

@bdarcus - after the input format is parsed to a Pandoc AST, we apply

processCitations  :: PandocMonad m => Pandoc -> m Pandoc

which transforms the Pandoc AST by (1) replacing each citation with the formatted citation and (2) adding a bibliography. The code is in Text.Pandoc.Citeproc.

The transformed AST can then be rendered by any of the pandoc writers. Small complication: for display details, we use special Span and Div elements. These will be ignored by most writers, but for a few writers we've implemented code that responds to them by doing the proper formatting (e.g. docx, latex, html).
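
The same two steps can be sketched in Rust against a toy document model. This is not pandoc's AST or API: `Inline`, `Rendered`, and `insert_citations` are hypothetical, citations are identified by their position in the document, and the bibliography is simply appended as text.

```rust
/// A toy inline element: plain text, or a placeholder that points at the
/// n-th citation in the document (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
pub enum Inline {
    Text(String),
    Cite(usize),
}

/// Processor output: one rendered string per citation, plus the
/// bibliography entries.
pub struct Rendered {
    pub citations: Vec<String>,
    pub bibliography: Vec<String>,
}

/// (1) Replace each citation placeholder with its rendered text, then
/// (2) append the bibliography entries at the end of the document.
pub fn insert_citations(doc: Vec<Inline>, out: &Rendered) -> Vec<Inline> {
    let mut result: Vec<Inline> = doc
        .into_iter()
        .map(|inline| match inline {
            Inline::Cite(i) => match out.citations.get(i) {
                Some(text) => Inline::Text(text.clone()),
                // Leave unresolved citations in place rather than dropping them.
                None => Inline::Cite(i),
            },
            other => other,
        })
        .collect();
    result.extend(out.bibliography.iter().cloned().map(Inline::Text));
    result
}
```

Since the result is an ordinary document again, any writer that already handles `Inline` can render it without knowing about citations.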

bdarcus commented 11 months ago

Thanks @jgm!

I have a hard time reading Haskell code. Am I correct that the output you use from citeproc is basically the same as the server JSON: an array of citation strings?

jgm commented 11 months ago

My Haskell citeproc library uses polymorphic types.

-- | Process a list of 'Citation's, producing formatted citations
-- and a bibliography according to the rules of a CSL 'Style'.
-- If a 'Lang' is specified, override the style's default locale.
-- To obtain a 'Style' from an XML stylesheet, use
-- 'parseStyle' from "Citeproc.Style".
citeproc :: CiteprocOutput a
         => CiteprocOptions    -- ^ Rendering options
         -> Style a            -- ^ Parsed CSL style
         -> Maybe Lang         -- ^ Overrides default locale for style
         -> [Reference a]      -- ^ List of references (bibliographic data)
         -> [Citation a]       -- ^ List of citations to process
         -> Result a

For pandoc we use a = Inlines, so that the contents are pandoc Inline sequences, not raw strings. The typeclass instance for this is defined in Citeproc.Pandoc:

instance CiteprocOutput Inlines where
...

We also have an instance for HTML, which we use for the standard citeproc test suite.

The advantage of this is that when we're using pandoc, we can define bibliography entries with any of the formatting pandoc provides (e.g. math), and this will be carried through all the way to the result.
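
A Rust counterpart of this design would be a trait over the output type. The sketch below is only an illustration of the idea, with a much smaller interface than the Haskell `CiteprocOutput` class; the trait `Output`, its three functions, the `Html` wrapper, and `render_entry` are all invented for the example.

```rust
/// Output formats implement this trait (hypothetical, minimal interface).
pub trait Output: Sized {
    /// Wrap plain text, escaping it if the format requires that.
    fn text(s: &str) -> Self;
    /// Mark a fragment as emphasized.
    fn emph(inner: Self) -> Self;
    /// Join fragments in order.
    fn seq(parts: Vec<Self>) -> Self;
}

/// Plain strings, using `*...*` to mark emphasis.
impl Output for String {
    fn text(s: &str) -> Self {
        s.to_string()
    }
    fn emph(inner: Self) -> Self {
        format!("*{}*", inner)
    }
    fn seq(parts: Vec<Self>) -> Self {
        parts.concat()
    }
}

/// A tiny HTML output type.
#[derive(Debug, PartialEq)]
pub struct Html(pub String);

impl Output for Html {
    fn text(s: &str) -> Self {
        Html(s.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;"))
    }
    fn emph(inner: Self) -> Self {
        Html(format!("<i>{}</i>", inner.0))
    }
    fn seq(parts: Vec<Self>) -> Self {
        Html(parts.into_iter().map(|p| p.0).collect())
    }
}

/// Rendering logic written once, for any output type.
pub fn render_entry<O: Output>(author: &str, year: &str, title: &str) -> O {
    O::seq(vec![
        O::text(author),
        O::text(" ("),
        O::text(year),
        O::text("). "),
        O::emph(O::text(title)),
        O::text("."),
    ])
}
```

The caller picks the format through the type annotation, for example `let s: String = render_entry("Smith", "2000", "A Title");`, while the rendering function itself never mentions strings or HTML.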

bdarcus commented 11 months ago

I only need to implement this to a proof-of-concept state ATM, so my plan is just to return something similar to the citeproc server JSON.

{
  "citations": [ ... ],
  "bibliography": [ ... ]
}

I was just confused how one would replace the citation input with that output, but I guess it doesn't matter too much now.

> The advantage of this is that when we're using pandoc, we can define bibliography entries with any of the formatting pandoc provides (e.g. math), and this will be carried through all the way to the result.

Right. I'm thinking of using djot for this somehow, if and when it gets citations.