bdarcus / csln

Reimagining CSL
Mozilla Public License 2.0

Borrow from biblatex #64

Open bdarcus opened 1 year ago

bdarcus commented 1 year ago

See also #61

Beyond CSL, the other excellent package first released around the same time, and similarly ambitious, is biblatex.

It has struck me that its design has some similarities to what I'm doing here.

Consider their long list of completely flat parameters, aka options (and see the table I've attached below for how they map to scopes):

image

They've also been ahead of us on EDTF, and it looks like they've already figured it out.

image

biblatex-options-table.pdf

plk commented 1 year ago

For sorting and dates in biblatex the real work is done by the backend biber, mostly because such things are just too hard/messy/complex to do in TeX. biber is written in Perl, is distributed as (something indistinguishable to users from) a binary, and replaces the bibtex binary in the general workflow.

Sorting

Sorting is, as you probably know, fully configurable via "sorting templates" and there are several predefined for common patterns (see the easily readable definitions in [biblatex.def](https://github.com/plk/biblatex/blob/dev/tex/latex/biblatex/biblatex.def)). As to your question, the general defaults work well and they can be seen in the file linked above - the original biblatex author did a good job of thinking about default sorting templates that most people use. The real work has been in allowing sorting to be highly configurable, particularly in terms of Unicode language tailoring. Specifically, what we had to do was:

The biblatex PDF doc has a lot of examples of all of these.

Dates

This is all done with a custom EDTF parsing module in biber which splits up such dates into their components and makes them available to biblatex in the .bbl file. I expect you will need a similar thing - a parser which handles the parts of EDTF you plan to support and splits them into granular fields which can be used later in constructing some output. This isn't so difficult really - I suspect there are all sorts of options for this in various languages these days. Then I added a load of options to control various output elements like christian vs secular formats, Julian output and localised versions of the seasons etc. There are just two sides - the input parsing into the granular info and then the output of this granular info combined with various output options and localisations. The biblatex/biber implementation is fairly comprehensive and it would likely be convenient for you to just copy its feature set and date output options.
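The input side described above (parse, then split into granular fields) could be sketched in Rust roughly as follows. This is only an illustration, not biber's parser or the parser csln uses: `DateParts` and `parse_edtf_subset` are hypothetical names, and the subset covered (`YYYY`, `YYYY-MM`, `YYYY-MM-DD` plus one trailing `?`, `~` or `%` qualifier) is an assumption made to keep the example short.

```rust
/// Granular date parts, as a parser might hand them to the output stage.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub struct DateParts {
    pub year: i32,
    pub month: Option<u8>,
    pub day: Option<u8>,
    pub uncertain: bool,   // set by a trailing '?' or '%'
    pub approximate: bool, // set by a trailing '~' or '%'
}

/// Parse a small subset of EDTF: `YYYY`, `YYYY-MM`, `YYYY-MM-DD`,
/// optionally followed by one qualifier. Negative years, intervals,
/// times and seasons (month values 21-24) are outside this sketch
/// and yield `None`.
pub fn parse_edtf_subset(input: &str) -> Option<DateParts> {
    // The qualifiers are ASCII, so slicing off one byte is safe here.
    let (body, uncertain, approximate) = match input.chars().last()? {
        '?' => (&input[..input.len() - 1], true, false),
        '~' => (&input[..input.len() - 1], false, true),
        '%' => (&input[..input.len() - 1], true, true),
        _ => (input, false, false),
    };
    let mut parts = body.split('-');
    let year: i32 = parts.next()?.parse().ok()?;
    let month = match parts.next() {
        Some(m) => Some(m.parse::<u8>().ok().filter(|m| (1..=12).contains(m))?),
        None => None,
    };
    let day = match parts.next() {
        Some(d) => Some(d.parse::<u8>().ok().filter(|d| (1..=31).contains(d))?),
        None => None,
    };
    if parts.next().is_some() {
        return None; // extra components are not part of this subset
    }
    Some(DateParts { year, month, day, uncertain, approximate })
}
```

The output side then only ever sees `DateParts`, so formatting options and localisation never need to know what the input string looked like.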

Feel free to ask any other questions - it's a few years since I implemented most of this but I'm still actively supporting it all and so am fairly on top of it still.

bdarcus commented 1 year ago

Thanks for the thorough reply @plk!

Before I explain a bit more, the project is written in Rust, and JSON schemas are generated from that model.

So the examples I'm using for illustration are YAML, since that's a valid format in this context.

On dates, I do have an EDTF parser I'm using, so the input end is covered.

On the output end, I am currently using configuration options drawn from the JavaScript Intl.DateTimeFormat. In YAML, it looks like:

dates:
  month: long

In general, these "options" are defined globally in a style, and can be overridden in the local context of a citation or bibliography.
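One way to picture the global-with-local-override behaviour is to make every option field optional and let the local value win. This is a minimal sketch under that assumption; `DateOptions`, `MonthFormat` and `merged_over` are hypothetical names and not necessarily how csln models it.

```rust
/// Hypothetical month formats, loosely following the YAML `month: long`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MonthFormat {
    Long,
    Short,
    Numeric,
}

/// Date options where `None` means "not set at this level".
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct DateOptions {
    pub month: Option<MonthFormat>,
}

impl DateOptions {
    /// Local values win; anything left unset falls back to the global value.
    pub fn merged_over(self, global: DateOptions) -> DateOptions {
        DateOptions {
            month: self.month.or(global.month),
        }
    }
}
```

A citation or bibliography context would call `local.merged_over(global)` once and render with the result.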

And finally, in the templates, template "components" have a "form" property so one can do this:

- date: issued
  form: month

I think that part is sound. Do you agree?

I hadn't, however, thought much about extended dates and times.

From reviewing your manual it looks like it would be pretty easy, as I think you're suggesting, for me to add a few options on dates for things like circa and uncertain dates, and things like time-zones?

I guess some of that may need to be localized as well?

On sorting, I'll look more closely at the sorting templates.

bdarcus commented 1 year ago

One other, specific, question: why the presort field?

EDIT: oh, I see you answered that. I guess normally it would be empty, but you sometimes need it?

A related question is when you need sortname?

In my in-progress code here, I'm defining some behavior that can be called like this:

author.key()

So a few different data types (dates, contributors, titles) will share that same trait.

In that case, it will return a string just for sorting, like "doe-jane:smith-john".
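A rough sketch of that shared behaviour, assuming a trait with a single `key()` method. The trait name `SortKeyed` and the `Name`, `Contributors` and `Title` types are hypothetical stand-ins, and plain lowercasing is used in place of real Unicode collation.

```rust
/// Hypothetical trait shared by the sortable data types.
pub trait SortKeyed {
    /// A string used only for sorting, never for display.
    fn key(&self) -> String;
}

pub struct Name {
    pub family: String,
    pub given: String,
}

pub struct Contributors(pub Vec<Name>);

pub struct Title(pub String);

impl SortKeyed for Name {
    fn key(&self) -> String {
        format!("{}-{}", self.family.to_lowercase(), self.given.to_lowercase())
    }
}

impl SortKeyed for Contributors {
    fn key(&self) -> String {
        // Join the per-name keys with ':' so the whole list sorts name by name.
        self.0.iter().map(|n| n.key()).collect::<Vec<_>>().join(":")
    }
}

impl SortKeyed for Title {
    fn key(&self) -> String {
        self.0.to_lowercase()
    }
}
```

With this, a list holding Jane Doe and John Smith produces the `doe-jane:smith-john` key mentioned above.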

plk commented 1 year ago

presort is usually empty (or equivalently, the same for every entry). To be honest, there is less need of it now we have completely flexible sorting templates and you can sort on almost any set of fields. It dates from when sorting was more primitive but is still there for backwards compat.

The output form looks fine - you just need some options for formats like short/numeric/long months ("Jan." vs "January" vs "1") and then just some options to control whether times are output at all (we don't by default as there aren't that many styles that need that), circa and uncertain markers etc, as you say. These are fairly simple to implement - they are just on/off and when on, perhaps some format for them. If you have localisations, you can do what we did and make just about everything localised so that the circa/uncertain etc. markers are localised.
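The on/off-plus-format idea for months and circa/uncertain markers might be sketched like this. The names (`DateOutput`, `format_month_year`) are hypothetical, and the hard-coded English month names and markers are stand-ins for what a localisation layer would supply.

```rust
#[derive(Debug, Clone, Copy)]
pub enum MonthFormat {
    Long,
    Short,
    Numeric,
}

/// Output options; `None` for a marker means that marker is switched off.
#[derive(Debug, Clone, Copy)]
pub struct DateOutput {
    pub month: MonthFormat,
    pub circa_marker: Option<&'static str>,
    pub uncertain_marker: Option<&'static str>,
}

// English stand-ins; a real implementation would look these up per locale.
const LONG: [&str; 12] = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
];
const SHORT: [&str; 12] = [
    "Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
    "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.",
];

/// Render "month year", adding the markers only when the date carries the
/// flag AND the corresponding marker is enabled in the options.
pub fn format_month_year(
    year: i32,
    month: u8,
    approximate: bool,
    uncertain: bool,
    opts: &DateOutput,
) -> Option<String> {
    if !(1..=12).contains(&month) {
        return None;
    }
    let idx = usize::from(month - 1);
    let month_text = match opts.month {
        MonthFormat::Long => LONG[idx].to_string(),
        MonthFormat::Short => SHORT[idx].to_string(),
        MonthFormat::Numeric => month.to_string(),
    };
    let mut out = String::new();
    if let (true, Some(marker)) = (approximate, opts.circa_marker) {
        out.push_str(marker);
        out.push(' ');
    }
    out.push_str(&format!("{month_text} {year}"));
    if let (true, Some(marker)) = (uncertain, opts.uncertain_marker) {
        out.push_str(marker);
    }
    Some(out)
}
```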

sortname again is more of a legacy thing - before we had customisable sorting templates, the sorting key for a name was hard-coded (basically lastname + initials of first names etc.) and this didn't always get the sorting people wanted so it could be overridden with sortname. These days, you can specify name sorting templates in a fully customisable manner so it's less needed but still used sometimes as a quick fix in case you don't want to define a whole new name sorting template.

There have been requests to rewrite biber in Rust which I like the idea of but it's a massive job ...

plk commented 1 year ago

Honestly, I would generate sorting keys from names via a template and set a sensible default - you'll need to use templates eventually when non-Western language users start to request it. On the other hand, it's not so hard to retrofit, I found.
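As a rough illustration of "template plus sensible default" for name keys, the sketch below builds the key from an ordered list of name parts instead of hard-coding family-then-given. `NamePart`, `NameKeyTemplate` and `key_for` are hypothetical names, and the part set is far smaller than what biber's name key templates support.

```rust
/// Parts of a name a template can pull from (a deliberately tiny set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NamePart {
    Family,
    Given,
    GivenInitials,
}

pub struct Name {
    pub family: String,
    pub given: String,
}

/// An ordered list of parts; the order decides what the key sorts on first.
pub struct NameKeyTemplate(pub Vec<NamePart>);

impl Default for NameKeyTemplate {
    /// Assumed default: family name first, then the full given name.
    fn default() -> Self {
        NameKeyTemplate(vec![NamePart::Family, NamePart::Given])
    }
}

impl NameKeyTemplate {
    pub fn key_for(&self, name: &Name) -> String {
        self.0
            .iter()
            .map(|part| match part {
                NamePart::Family => name.family.to_lowercase(),
                NamePart::Given => name.given.to_lowercase(),
                // First character of each whitespace-separated given name.
                NamePart::GivenInitials => name
                    .given
                    .split_whitespace()
                    .filter_map(|word| word.chars().next())
                    .flat_map(char::to_lowercase)
                    .collect(),
            })
            .collect::<Vec<String>>()
            .join("-")
    }
}
```

A style that sorts given-name-first would only swap the order of the parts in its template; the key-building code stays untouched.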

bdarcus commented 1 year ago

... you just need some options for formats like short/numeric/long months.

Currently, those are only defined in options, but those can be set either globally or locally.

My assumption there is one wouldn't need a long month and a short one in the same bibliography?

But it should be easy to extend if for some reason my assumption is wrong.

There have been requests to rewrite biber in Rust which I like the idea of but it's a massive job ...

Right. And in a batch-oriented context like tex, probably not worth the hassle?

With this project, I started out with typescript, but switched to Rust because, while much more difficult in some ways, it makes other things I need much simpler; namely schema generation and serializing and deserializing that data.

Also, I just think we need a CSL-ish processor that can work well in different contexts, including the web, and desktop GUIs.

While the compiler can be really annoying, things usually just work when I make it happy!

Honestly, I would generate sorting keys from names via a template and set a sensible default - you'll need to use templates eventually when non-Western language users start to request it. On the other hand, it's not so hard to retrofit, I found.

With CSL 1.0, we kind of took the approach to make some things fairly complicated and flexible upfront, not really knowing what we needed.

So not only sorting is configured via templates (what we call there "macros"), but so are author substitutions, contributor role labeling, date formatting, etc.

With this project, I'm trying to simplify wherever possible, moving much of that configuration to these options.

But multi-lingual is definitely a goal here; am just trying to get there progressively.

It may be the method for the keys and the sorting takes parameters to handle some of that, which can in turn be set in style options.

plk commented 1 year ago

Currently, those are only defined in options, but those can be set either globally or locally. My assumption there is one wouldn't need a long month and a short one in the same bibliography?

Right, those options are basically global in biblatex as we don't allow mixing bibliography styles in the same document - it's just too complicated and nobody really needs that.

Right. And in a batch-oriented context like tex, probably not worth the hassle?

Not really although people do complain about biber not being as fast as bibtex which is written in C and does quite literally about 10% of what biber does ...

With this project, I started out with typescript, but switched to Rust because, while much more difficult in some ways, it makes other things I need much simpler; namely schema generation and serializing and deserializing that data.

I looked at what would be needed in Rust for biber and the things I was concerned about were a good bibtex format parser and full CLDR Unicode support. The former didn't really seem to be satisfied and the latter was all ICU-based which is good but quite complex.

With this project, I'm trying to simplify wherever possible, moving much of that configuration to these options.

I found that every time I tried to implement a simple option, in the end I had to extend it to be a fully configurable interface. Still basically an "option" but a complex one that's defined using TeX macros whose sole job is to output a complex option in XML in the .bcf file that contains all the configuration that biber needs.

But multi-lingual is definitely a goal here; am just trying to get there progressively.

It's been an issue for biblatex for some time. I have a 'multiscript' version of both biblatex and biber which is designed to handle bibliographies in multiple scripts/languages and be backwards compatible. It's "beta" still and is slower, as an awful lot of internals had to be altered to cope with the more complex internal data structures for multiscript bibliographies, but none of the sorting/dates etc. really changed for this, mostly just the input format and the internal structures to hold the data.

bdarcus commented 1 year ago

Right. And in a batch-oriented context like tex, probably not worth the hassle?

Not really although people do complain about biber not being as fast as bibtex which is written in C and does quite literally about 10% of what biber does ...

Is Perl the kind of language where you can off-load pieces of performance-intensive processing to Rust code?

I know it's often used for that. For example, in the neovim world, plugins are written in Lua, but some projects will rewrite pieces in Rust.

Of course, if those key pieces would need to rely on crates that don't really exist ...

I looked at what would be needed in Rust for biber and the things I was concerned about were a good bibtex format parser and full CLDR Unicode support. The former didn't really seem to be satisfied and the latter was all ICU-based which is good but quite complex.

I saw the new Hayagriva project from the typst folks uses this crate.

https://crates.io/crates/biblatex

Do you know which ICU crate you were looking at?

I guess there are two; the one recommended to me on a Rust forum was this one, which is pure Rust.

https://crates.io/crates/icu

But I did find it difficult, which is why I needed help from the forum to figure out localized date formatting (which I now need to implement).

https://users.rust-lang.org/t/localized-date-time-formatting/94868

With this project, I'm trying to simplify wherever possible, moving much of that configuration to these options.

I found that every time I tried to implement a simple option, in the end I had to extend it to be a fully configurable interface. Still basically an "option" but a complex one that's defined using TeX macros whose sole job is to output a complex option in XML in the .bcf file that contains all the configuration that biber needs.

But you probably couldn't have figured out the latter without first doing the former?

I'm currently thinking on sorting to make room for other configuration options. So this:

sort:
  - contributor: author
    order: ascending
  - date: issued
    order: ascending

... becomes something like:

sort:
  bar: x # new options
  foo:
    - contributor: author
      order: ascending
    - date: issued
      order: ascending

E.g. effectively define an area to put config parameters as I need them.

plk commented 1 year ago

Is Perl the kind of language where you can off-load pieces of performance-intensive processing to Rust code?

I've not looked into Rust integration but there are ways to integrate C code. I've had a look at this sort of thing before and I think that a complete re-write is likely the best policy for performance. However, performance isn't really much of an issue; it's not slow. It's just that people who use biber like bibtex, with none of the features biber offers, expect it to be the same speed. But then it's a batch program, so it really isn't much of an issue.

I saw the new Hayagriva project from the typst folks uses this crate. https://crates.io/crates/biblatex

I may have looked at this, I'll have another look, just out of interest.

Do you know which ICU crate you were looking at?

Can't remember offhand. ICU in general is more complex (and complete) than most Unicode libs ...

But you probably couldn't have figured out the latter without first doing the former?

Good point - that's true to some extent but in retrospect, where there were hard-coded assumptions in the structure of some option (like the parts of names and number of characters etc. to take from a name to construct a name key), I think it's best to make a user-facing template and use the template to pull the data parts as you'll inevitably have to extend it.

I'm currently thinking on sorting to make room for other configuration options. So this:

E.g. effectively define an area to put config parameters as I need them.

It depends a bit on how many new options there will be. I'd say, assume "quite a few". Not all of the sort-relevant options have to be in the sorting template - we have the template itself (effectively what you have here) and then other complex options which determine other aspects of sorting (such as the sort exclusions, name key generation etc.). Have a look at a sample .bcf file (.bcf stands for "biber control file" - it contains everything biber needs to run against some bibliography data - the entire data model, all options, file locations, everything - biber reads that file, then reads any bib data files it finds in there and outputs a .bbl file - that's all of the inputs/outputs).

For example, here you can see an example from the regression test files of a .bcf with multiple name sorting templates and sorting templates (search for the comments "SORTING NAME KEY TEMPLATE" and "SORTING TEMPLATE"):

https://github.com/plk/biber/blob/dev/t/tdata/basic-misc.bcf

You'll see that the sorting templates don't contain the name key generation template - that's a separate option.

bdarcus commented 1 year ago

For example, here you can see an example from the regression test files of a .bcf with multiple name sorting templates and sorting templates (search for the comments "SORTING NAME KEY TEMPLATE" and "SORTING TEMPLATE"):

Oh WOW!

bdarcus commented 1 year ago

OK, so looking at this example:

  <bcf:sortingtemplate name="nty">
    <bcf:sort order="1">
      <bcf:sortitem order="1">presort</bcf:sortitem>
    </bcf:sort>
    <bcf:sort order="2" final="1">
      <bcf:sortitem order="1">sortkey</bcf:sortitem>
    </bcf:sort>
    <bcf:sort order="3">
      <bcf:sortitem order="1">sortname</bcf:sortitem>
      <bcf:sortitem order="2">author</bcf:sortitem>
      <bcf:sortitem order="3">editor</bcf:sortitem>
      <bcf:sortitem order="4">translator</bcf:sortitem>
      <bcf:sortitem order="5">sorttitle</bcf:sortitem>
      <bcf:sortitem order="6">title</bcf:sortitem>
    </bcf:sort>
    <bcf:sort order="4">
      <bcf:sortitem order="1">sorttitle</bcf:sortitem>
      <bcf:sortitem order="2">title</bcf:sortitem>
    </bcf:sort>
    <bcf:sort order="5">
      <bcf:sortitem order="1">sortyear</bcf:sortitem>
      <bcf:sortitem order="2">year</bcf:sortitem>
    </bcf:sort>
    <bcf:sort order="6">
      <bcf:sortitem order="1">volume</bcf:sortitem>
      <bcf:sortitem literal="1" order="2">0</bcf:sortitem>
    </bcf:sort>
  </bcf:sortingtemplate>

Let me see if I understand this sortingTemplate model:

    <bcf:sort>
      <bcf:sortitem>sorttitle</bcf:sortitem>
      <bcf:sortitem>title</bcf:sortitem>
    </bcf:sort>

So here, maybe something like this as a second cut:

#[derive(Default, Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct Sort {
    pub config: SortConfig,
    pub template: Vec<SortTemplate>,
}

#[derive(Default, Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct SortTemplate {
    pub key: SortKey,
    pub order: SortOrder,
}

#[derive(Default, Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum SortOrder {
    #[default]
    Ascending,
    Descending,
}

#[derive(Default, Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum SortKey {
    #[default]
    Author, // by default, substitution rules apply
    Editor,
    IssuedYear,
    Type,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct SortConfig {
    /// Shorten name lists for sorting the same as for display.
    pub shorten_names: bool,
    /// Use same substitutions for sorting as for rendering.
    pub render_substitutions: bool,
    // etc
}

impl Default for SortConfig {
    fn default() -> Self {
        Self {
            shorten_names: false,
            render_substitutions: true,
        }
    }
}

So in YAML:

sort:
  template:
    - author
    - issued-year

Where default for order, config and substitution are already set.
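To show how such a template could drive an actual sort, here is a minimal sketch using cut-down versions of the types above. The `Reference` struct, `compare` and `sort_references` are hypothetical, the key set is reduced to two variants, and lowercased string comparison stands in for proper collation.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortOrder {
    Ascending,
    Descending,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortKey {
    Author,
    IssuedYear,
}

#[derive(Debug, Clone, Copy)]
pub struct SortTemplate {
    pub key: SortKey,
    pub order: SortOrder,
}

/// Hypothetical, heavily simplified reference record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Reference {
    pub author: String,
    pub issued_year: i32,
}

/// Compare on each template entry in turn; the first non-equal result decides.
pub fn compare(a: &Reference, b: &Reference, template: &[SortTemplate]) -> Ordering {
    template
        .iter()
        .map(|t| {
            let ord = match t.key {
                SortKey::Author => a.author.to_lowercase().cmp(&b.author.to_lowercase()),
                SortKey::IssuedYear => a.issued_year.cmp(&b.issued_year),
            };
            match t.order {
                SortOrder::Ascending => ord,
                SortOrder::Descending => ord.reverse(),
            }
        })
        .find(|ord| *ord != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}

pub fn sort_references(refs: &mut [Reference], template: &[SortTemplate]) {
    refs.sort_by(|a, b| compare(a, b, template));
}
```

Options such as `shorten_names` would change how the author string is produced before comparison, not the comparison loop itself.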

plk commented 1 year ago

  • the sort element defines the key to use, and the sortItem elements are the ordered options to select from?

The general semantics is that the sort elements group an ordered list of semantically similar fields to sort on, so "sort by some type of author-y field, in order of preference, then by some type of title-y field, in order of preference ..."

  • the purpose of the "literal" attribute? To provide a fallback key?

This provides a fixed sorting key for this part when there is no suitable field, so such entries get a fixed place in the sorting (for example, if there is no volume field, sort after all otherwise equivalent entries that do have one).

  • why do you need the "order" attribute, given elements are ordered? Is that another legacy detail?

Artefact of the library I use - it reads into a random-ordered hash so I have this to make sure of the order. Also was just in case of issues in the biblatex code that writes the .bcf where things would need to be merged etc.

So in effect, the following is functionally equivalent from an XML POV?

    <bcf:sort>
      <bcf:sortitem>sorttitle</bcf:sortitem>
      <bcf:sortitem>title</bcf:sortitem>
    </bcf:sort>

Actually, sorttitle is a special field, in that sortX fields are used to sort the X fields if found. Again, this was used to deal with awkward fields which had tricky contents that wouldn't generate nice sorting keys (TeX fields with lots of maths in them, for example). The default sorting templates always use sortX for sorting before X.
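The group semantics described in this reply (ordered candidates per group, sortX tried before X, a literal as last resort) can be sketched as below. This is an illustration of the idea only, not biber's implementation: `SortItem`, `group_key` and `sort_key` are hypothetical names, entries are modelled as a plain string map, and attributes such as `final` are ignored.

```rust
use std::collections::HashMap;

/// One candidate inside a sort group: a field to look up, or a literal fallback.
pub enum SortItem {
    Field(&'static str),
    Literal(&'static str),
}

/// An ordered list of semantically similar candidates.
pub type SortGroup = Vec<SortItem>;

/// The first candidate that yields a non-empty value wins.
pub fn group_key(entry: &HashMap<String, String>, group: &SortGroup) -> Option<String> {
    group.iter().find_map(|item| match item {
        SortItem::Field(name) => entry.get(*name).filter(|v| !v.is_empty()).cloned(),
        SortItem::Literal(text) => Some(text.to_string()),
    })
}

/// One key component per group; a group with no usable candidate yields "".
pub fn sort_key(entry: &HashMap<String, String>, template: &[SortGroup]) -> Vec<String> {
    template
        .iter()
        .map(|group| group_key(entry, group).unwrap_or_default())
        .collect()
}
```

Listing `sorttitle` ahead of `title` in a group is all it takes for an override field to win when present; comparing the resulting `Vec<String>` values then sorts group by group.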

So here, maybe something like this as a second cut ... Where default for order, config and substitution are already set.

This looks nice, yes.

bdarcus commented 1 year ago

OK, I merged the initial results of this very useful discussion; both the adjustments to the sort model, and added a couple of parameters for dates here (I'll need to figure out how to get the localized date-formatting + EDTF code working before figuring out what more I need; it's a much bigger hassle than in JS):

https://github.com/bdarcus/csln/commit/233b00f9825406e3ba2789df70c4e95f801919ce

Hopefully I can keep the sort model simple-ish :-)

bdarcus commented 1 year ago

I saw the new Hayagriva project from the typst folks uses this crate. https://crates.io/crates/biblatex

I may have looked at this, I'll have another look, just out of interest.

I missed earlier that this crate is actually written by the typst devs, so is also fairly newly-available.