Juris-M / zotero

Juris-M is a variant of the free and friendly Zotero research platform, with support for legal and multilingual materials.
https://juris-m.github.io

If language variant exists, default is not reachable #93

Closed georgd closed 3 years ago

georgd commented 3 years ago

After the 'semi-clean' install of the new release, Juris-M is always using the enIBFD variant when I’m citing Austrian cases with jm-leg-cit styles — although enIBFD is not listed among the variants in jurisdiction-preference.

georgd commented 3 years ago

Sorry, my little brains aren’t fully functional today, as it seems. Sure, I had reported this already and you even commented that it was the language arbitration bug. Should we close this too?

fbennett commented 3 years ago

That's possible. I've been struggling to refamiliarize myself with the data flow around abbreviations (the structures themselves are a challenge, and the code that handles them is very gradually improving, but it's not the most readable). We can leave this open as a marker, and see if it clears once locale arbitration comes onstream.

fbennett commented 3 years ago

I think I have this working in code. Will put up a beta for inspection after some code cleanup.

fbennett commented 3 years ago

Client beta 5.0.90m7 is now available for testing on the Mac.

georgd commented 3 years ago

Thanks. The new version doesn’t pick up any style-module: When citing an item that should be rendered via style-module, I get [CSL STYLE ERROR: reference with no printed form.]. This happens with the leg-cit styles as well as with the indigobook styles.

fbennett commented 3 years ago

Glad it was a beta! For abbrev locale arbitration, it's appending a suffix to jurisdiction in that context. That must be leaking into the jurisdiction value used to fetch modules. There are tests for default modules, but none for modules with domain extensions, so it passes tests, but fails on our production styles. Should be an easy fix. More soon.

georgd commented 3 years ago

On first reading, I don’t fully understand your analysis — will reread :). But for the time being, the original variant is still not reachable. jm-ibfd, which lists englished enIBFD in jurisdiction-preference is giving me German abbreviations for eu.int cases.

fbennett commented 3 years ago

That was thinking out loud, you can ignore it. I was unable to reproduce the error in an initial trial under Linux, but ran into a larger error---the Mac isn't upgrading the DB, which causes the entire abbrevs infrastructure to fail. Investigating now ...

On Thursday, October 15, 2020, Georg Mayr-Duffner notifications@github.com wrote:

On first reading, I don’t fully understand your analysis — will reread :). But for the time being, the original variant is still not reachable. jm-ibfd, which lists englished enIBFD in jurisdiction-preference is giving me German abbreviations for eu.int cases.


fbennett commented 3 years ago

This particular bug-hunt grew into something of a nightmare. It was due to an obvious syntax error in an SQL statement that hadn't triggered during development under Linux for some reason, but I've needed to release a long series of betas, as that was the only way to get revised code onto the Mac (direct builds on the Mac do not work, for reasons I'm not keen to look into).

(I'll be rebasing the repository to clean up the mess this has made in the project history, so if you've pulled anything in the past few hours, you may need to rewind your clone a bit to line it up with GitHub.)

fbennett commented 3 years ago

With the abbreviations upgrade issue resolved on the Mac beta, IndigoBook styles seem to be working fine. Possibly this has resolved the style module issue as well?

georgd commented 3 years ago

Now, I’m seeing something curious: I updated my Jurism installation and observed:

Then I removed Jurism and all its files and re-installed it. Now I’m back to [CSL STYLE ERROR: reference with no printed form.] with styles that use style modules (indigobook and leg-cit alike), and ibfd again applies German abbreviations that it doesn’t call for.

georgd commented 3 years ago

One more thing: there’s a discrepancy between citations and bibliography (with ibfd, as I don’t see anything with indigobook and leg-cit):

AT court (existing variants: enIBFD; desired variant: enIBFD): citation: enIBFD abbreviation applied; bibliography: no abbreviation applied

EU.int court (existing variants: de; desired variant: default): citation: de abbreviation applied; bibliography: de abbreviation applied

~FR court (existing variants: enIBFD; desired variant: enIBFD): citation: default abbreviation applied; bibliography: no abbreviation applied~ (bad example: no IBFD abbreviation for the cited court)

georgd commented 3 years ago

Hmmm. After restarting Word and Jurism for the third time, leg-cit and indigobook brought back the printed forms of legal case citations. Still, the application of abbreviations is not following a pattern that I could recognise:

coe.int:

eu.int:

nl:

at:

fbennett commented 3 years ago

Many thanks for your patience with all the testing. It should be better now. With the latest update, the beta will again update the 22 jurisdictions for which there are domain extensions for abbrev variants. All of your examples above now draw correct abbreviations in my testing here.

Fitting that this (near?) last bug was in a regular expression. https://github.com/Juris-M/abbrevs-filter/commit/5e747880d5c409dbdf9aa0df191d74b2df0c5279

georgd commented 3 years ago

I think we’re getting closer but we’re not there yet.

This is what happens in an empty document, using a JM Leg Cit style.

  1. Add citation to ECJ case + bibliography: abbreviations (variant de) applied correctly.
  2. Add citation to ECHR case (coe.int): abbreviations applied correctly.
  3. Add citation to Austrian supreme court case: AT-abbreviations are applied correctly but the eu.int and coe.int abbreviations disappeared and are replaced by the court codes.

In another document: NL court is correctly abbreviated in the bibliography but the code is printed in the citation (might be a different issue, as there’s no style module for NL).

fbennett commented 3 years ago

Thanks for the steps to reproduce. I'll give that a try here, and report back.

On Fri, Oct 16, 2020 at 4:35 PM Georg Mayr-Duffner notifications@github.com wrote:

I think we’re getting closer but we’re not there yet.

This is what happens in an empty document, using a JM Leg Cit style.

  1. Add citation to ECJ case + bibliography: abbreviations (variant de) applied correctly.
  2. Add citation to ECHR case (coe.int): abbreviations applied correctly.
  3. Add citation to Austrian supreme court case: AT-abbreviations are applied correctly but the eu.int and coe.int abbreviations disappeared and are replaced by the court codes.

In another document: NL court is correctly abbreviated in the bibliography but the code is printed in the citation (might be a different issue, as there’s no style module for NL).


fbennett commented 3 years ago

Found the bug, and see how to fix it, but it's late here. Will get out a fix first thing in the morning.

georgd commented 3 years ago

Thank you very much! I’m really sorry for being such a nuisance. If I can improve how to report issues, please tell me (now, when writing it occurs to me: would a debug ID have been helpful?).

fbennett commented 3 years ago

Good morning. It's fine! The code is green, but the logic of it all is fresh in mind, and your steps to reproduce above were very clear.

The source of the bug was quick to pin down. Abbreviations are placed in a database on install, to save memory and allow for user edits. Database access is asynchronous, but the processor runs synchronously, and isn't able to access the database directly, so abbreviations potentially needed for each item are loaded into memory before the processor is run. That's done by loading the abbreviation sets available for the top-level jurisdiction of the item under an appropriate key (e.g. for coe.int, it would be coe.int & coe.int@de, and for at, it would be at & at@enIBFD), and storing a list of the variant domains in a variable (availableAbbrevDomains). Then that list is checked against the domains preferred for the language of the item (so for JM leg cit mit Literaturverzeichnis in the de locale that would be de, deAT & LegCit). The processor chooses a domain from the intersection of the two lists, and applies that list to item content. So if the jurisdiction is coe.int, the coe.int@de list is selected.

The problem is that availableAbbrevDomains is being overwritten for each item pre-scanned, and the processor is given the list of the last item scanned (oops). That leads to a potential mismatch between the keys requested by the processor and the keys available in memory. In this case, the processor was (for example) calling for at@de, which ... doesn't exist.

The fix should be straightforward: availableAbbrevDomains needs to persistently store the lists for each top-level jurisdiction encountered, so the processor can fetch the correct list for evaluation. That's the theory, now we'll see how it works out in practice ...
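A minimal sketch of that fix, written in Python for brevity and not taken from the actual code. The names `loaded_sets`, `prescan` and `select_key` are hypothetical, and taking the first preferred domain that is available is an assumption about how the intersection is resolved:

```python
from typing import Dict, List

# Hypothetical in-memory store: abbreviation sets keyed "jurisdiction" or
# "jurisdiction@domain", matching the examples above (at, at@enIBFD, ...).
loaded_sets: Dict[str, dict] = {
    "coe.int": {}, "coe.int@de": {},
    "at": {}, "at@enIBFD": {},
}

# The fix: one list of variant domains per top-level jurisdiction,
# instead of a single list overwritten by each pre-scanned item.
available_abbrev_domains: Dict[str, List[str]] = {}

def prescan(jurisdiction: str) -> None:
    """Record the variant domains loaded for one top-level jurisdiction."""
    prefix = jurisdiction + "@"
    available_abbrev_domains[jurisdiction] = [
        key[len(prefix):] for key in loaded_sets if key.startswith(prefix)
    ]

def select_key(jurisdiction: str, preferred: List[str]) -> str:
    """Pick the first preferred domain that is available, else the default."""
    available = available_abbrev_domains.get(jurisdiction, [])
    for domain in preferred:
        if domain in available:
            return f"{jurisdiction}@{domain}"
    return jurisdiction

# Pre-scan two items; the second no longer clobbers the first.
prescan("coe.int")
prescan("at")

preferred_de = ["de", "deAT", "LegCit"]
print(select_key("coe.int", preferred_de))  # coe.int@de
print(select_key("at", preferred_de))       # at
```

With the per-jurisdiction map, a request for at can only ever resolve to at or at@enIBFD, so a nonexistent key such as at@de is never requested.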

More soon!

fbennett commented 3 years ago

A fresh beta is up for the Mac. Should be closer!

georgd commented 3 years ago

I think this is now as close as it gets. Only the NL court still shows the code instead of an abbreviation in the citation. How does fallback work when no style module exists for a jurisdiction?

fbennett commented 3 years ago

Just checked. This one is not a bug in the recent processor code, but it exposes a trap for the unwary that we may be able to address with a small change in the processor. I'll fill in some background first, then note a possible way to protect against this kind of anomaly. (The background may be [painfully] familiar, but I'm posting it here for the benefit of others who drive by this thread in the future.)

The extensions in CSL-M were initially aimed at US legal styles, which have some specific quirks, one of which dates back to the 1980s. From the beginning of the 20th century, official publication of court judgments across the US was routed through West Publishing. Citations in court filings referred to the West reporters by volume, reporter name, page, and year. This worked nicely until computer networks arrived and gave rise to a couple of issues with the old system. First, there was a significant time lag between release of the slip opinion by the courts and arrival of the official report from West, an inconvenience that attracted increasing attention as information systems generally grew faster over time. Second, and more importantly, it became clear (through a lawsuit on the subject) that reliance on West page numbering in official citations gave West a great deal of market power.

In response, a number of states introduced "vendor-neutral" or "public-domain" citation systems set directly by their courts. This effort was spotty, and added to the challenges of building automated referencing systems for US law. The treatment is uneven---only about a dozen states moved to vendor-neutral systems. It is inconsistent---each of the vendor-neutral formats differs from the others. It is partial---only one state (Oklahoma) back-fit vendor-neutral cite IDs to older cases. It also impacts parallel-citation logic.

To cope with the coexistence of West-official and vendor-neutral citation formats, Jurism needs two separate abbreviations for court names: a normal abbreviation for use in West citations (or cites to slip opinions); and a court code for use in vendor-neutral cites. The way that's done in CSL-M is to register two categories of abbreviation for institution names: institution-part to cover the former case, and institution-entire covering the latter. This is not yet documented as well as it should be, but these are the forms:

  1. institution-part abbreviation:
    <names variable="authority">
        <name/>
        <institution institution-parts="short"/>
    </names>
  2. institution-entire abbreviation:
    <names variable="authority">
        <name/>
        <institution form="short"/>
    </names>

In citation context, the jm-leg-cit-rechtsquellenverzeichnis-literaturverzeichnis style calls juris-main-short to render the legal_case type. In the juris-nl.csl module (which is bundled and should exist), juris-main-short calls authority with form="short" (the institution-entire short-form). The selected abbrevs file for the rendering will be auto-nl.json, which has court-code definitions for institution-part, but not institution-entire. No abbreviation is found under that category, so the system falls back to rendering the raw code.

This is easy to fix, by either providing an ABBREV for each court (which compiles to institution-entire), or (probably better) by calling the authority variable in the juris-nl.csl module; but the need to coordinate code across multiple files with a non-obvious relationship is a formula for bugs and confusion.

The way the desc compiler is set up, all courts will have an institution-part abbreviation for their code. The glitch in this case could be addressed by adjusting the processor to fall back to abbreviating with institution-part if an attempt at institution-entire fails.
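As a rough illustration of the proposed fallback (a sketch in Python, not the processor's code): the function name `abbreviate_court`, the shape of the `abbrevs` dictionary, and the court key and abbreviation in the sample data are all hypothetical.

```python
from typing import Dict

# Lookup order: the category the module asked for first, then the one
# that the desc compiler always generates.
FALLBACK_ORDER = ("institution-entire", "institution-part")

def abbreviate_court(abbrevs: Dict[str, Dict[str, str]], court: str) -> str:
    """Return the first abbreviation found, or the raw code if none exists."""
    for category in FALLBACK_ORDER:
        value = abbrevs.get(category, {}).get(court)
        if value:
            return value
    return court

# Hypothetical data shaped like the auto-nl.json case: only
# institution-part is defined for the court.
nl_abbrevs = {"institution-part": {"example.court": "Ex. Ct."}}
print(abbreviate_court(nl_abbrevs, "example.court"))  # Ex. Ct.
print(abbreviate_court(nl_abbrevs, "other.court"))    # other.court
```

The raw code is returned only when neither category has an entry, which is the situation the module author would still need to fix by hand.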

And here at the end of that long-long story ... what do you think?

fbennett commented 3 years ago

... it also looks like the juris-nl.csl module itself needs some formatting attention. :-/

I've been toying with another idea for some time that might reduce the burden of maintaining our growing family of style modules. Like locale evaluation, which falls back to en-US, the search for style modules currently ends with selection of the US as a fallback. Since the "legal families" tend to cluster in their citation formats, I've been thinking that modules should be able to designate an intermediate fallback that's closer to their requirements. Your thoughts on that one?

georgd commented 3 years ago

This is easy to fix, by either providing an ABBREV for each court (which compiles to institution-entire), or (probably better) by calling the authority variable in the juris-nl.csl module; but the need to coordinate code across multiple files with a non-obvious relationship is a formula for bugs and confusion.

The way the desc compiler is set up, all courts will have an institution-part abbreviation for their code. The glitch in this case could be addressed by adjusting the processor to fall back to abbreviating with institution-part if an attempt at institution-entire fails.

Seeing all the history, I’m not sure whether some other changes should be tackled as well. Semantically, shouldn’t the European court abbreviations go into ABBREV? The German courts are already organized like that, which makes sense as the federal courts are often cited by references to the official reporter. But as far as my research has reached so far, that’s an exception. Thus, the direction of the fallback would go the other way round.

I've been toying with another idea for some time that might reduce the burden of maintaining our growing family of style modules. Like locale evaluation, which falls back to en-US, the search for style modules currently ends with selection of the US as a fallback. Since the "legal families" tend to cluster in their citation formats, I've been thinking that modules should be able to designate an intermediate fallback that's closer to their requirements. Your thoughts on that one?

Once during the last months I wondered how the fallbacks worked and whether some jurisdictions shouldn’t fall back to a US-like format. So, yes, I think that’s a good idea. Do we have enough information to cluster them more or less reliably?

fbennett commented 3 years ago

If courts in a jurisdiction are always cited with a short-code, never by a descriptive name, the code can safely be set as "abbrev." The choice can be revisited if descriptive names later become necessary: because the abbreviations are set automatically, and are called via modules that update together with the abbreviation lists, a change from using institution-part to using institution-entire would be transparent to the user.

The fallback from "ABBREV" (institution-entire) to "abbrev" (institution-part) is the right direction for it, because only the latter is guaranteed to exist. The compiler script makes these assignments in sequence when processing the desc file:

So all entries must have a "name" value for the UI menu label, and that is used for the institution-part abbreviation if no "abbrevs" value is given. If no "ABBREV" value is given, though, there will be no institution-entire value. The fallback in the processor to institution-part assures that the court code will never appear in citations.
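The rule in the paragraph above can be sketched roughly as follows. This is an illustration in Python and not the compiler script itself: the function `compile_entry`, the output keys, and the use of "abbrev" as the field name for the institution-part value are assumptions, as is the sample entry.

```python
from typing import Dict

def compile_entry(entry: Dict[str, str]) -> Dict[str, str]:
    """Derive the menu label and abbreviation categories for one court entry."""
    if "name" not in entry:
        # Every entry needs a name for the UI menu label.
        raise ValueError("desc entry has no 'name' value")
    compiled = {
        "label": entry["name"],
        # institution-part always exists: the explicit value if given,
        # otherwise the name itself.
        "institution-part": entry.get("abbrev", entry["name"]),
    }
    # institution-entire exists only when ABBREV is supplied.
    if "ABBREV" in entry:
        compiled["institution-entire"] = entry["ABBREV"]
    return compiled

# Hypothetical entry with a name only.
print(compile_entry({"name": "Example Court"}))
```

Because institution-part is filled in every case, a processor that falls back to it always has something better than the raw code to print.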

Good to hear that you've had the same thought about style-module fallbacks. I'll open a separate issue for it.