citation-style-language / csl-evolution

Central repository for coordinating CSL development

Dash-normalization #36

Closed denismaier closed 4 years ago

denismaier commented 4 years ago

In the issue about splitting title-main and title-sub, one question was about em vs. en dashes. Should we add an option to normalize dashes in a textual context, as we already change hyphens to en dashes in a numerical context (we do that, right?)? I guess that should most likely be locale-dependent, e.g., em dashes for US English and en dashes for most other locales?

bwiernik commented 4 years ago

What are the current citeproc behaviors? I've not heard any complaints about this, so whatever the current behavior is would be correct, I'd think.

denismaier commented 4 years ago

Current behaviour in citeproc-js seems to be that with "uppercase_subtitles": false dashes are not touched at all. With "uppercase_subtitles": true a couple of combinations get normalized, but all to em dashes. En dashes aren't recognized as relevant punctuation, i.e., they are not converted to em dashes and the word after them is not uppercased. (Note that two hyphens are converted to an em dash, which goes a bit against most plain-text conventions, I think.) Locales don't seem to have any effect.

Here are some test cases:

>>===== MODE =====>>
citation
<<===== MODE =====<<

>>===== RESULT =====>>
Title input with em dash—Should start with uppercase
Title input with space-em-dash-space—Should normalize and start with uppercase
Title input with space-hyphen-space—Should normalize and start with uppercase
Title input with space-double-hyphen-space—Should normalize and start with uppercase
Title input with space-triple-hyphen-space—Should normalize and start with uppercase
Title input with double-hyphen—Should normalize and start with uppercase
Title input with triple-hyphen—Should normalize and start with uppercase
Title input with space-endash-space – Should keep endash and start with uppercase
<<===== RESULT =====<<

>>===== CITATION-ITEMS =====>>
[
  [
    {
      "id": "ITEM-1"
    }
  ],
  [
    {
      "id": "ITEM-2"
    }
  ],
  [
    {
      "id": "ITEM-3"
    }
  ],
  [
    {
      "id": "ITEM-4"
    }
  ],
  [
    {
      "id": "ITEM-5"
    }
  ],
  [
    {
      "id": "ITEM-6"
    }
  ],
  [
    {
      "id": "ITEM-7"
    }
  ],
  [
    {
      "id": "ITEM-8"
    }
  ]
]
<<===== CITATION-ITEMS =====<<

>>===== OPTIONS =====>>
{
    "uppercase_subtitles": false
}
<<===== OPTIONS =====<<

>>===== CSL =====>>
<style 
      xmlns="http://purl.org/net/xbiblio/csl"
      class="note"
      version="1.0"
      default-locale="en">
  <info>
    <id />
    <title />
    <updated>2009-08-10T04:49:00+09:00</updated>
  </info>
  <citation>
    <layout delimiter="; ">
      <text variable="container-title"/>
    </layout>
  </citation>
</style>
<<===== CSL =====<<

>>===== INPUT =====>>
[
    {
        "id": "ITEM-1", 
        "container-title": "Title input with em dash—should start with uppercase", 
        "type": "article-journal"
    },
    {
        "id": "ITEM-2", 
        "container-title": "Title input with space-em-dash-space — should normalize and start with uppercase", 
        "type": "article-journal"
    },
    {
        "id": "ITEM-3", 
        "container-title": "Title input with space-hyphen-space - should normalize and start with uppercase", 
        "type": "article-journal"
    },
    {
        "id": "ITEM-4", 
        "container-title": "Title input with space-double-hyphen-space -- should normalize and start with uppercase", 
        "type": "article-journal"
    },
    {
        "id": "ITEM-5", 
        "container-title": "Title input with space-triple-hyphen-space --- should normalize and start with uppercase", 
        "type": "article-journal"
    },
    {
        "id": "ITEM-6", 
        "container-title": "Title input with double-hyphen--should normalize and start with uppercase", 
        "type": "article-journal"
    },
    {
        "id": "ITEM-7", 
        "container-title": "Title input with triple-hyphen---should normalize and start with uppercase", 
        "type": "article-journal"
    },
    {
        "id": "ITEM-8", 
        "container-title": "Title input with space-endash-space – should keep endash and start with uppercase", 
        "type": "article-journal"
    }
]
<<===== INPUT =====<<

denismaier commented 4 years ago

After thinking a bit more about this, I tend to think that dashes shouldn't be normalized unless they are delimiters between title and subtitle or between multiple subtitles. (Converting -- to en dash and --- to em dash is a different thing. We should certainly do this. Not sure about single hyphens...)

bwiernik commented 4 years ago

Thinking through this more, this is a locale-dependent thing (e.g., British English and German generally prefer space-en dash-space instead of em dash in text). I'm not sure whether that's something we would need to bother with?

I think that a single space-hyphen-space should probably be normalized to em dash or space-en dash-space.

denismaier commented 4 years ago

So that would mean:

space-hyphen-space => em dash or space-en dash-space (depending on locale?)
hyphen-hyphen => en dash
hyphen-hyphen-hyphen => em dash
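
To make the proposed mapping concrete, here is a minimal sketch in Python. It is not how citeproc-js or any other processor does it. The function name `normalize_dashes`, the locale table, and the choice to fall back to a spaced en dash for unlisted locales are all assumptions for illustration. Runs of four or more hyphens are not covered by the proposal, so the sketch leaves them alone.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"

# Hypothetical locale table (an assumption, not part of CSL):
# what a space-hyphen-space sequence is replaced with.
SPACED_HYPHEN_BY_LOCALE = {
    "en-US": EM_DASH,            # closed-up em dash
    "en-GB": f" {EN_DASH} ",     # spaced en dash
    "de-DE": f" {EN_DASH} ",     # spaced en dash
}


def normalize_dashes(text: str, locale: str = "en-US") -> str:
    """Apply the three proposed rules to a plain-text string."""
    # Assumed fallback: spaced en dash for locales not in the table.
    spaced = SPACED_HYPHEN_BY_LOCALE.get(locale, f" {EN_DASH} ")
    # Exactly three hyphens => em dash (handled before the two-hyphen rule).
    text = re.sub(r"(?<!-)---(?!-)", EM_DASH, text)
    # Exactly two hyphens => en dash.
    text = re.sub(r"(?<!-)--(?!-)", EN_DASH, text)
    # A single hyphen with one space on each side => locale-dependent dash.
    text = re.sub(r"(?<=\S) - (?=\S)", spaced, text)
    return text
```

The lookarounds `(?<!-)` and `(?!-)` ensure that only runs of exactly two or three hyphens are converted, and hyphens inside words such as "well-known" are never touched.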

As said above, I don't think we should normalize dashes unless a dash is a delimiter between title and subtitle, and normalize-title-delimiters is set to "full"---but then we will most likely normalize to a colon or a period (not to a dash).

bwiernik commented 4 years ago

Yeah, let's leave it to publishers to normalize dashes if they want that.

Looking at my library of items, almost all occurrences of space-hyphen-space are low-quality metadata imports that should properly be colons separating subtitles. Just a few are in German publications, where they should be en dashes.

For simplicity, I think let's just leave single hyphens alone.

denismaier commented 4 years ago

Ok, good. So that gives us:

hyphen-hyphen => en dash
hyphen-hyphen-hyphen => em dash

So we don't need a schema change. A paragraph in the specs (aimed at processor implementors) will be enough. I'll close here and open a new issue there.
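
For implementors, the rule agreed on above (two hyphens to an en dash, three hyphens to an em dash, single hyphens left alone) could be sketched as follows in Python. The function name `convert_hyphen_runs` is hypothetical, and leaving runs of four or more hyphens unchanged is an assumption, since the thread does not discuss them.

```python
import re

# Run length => replacement character, per the rule agreed in this thread.
DASH_FOR_RUN_LENGTH = {
    2: "\u2013",  # en dash
    3: "\u2014",  # em dash
}


def convert_hyphen_runs(text: str) -> str:
    """Replace runs of exactly two or three hyphens; leave all other runs as-is."""
    def replace_run(match: re.Match) -> str:
        run = match.group(0)
        # Single hyphens and (by assumption) runs of 4+ are returned unchanged.
        return DASH_FOR_RUN_LENGTH.get(len(run), run)

    # "-+" matches each maximal run of consecutive hyphens.
    return re.sub(r"-+", replace_run, text)


# Inputs taken from the test fixture earlier in this thread.
double = "Title input with double-hyphen--should normalize and start with uppercase"
single = "Title input with space-hyphen-space - should normalize and start with uppercase"

print(convert_hyphen_runs(double))
print(convert_hyphen_runs(single))  # unchanged: single hyphens are left alone
```

Matching maximal runs in one pass avoids any ordering problem between the two rules, and surrounding spaces are deliberately not touched.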