xeCJK's algorithm for the width of the Chinese dash (with a proposed fix)

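According to a comment on Issue 158, with Source Han Sans and Source Han Serif two em dashes do not merge into a two‑em dash, and three em dashes do not merge into a three‑em dash.

The main cause is known: xeCJK inserts extra code between consecutive characters, which prevents XeTeX from forming the ligature. Most Chinese font libraries have no glyphs for U+2E3A and U+2E3B (as that comment notes), so it is understandable that xeCJK does not currently support automatic dash ligatures.

This issue raises a second problem: xeCJK's algorithm for the dash width appears to be wrong, because the total width of the resulting non-ligature dash is not two full-width characters. The following requirement is quoted from Requirements for Chinese Text Layout (《中文排版需求》):

> The dash (破折号) indicates a continuation of tone or sound, a shift in meaning, or a supplement to the text. It is a U+2E3A TWO-EM DASH [⸺] or U+2014 EM DASH [—] occupying the space of two Han characters.

Without xeCJK, using only fontspec, the ligatures form and every symbol has full-width metrics: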

% !TeX program = XeLaTeX
% Download SourceHanSerifSC-Regular.otf for this test:
% https://github.com/adobe-fonts/source-han-serif/tree/release/OTF/SimplifiedChinese
\documentclass{article}
\usepackage{fontspec}
% Need full-width feature:
% https://github.com/CTeX-org/ctex-kit/issues/320
\setmainfont{SourceHanSerifSC-Regular.otf}[RawFeature=+fwid]
\setmonofont{Latin Modern Mono}[Scale=MatchLowercase]
\begin{document}
\fontsize{10.5bp}{16.38bp}\selectfont
中文—半破折号：输入一个\texttt{ U+2014}，自动按照\texttt{ U+2015 }全角字面输出。\par
中文——破折号：输入两个\texttt{ U+2014}，自动连成\texttt{ U+2E3A }并按全角字面输出。\par
中文———符号：输入三个\texttt{ U+2014}，自动连成\texttt{ U+2E3B }并按全角字面输出。
\end{document}

[Figure: output of the fontspec-only example]

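With xeCJK, the ligatures do not form and the symbols no longer have full-width metrics (the dash takes 1.9130 character widths in Source Han Serif and 1.9414 character widths in SimSun (中易宋体)):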

% !TeX program = XeLaTeX
% Download SourceHanSerifSC-Regular.otf for this test:
% https://github.com/adobe-fonts/source-han-serif/tree/release/OTF/SimplifiedChinese
\documentclass[fontset=none]{ctexart}
\setmainfont{TeX Gyre Termes}[Scale=1.101208]% Scale=729/662
\setmonofont{Latin Modern Mono}[Scale=MatchLowercase]
% Declare three-em dash (U+2E3B) as FullRight:
\xeCJKDeclareCharClass{FullRight}{`⸻}
% Need full-width feature:
% https://github.com/CTeX-org/ctex-kit/issues/320
\setCJKmainfont{SourceHanSerifSC-Regular.otf}[RawFeature=+fwid]
% Also, test SimSun
\setCJKsansfont{SimSun}
\usepackage{xcolor}
\usepackage{array}
\usepackage{booktabs}
\newcommand*\header[1]{\multicolumn{1}{c}{#1}}
\newcommand*\ideographicbaseline{-0.12}
\newcommand*\ccbox[1][1]{%
  \smash{%
    \rlap{%
      \color{blue}%
      \setlength\fboxrule{0.1pt}%
      \setlength\fboxsep{-\fboxrule}%
      \fbox{%
        \rule[\ideographicbaseline\ccwd]{0pt}{\ccwd}%
        \rule{#1\ccwd}{0pt}%
      }%
    }%
  }%
  \kern#1\ccwd
}
\begin{document}
\begin{tabular}{>{\ttfamily}l l}
\toprule
 \header{Unicode} & \header{Output} \\
\midrule
 1 U+2014        & \rlap{\ccbox\ccbox\ccbox\ccbox\ccbox\ccbox\ccbox}中文—半破折号 \\
 hspace: 1 ccwd  & 中文\hspace{\ccwd}半破折号 \\
\cmidrule{1-2}
 2 U+2014s       & \rlap{\ccbox\ccbox\ccbox[1.9130]\ccbox\ccbox\ccbox}中文——破折号 \\
 1 U+2E3A        & 中文⸺破折号 \\
 hspace: 2 ccwds & 中文\hspace{2\ccwd}破折号 \\
\cmidrule{1-2}
 3 U+2014s       & \rlap{\ccbox\ccbox\ccbox[2.8260]\ccbox\ccbox}中文———符号 \\
 1 U+2E3B        & 中文⸻符号 \\
 hspace: 3 ccwds & 中文\hspace{3\ccwd}符号 \\
\bottomrule
\end{tabular}

\sffamily
\renewcommand*\ideographicbaseline{-0.140625}% -36/256
中易宋体：\rlap{\ccbox\ccbox\ccbox[1.9414]\ccbox\ccbox\ccbox}中文——破折号
\end{document}

[Figure: output of the xeCJK example]

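The best fix would be to decide, whenever two or three consecutive U+2014 are encountered, whether a ligature can be used. Because ligature support depends on the font, exposing this as a user-facing boolean key seems the more reasonable design. If that is hard to implement, xeCJK should at least guarantee that the non-ligature dash is two full widths in total. U+2014 occurs far more often than Hangul (Korean) ligatures do.

xeCJK seems to treat both the dash and the ellipsis specially; so far I have found no width problem with the ellipsis. And since there is no such thing as a two‑ellipsis, ligatures are not a concern there.

With Source Han Sans and Source Han Serif, the U+2014 ligature can apparently be restored with the following hack, modeled on the HangulJamo character class (see the replies below: it breaks when punctuation precedes the dash, and a fix is given there as well):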

% !TeX program = XeLaTeX
% Download SourceHanSerifSC-Regular.otf for this test:
% https://github.com/adobe-fonts/source-han-serif/tree/release/OTF/SimplifiedChinese
\documentclass[fontset=none]{ctexart}
\setmainfont{TeX Gyre Termes}[Scale=1.101208]% Scale=729/662
\setCJKmainfont{SourceHanSerifSC-Regular.otf}[RawFeature=+fwid]% Need full-width feature

\ExplSyntaxOn
\xeCJK_new_class:n { PoZheHao }
\__xeCJK_save_CJK_class:n { PoZheHao }
\xeCJK_declare_char_class:nn { PoZheHao } { "2014 }
\seq_map_inline:Nn \g__xeCJK_class_seq
  {
    \str_if_eq:nnF {#1} { PoZheHao }
      {
        \xeCJK_copy_inter_class_toks:nnnn { PoZheHao } {#1} { FullRight } {#1}
        \xeCJK_copy_inter_class_toks:nnnn {#1} { PoZheHao } {#1} { FullRight }
      }
  }
\ExplSyntaxOff

\usepackage[width=147bp]{geometry}% 14 characters per line

\begin{document}

想象力比知识更重要，因为知识是有限的，而想象力概括着世界的一切。\par\nobreak
\hfill ——Albert Einstein

\medskip

你的生日——四月十八日——每年我总记得。（曹禺《雷雨》）

\end{document}

[Figure: output of the PoZheHao hack]

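Because the ligature forms, the line is never wrongly broken between the two U+2014, even though no code is inserted between them. ~~This shows that when the ligature forms there is no need to change XeTeXdashbreakstate.~~

For the majority of fonts, which do not support the dash ligature, blank space still has to be added or removed at both ends of the dash so that it occupies two character widths. For now, globally replacing —— with \makebox[2\ccwd][c]{——} seems passable (not seriously).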
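
A minimal sketch of that stopgap follows. The macro name `\pozhehao` is hypothetical (it is not part of xeCJK or ctex), and the sketch assumes that the two kerned U+2014 are no wider than 2\ccwd in the chosen font, so that centering them only adds blank space at the ends.

```latex
% !TeX program = XeLaTeX
\documentclass{ctexart}
% Hypothetical helper: center the two U+2014 in a box exactly two
% character widths (2\ccwd) wide. The box also prevents a line break
% between the two dashes.
\newcommand*\pozhehao{\makebox[2\ccwd][c]{——}}
\begin{document}
你的生日\pozhehao 四月十八日\pozhehao 每年我总记得。
\end{document}
```

Because the dash is hidden inside a box, xeCJK no longer sees it as punctuation, so spacing and line breaking around it may differ from the normal punctuation rules.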

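Copying is still a problem: by default the Source Han Serif dash copies out of the PDF as two U+2014, but with fwid or locl enabled it copies out as two U+2015.

U+2E3A and U+2E3B behave similarly. With fwid or locl enabled they copy out as U+10F34C and U+10F34D respectively (both in the Private Use Area). This looks like the same problem as #286.

@Stone-Zeng According to Ken Lunde's (小林剑) reply in Source Han Serif Issue 63, handling the dash in this special way is a deliberate choice by their team. Without fwid, an input U+2014 is rendered with the glyph from the Western part of the Source Han fonts (lower and shorter). In my example, a true Western em dash (that is, the em dash of TeX Gyre Termes) can only be obtained by typing three consecutive U+002D, i.e. through TeX's own ligature.

With fwid enabled, the Source Han fonts substitute U+2015 for U+2014, and the two Private Use Area glyphs for the two‑em dash and three‑em dash are likewise substituted for U+2E3A and U+2E3B. All three glyphs are vertically centered and their lengths are integer multiples of the font size. The font implements this with three substitution rules and four ligature rules, so the result of copying from the PDF is no surprise.

The awkward part is still that the dash produced by xeCJK's current method is not two character widths wide, probably because it only adds a negative kern between the two U+2014 and forgets to pad the two ends. xeCJK goes to this trouble because the U+2014 in most Chinese font libraries is badly made. For the Source Han fonts, fwid or locl is the right way to get CJK punctuation; the fonts also provide the dash ligature, and without fwid the glyph sits too low.

After many attempts I arrived at a solution that passed my tests with a variety of fonts. However, xeCJK's source code for punctuation compression is intricate and I am still not fully confident, so I would like @qinglee @leo-liu @Liam0205 to advise and decide.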

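Current algorithm

xeCJK currently has two algorithms that involve the dash.

1. Read the glyph box and the ink of U+2014 to obtain the total blank space at its left and right ends, then insert a negative kern of the same size so that two consecutive U+2014 are rendered exactly joined. This is handled by \@@_long_punct_kerning:N.
2. For punctuation that must be centered horizontally, compute the total blank space at its two ends, remove it (\@@_punct_bound_rule:NN) or reduce it (\@@_punct_rule:NN) as needed, and distribute the remainder evenly to both sides of the glyph as glue. This is handled by \xeCJK_punct_margin_process:NN. Note that "removing" may actually add kerning (for example, when the ink overflows the glyph box), whereas the default glue is greater than or equal to zero.

These intricate algorithms exist to compensate for design flaws in many font libraries. In my own tests below I met four kinds of fonts:

1. The U+2014 ink is narrower than one Han character. Representative fonts: the Zhongyi (中易) family.
2. The U+2014 glyph box itself is wider than one Han character. Representative font: Microsoft YaHei (微软雅黑).
3. The U+2014 glyph box is as wide as the font size, but the ink overflows the box. Representative font: 方正兰亭黑.
4. Well-designed OpenType fonts with a dash ligature, whose U+2014 is a Western glyph that must be replaced by the Chinese glyph by selecting full-width punctuation. Representative fonts: the Source Han family.

Main shortcomings of the current algorithm

1. The output dash (two U+2014) does not occupy two Han character widths horizontally.
2. The computation of the negative kern inside the dash needs improvement to cope with fonts of types 2 and 3 above.
3. Dash ligatures are not supported, and hence neither are fonts of type 4 above.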

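Proposed solution

Add blank space at both ends of the dash: `\xeCJK_punct_margin_process:NN`

The original glue is ( \l_@@_tmp_dim - ( \@@_use_punct_dim:nN { dimen } #2 ) ) / 2, which suits a single punctuation mark. When two U+2014 occur in a row, the blank space squeezed out between them is exactly the total side blank of one U+2014, so removing the / 2 (for U+2014 only) makes the dash occupy two character widths.

Note: without the division by 2, a single U+2014 no longer occupies one character width, and the change is incompatible with OpenType dash ligatures. Therefore,

Scope of dropping the division by 2: only when there is no dash ligature, and only for the character U+2014, whose two ends then each receive the whole blank.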
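
The width bookkeeping behind this change can be written out as a short derivation. It is only a sketch: the symbols $W$ (target width of one punctuation mark, here one character width), $d$ (ink width of U+2014) and $s = W - d$ (total side blank) are notation introduced here, the minimum-margin clamp is ignored, and the negative kern is assumed to make the inks of adjacent dashes touch exactly.

```latex
% Notation assumed for this sketch: W = target width, d = ink width, s = W - d.
\begin{align*}
\text{current glue per side:}\quad & g = \tfrac{1}{2}(W - d) = \tfrac{s}{2},\\
\text{two dashes, current:}\quad   & g + 2d + g = W + d = 2W - s,\\
\text{patched glue per side:}\quad & g' = W - d = s,\\
\text{two dashes, patched:}\quad   & g' + 2d + g' = 2W,\\
\text{one dash, patched:}\quad     & g' + d + g' = W + s,\\
\text{three dashes, patched:}\quad & g' + 3d + g' = 3W - s.
\end{align*}
```

With the SimSun metrics ($W = 256$, $d = 241$ font units), the current two-dash width is $497/256 = 1.94140625$ character widths, which agrees with the 1.9414 measured earlier in this thread.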

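Improve the negative kern computation: `\@@_long_punct_kerning:N`

The current source code simply takes the larger of \l_@@_bound_dim + \l_@@_reverse_bound_dim and \c_zero_dim. The improvement is as follows.

For the long punctuation marks U+2E3A, U+2025 and U+2026, no negative kern is needed between two identical consecutive characters. Only the (poorly designed) U+2014 needs one. Let w = width be the glyph box width of U+2014 and d = dimen its ink width; then

take the maximum of 1) w - d; 2) d + w - 2*(font size); 3) 2*w - 2*(font size), and use its negative as the kern. In the common case where w and d do not deviate much from the font size, this guarantees both that a) there is no gap inside the dash and b) the total width of the dash (ink or glyph box) does not exceed two Han character widths.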
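
Plugging in the metrics listed in the table of the minimal working example in this thread gives a quick sanity check of the three candidates. This is a sketch under two assumptions that are mine: $d = w - (\mathrm{LSB} + \mathrm{RSB})$, and $e$ denotes the font size (one em, i.e. the UPE value), with everything in font design units. The rows are SimSun (中易宋体), Microsoft YaHei (微软雅黑) and 方正兰亭黑 Heavy.

```latex
% k = -max(w - d, d + w - 2e, 2w - 2e); all values in font design units.
\[
\begin{array}{lrrr|rrr|r}
\text{font} & w & d & e & w-d & d+w-2e & 2w-2e & k \\ \hline
\text{SimSun}                & 256  & 241  & 256  & 15  & -15 & 0   & -15  \\
\text{Microsoft YaHei}       & 2212 & 2212 & 2048 & 0   & 328 & 328 & -328 \\
\text{FZ LanTingHei Heavy}   & 1000 & 1017 & 1000 & -17 & 17  & 0   & -17
\end{array}
\]
```

For SimSun the two inks then touch exactly and the two glyph boxes span $2w + k = 497 \le 512$ units; for the other two fonts the inks overlap slightly and the joined ink spans exactly $2e$ (4096 and 2000 units respectively).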

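Support dash ligatures: a new character class

I will not repeat the PoZheHao character class described above. \@@_punct_if_right:N must of course also be patched so that it returns true for PoZheHao as well. Note that the Source Han fonts also support a series of Japanese ligatures such as <3042 3099>, <304B 309A>, <3033 3035> and <3034 3035>. The first two are already covered by the CM character class; the last two are not (assigning "3033 -> "3035 to the PoZheHao or HangulJamo class is enough; they are used only in vertical typesetting).

Before and after

[Figure: comparison before and after the fix]

Minimal working example

Enable \fixpozhehaotrue to see the improved result (only four lines of the test code involve 方正兰亭黑).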

% !TeX program = XeLaTeX
% !OS = Windows 8.1
\documentclass[linespread=1.2]{ctexart}
\setmainfont{Latin Modern Roman}
\setmonofont{Latin Modern Mono}
\newCJKfontfamily\NotoSerifCJKExtraLight{Noto Serif CJK SC ExtraLight}[CharacterWidth=Full]
\newCJKfontfamily\NotoSerifCJKLight{Noto Serif CJK SC Light}[CharacterWidth=Full]
\newCJKfontfamily\NotoSerifCJKRegular{Noto Serif CJK SC}[CharacterWidth=Full]
\newCJKfontfamily\NotoSerifCJKMedium{Noto Serif CJK SC Medium}[CharacterWidth=Full]
\newCJKfontfamily\NotoSerifCJKSemiBold{Noto Serif CJK SC SemiBold}[CharacterWidth=Full]
\newCJKfontfamily\NotoSerifCJKBold{Noto Serif CJK SC Bold}[CharacterWidth=Full]
\newCJKfontfamily\NotoSerifCJKBlack{Noto Serif CJK SC Black}[CharacterWidth=Full]
\newCJKfontfamily\NotoSansCJKThin{Noto Sans CJK SC Thin}[CharacterWidth=Full]
\newCJKfontfamily\NotoSansCJKLight{Noto Sans CJK SC Light}[CharacterWidth=Full]
\newCJKfontfamily\NotoSansCJKDemiLight{Noto Sans CJK SC DemiLight}[CharacterWidth=Full]
\newCJKfontfamily\NotoSansCJKRegular{Noto Sans CJK SC}[CharacterWidth=Full]
\newCJKfontfamily\NotoSansCJKMedium{Noto Sans CJK SC Medium}[CharacterWidth=Full]
\newCJKfontfamily\NotoSansCJKBold{Noto Sans CJK SC Bold}[CharacterWidth=Full]
\newCJKfontfamily\NotoSansCJKBlack{Noto Sans CJK SC Black}[CharacterWidth=Full]
\newCJKfontfamily\lanting{方正兰亭黑Pro_GB18030 Heavy.otf}

\newif\iffixpozhehao
\fixpozhehaofalse
%\fixpozhehaotrue

\iffixpozhehao
\makeatletter
\ExplSyntaxOn
% Ideally \l_@@_pozhehao_ligature_bool would be set through \xeCJKsetup;
% the key could be named PoZheHaoLigature.
\bool_new:N \l__xeCJK_pozhehao_ligature_bool
\bool_set_false:N \l__xeCJK_pozhehao_ligature_bool
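% New character class PoZheHao:
% it relates to every other class exactly as FullRight does,
% except that nothing is inserted between characters of this class itself.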
\xeCJK_new_class:n { PoZheHao }
\__xeCJK_save_CJK_class:n { PoZheHao }
\seq_map_inline:Nn \g__xeCJK_class_seq
  {
    \str_if_eq:nnF {#1} { PoZheHao }
      {
        \xeCJK_copy_inter_class_toks:nnnn { PoZheHao } {#1} { FullRight } {#1}
        \xeCJK_copy_inter_class_toks:nnnn {#1} { PoZheHao } {#1} { FullRight }
      }
  }
% Make sure \@@_punct_if_right:N also treats the PoZheHao class as FullRight
\prg_set_conditional:Npnn \__xeCJK_punct_if_right:N #1 { p , T , F , TF }
  {
    \if_int_compare:w \xeCJK_token_value_class:N #1 =
                      \xeCJK_class_num:n { FullRight }
      \prg_return_true:
    \else:
      \if_int_compare:w \xeCJK_token_value_class:N #1 =
                        \xeCJK_class_num:n { PoZheHao }
        \prg_return_true:
      \else:
        \prg_return_false:
      \fi:
    \fi:
  }
% User command (ideally a key of \xeCJKsetup):
% put U+2014 and U+2015 into the PoZheHao class.
\NewDocumentCommand \UsePoZheHaoLigature { }
  {
    \bool_set_true:N \l__xeCJK_pozhehao_ligature_bool
    \xeCJK_declare_char_class:nn { PoZheHao } { "2014 , "2015 }
  }
% Improved computation of the negative kern inside the dash
\cs_set_protected_nopar:Npn \__xeCJK_long_punct_kerning:N #1
  {
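    % Use Max( width - dimen, dimen + width - 2*fontsize, 2*width - 2*fontsize )
    % as the basis for the middle kern.
    %   1. width - dimen handles the Zhongyi (中易) fonts;
    %   2. dimen + width - 2*fontsize handles 方正兰亭黑;
    %   3. 2*width - 2*fontsize handles Microsoft YaHei (微软雅黑).
    % This works when width and dimen are close to the font size.
    % If they deviate too much, the font itself is badly designed.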
    \dim_set:Nn \l__xeCJK_tmp_dim
      {
        \dim_max:nn
          { \l__xeCJK_bound_dim + \l__xeCJK_reverse_bound_dim }
          {
            \dim_max:nn
              {
                \tex_dimexpr:D
                  \__xeCJK_use_punct_dim:nN { dimen } #1 +
                  \__xeCJK_use_punct_dim:nN { width } #1 -
                  \f@size pt - \f@size pt
                \scan_stop:
              }
              {
                2
                \tex_dimexpr:D
                  \__xeCJK_use_punct_dim:nN { width } #1 -
                  \f@size pt
                \scan_stop:
              }
          }
      }
    % Only two adjacent U+2014 need a kern between them;
    % two U+2E3A, U+2025 or U+2026 do not.
    \dim_set:Nn \l__xeCJK_tmp_dim
      {
        \str_case:nnTF {#1}
          { { ^^^^2014 } { } }
          { -\l__xeCJK_tmp_dim }
          { \c_zero_dim }
      }
    \__xeCJK_save_punct_dim:nNNn  { kern } #1 #1 { \l__xeCJK_tmp_dim }
    \__xeCJK_save_punct_skip:nNNn { kern } #1 #1 { \l__xeCJK_tmp_dim }
    % All other values are left unchanged
    \dim_set:Nn \l__xeCJK_tmp_dim
      {
        \dim_max:nn
          { \l__xeCJK_bound_dim + \l__xeCJK_reverse_bound_dim }
          { \c_zero_dim }
      }
    \__xeCJK_save_punct_dim:nNNn { bound_width } #1 #1 { \l__xeCJK_tmp_dim }
    \dim_set:Nn \l__xeCJK_tmp_dim
      {
        \str_case:nnTF {#1}
          { { ^^^^2025 } { } { ^^^^2026 } { } }
          { \c_zero_dim }
          { -\l__xeCJK_tmp_dim }
      }
    \dim_add:Nn \l__xeCJK_tmp_dim
      { \dim_max:nn { \l__xeCJK_bound_dim } { \c_zero_dim } }
    \__xeCJK_save_punct_dim:nNNn  { bound_kern } #1 #1 { \l__xeCJK_tmp_dim }
    \__xeCJK_save_punct_skip:nNNn { bound_kern } #1 #1 { \l__xeCJK_tmp_dim }
  }
% Improved computation of the padding on both sides of centered punctuation
\cs_set_protected_nopar:Npn \xeCJK_punct_margin_process:NN #1#2
  {
    \dim_set:Nn \l__xeCJK_tmp_dim
      {
        \bool_if:NTF \l__xeCJK_enabled_global_setting_bool
          {
            \cs_if_exist_use:cF { g__xeCJK_punct_width/#2/tl }
              {
                \tl_if_empty:NTF \g__xeCJK_punct_width_tl
                  { \__xeCJK_calc_punct_width:N #2 }
                  { \g__xeCJK_punct_width_tl }
              }
          }
          { \__xeCJK_calc_punct_width:N #2 }
      }
    \dim_set:Nn \l__xeCJK_tmp_dim
      {
        \dim_max:nn
          { \l__xeCJK_margin_minimum_dim }
          {
            \dim_compare:nNnTF \l__xeCJK_tmp_dim < \c_max_dim
              {
                \__xeCJK_punct_if_middle:NTF #2
                  {
                    % Shared part
                    (   \l__xeCJK_tmp_dim
                      - ( \__xeCJK_use_punct_dim:nN { dimen } #2 )
                    )
                    % Branch on \l_@@_pozhehao_ligature_bool.
                    \bool_if:NTF \l__xeCJK_pozhehao_ligature_bool
                      {
                        % The dash has a ligature: pad each end with half the blank.
                        / \c_two
                      }
                      {
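                        % No dash ligature: every character except U+2014
                        % gets half the blank on each side.
                        % U+2014 gets the whole blank on each side (no extra
                        % computation needed), so that the dash is guaranteed
                        % to occupy two character widths.
                        % A lone U+2014 then occupies one width + blank,
                        % and three consecutive U+2014 occupy 3 widths - blank.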
                        \str_case:nnF {#2}
                          { { ^^^^2014 } { } }
                          {
                            / \c_two
                          }
                      }
                  }
                  {
                    \bool_if:NTF \l__xeCJK_optimize_margin_bool
                      {
                        \dim_max:nn
                          {
                            \dim_min:nn
                              { \l__xeCJK_bound_dim }
                              { \l__xeCJK_reverse_bound_dim }
                          }
                      }
                      { \use:n }
                      {
                          \l__xeCJK_tmp_dim
                        - \l__xeCJK_reverse_bound_dim
                        - ( \__xeCJK_use_punct_dim:nN { dimen } #2 )
                      }
                  }
              }
              {
                \bool_if:NTF \l__xeCJK_optimize_margin_bool
                  { \dim_min:nn { \l__xeCJK_bound_dim } }
                  { \use:n }
                  { \__xeCJK_calc_margin_width:N #2 }
              }
          }
      }
    \__xeCJK_save_punct_dim:nNNn { glue } #1 #2 { \l__xeCJK_tmp_dim }
    \__xeCJK_save_punct_skip:nNNnnn { glue } #1 #2
      { \l__xeCJK_tmp_dim }
      {
        \__xeCJK_punct_if_middle:NTF #2
          {
            ( \__xeCJK_use_punct_dim:nN { width } #2 -
              \__xeCJK_use_punct_dim:nN { dimen } #2 ) / \c_two
            - \l__xeCJK_tmp_dim
          }
          { \l__xeCJK_bound_dim - \l__xeCJK_tmp_dim }
      }
      {
        \__xeCJK_punct_if_middle:NTF #2
          { .5 \l__xeCJK_tmp_dim }
          { \l__xeCJK_tmp_dim - \l__xeCJK_reverse_bound_dim }
      }
  }
\ExplSyntaxOff
\makeatother
\fi

\usepackage{mathtools}
\usepackage{unicode-math}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{geometry}
\geometry{
  a4paper,width=420bp
}
\newcommand*\header[1]{\multicolumn{1}{c}{#1}}
\newcommand*\ideographicbaseline{-0.140625}
\newcommand*\ccbox[1][1]{%
  \leavevmode\smash{%
    \color{blue}%
    \setlength\fboxrule{0.05pt}%
    \setlength\fboxsep{-\fboxrule}%
    \fbox{%
      \rule[\ideographicbaseline\ccwd]{0pt}{\ccwd}%
      \rule{#1\ccwd}{0pt}%
    }%
  }%
}
\iffixpozhehao
  \newcommand*\ccoutput[3]{%
    Fixed & #1%
%    \rlap{\ccbox\ccbox\ccbox[#2]\ccbox\ccbox}%
    中文——中文%
    \ignorespaces
  }%
\else
  \newcommand*\ccoutput[3]{%
    #3 & #1%
    \rlap{\ccbox\ccbox\ccbox[#3]\ccbox\ccbox}%
    中文——中文%
    \ignorespaces
  }%
\fi
\newcommand*\fakefootnotei{\textsuperscript1\ignorespaces}
\newcommand*\fakefootnoteii{\textsuperscript2\ignorespaces}
\newcommand*\fakefootnoteiii{\textsuperscript3\ignorespaces}
\newcommand*\fakefootnoteiv{\textsuperscript4\ignorespaces}
\newcommand*\fakefootnotev{\textsuperscript5\ignorespaces}
\newcommand*\fakefootnotevandvi{\textsuperscript{5,6}\ignorespaces}
\newcommand*\fakefootnotevii{\textsuperscript7\ignorespaces}
\newcommand*\fakefootnoteviii{\textsuperscript8\ignorespaces}
\newcommand*\fakefootnoteviiiandix{\textsuperscript{8,9}\ignorespaces}

\begin{document}
\noindent
\begin{minipage}{\textwidth}
\begin{tabular}{l r r r r l l}
\toprule
 \header{Font} &
 \header{UPE\fakefootnotei} &
 \header{bbwd\fakefootnoteii} &
 \header{LSB\fakefootnoteiii} &
 \header{RSB\fakefootnoteiv} &
 \header{Dash / em box} &
 \header{Output} \\
\midrule
 中易宋/仿 &
   256 & 256 & 8 & 7 & \ccoutput{\fangsong}{2}{1.94140625}
     \fakefootnotev \\
 中易黑/楷 &
   256 & 256 & 0 & 1 & \ccoutput{\kaishu}{2}{1.99609375}
     \fakefootnotev \\
 中易隶书\gdef\ideographicbaseline{-0.17578125} &
   256 & 256 & 74 & 10 & \ccoutput{\lishu}{2}{1.671875}
     \iffixpozhehao \fakefootnotev \else \fakefootnotevandvi \fi \\
 中易幼圆\gdef\ideographicbaseline{-0.17578125} &
   256 & 256 & 36 & 10 & \ccoutput{\youyuan}{2}{1.8203125}
     \iffixpozhehao \fakefootnotev \else \fakefootnotevandvi \fi \\
\cmidrule{1-7}
 微软雅黑\gdef\ideographicbaseline{-0.15} &
   2048 & 2212 & 0 & 0 & \ccoutput{\yahei}{2}{2.16015625}
     \iffixpozhehao \fakefootnotev \else \fakefootnotevandvi \fi \\
\cmidrule{1-7}
 方正兰亭黑 Heavy\gdef\ideographicbaseline{-0.15} &
   1000 & 1000 & $-9$ & $-8$ & \ccoutput{\lanting}{2}{2.017}
     \iffixpozhehao \fakefootnotev \else \fakefootnotevii \fi \\
\cmidrule{1-7}
 思源黑体 Thin\gdef\ideographicbaseline{-0.12} &
   1000 & 881 & 44 & 45 & \ccoutput{\NotoSansCJKThin
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.911}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviiiandix \fi \\
 思源黑体 Light &
   1000 & 886 & 45 & 45 & \ccoutput{\NotoSansCJKLight
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.91}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviiiandix \fi \\
 思源黑体 DemiLight &
   1000 & 892 & 46 & 46 & \ccoutput{\NotoSansCJKDemiLight
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.908}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviiiandix \fi \\
 思源黑体 Regular &
   1000 & 894 & 46 & 46 & \ccoutput{\NotoSansCJKRegular
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.908}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviiiandix \fi \\
 思源黑体 Medium &
   1000 & 900 & 47 & 48 & \ccoutput{\NotoSansCJKMedium
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.905}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviiiandix \fi \\
 思源黑体 Bold &
   1000 & 908 & 49 & 49 & \ccoutput{\NotoSansCJKBold
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.902}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviiiandix \fi \\
 思源黑体 Black &
   1000 & 915 & 50 & 50 & \ccoutput{\NotoSansCJKBlack
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.9}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviiiandix \fi \\
 思源宋体 ExtraLight &
   1000 & 873 & 43 & 43 & \ccoutput{\NotoSerifCJKExtraLight
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.914}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviii \fi \\
 思源宋体 Light &
   1000 & 877 & 43 & 44 & \ccoutput{\NotoSerifCJKLight
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.913}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviii \fi \\
 思源宋体 Regular &
   1000 & 882 & 43 & 44 & \ccoutput{\NotoSerifCJKRegular
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.913}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviii \fi \\
 思源宋体 Medium &
   1000 & 890 & 44 & 44 & \ccoutput{\NotoSerifCJKMedium
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.912}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviii \fi \\
 思源宋体 SemiBold &
   1000 & 905 & 44 & 44 & \ccoutput{\NotoSerifCJKSemiBold
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.912}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviii \fi \\
 思源宋体 Bold &
   1000 & 925 & 45 & 45 & \ccoutput{\NotoSerifCJKBold
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.91}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviii \fi \\
 思源宋体 Black &
   1000 & 948 & 45 & 45 & \ccoutput{\NotoSerifCJKBlack
   \iffixpozhehao\UsePoZheHaoLigature\fi}{2}{1.91}
     \iffixpozhehao \fakefootnotevandvi \else \fakefootnoteviii \fi \\
\bottomrule
\end{tabular}
\xeCJKsetup{CJKecglue=}%
\footnotesize
\rule{0pt}{\ht\strutbox}%
\fakefootnotei
Units per em.

\fakefootnoteii
Bounding box width.

\fakefootnoteiii
Left side-bearing.

\fakefootnoteiv
Right side-bearing.

\iffixpozhehao
\fakefootnotev
Everything lines up, and every dash occupies two character widths!

\textsuperscript6\ignorespaces
Here the OpenType ligature is enabled through the character class \texttt{PoZheHao}.
\else
\fakefootnotev
The ratio of the dash width to the em-box width is computed as:
\[
\frac{\text{dash width}}{\text{em-box width}}
 = \frac{2\times\text{bbwd}-(\text{LSB}+\text{RSB})}{\text{UPE}}.
\]

\textsuperscript6\ignorespaces
The ideographic baseline value in the font's metrics table is unreliable.

\textsuperscript7\ignorespaces
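The mechanism is different.
Here the kern between two \texttt{U+2014} is zero by default, while the inks already overlap by 17 units.
Because the outer edges of the dash may not be compressed, the extra 17 units can only be added
to the existing 2000 em-box units, which happens to give the same result as the formula in footnote~5.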

\fakefootnoteviii
Because \texttt{CharacterWidth=Full} selects full-width punctuation,
the bbwd of \texttt{U+2015} is now 1000,
and the formula for the ratio becomes:
\[
\frac{\text{dash width}}{\text{em-box width}}
 = \frac{2\times\text{1000}-(\text{LSB}+\text{RSB})}{\text{1000}}.
\]

\textsuperscript9\ignorespaces
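The negative kern inserted between two \texttt{U+2014} is computed from the ink
of \texttt{U+2014}.
However, the full-width \texttt{U+2015} of Source Han Sans has large side-bearings:
its ink is 850 to 860 units wide and its side blanks total 140 to 150 units, whereas the negative kern is at most 100,
so a gap remains inside the dash.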
\fi
\end{minipage}
\end{document}

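@RuixiZhang42 I tested the earlier code at https://github.com/CTeX-org/ctex-kit/issues/382#issuecomment-430873626 and found that it raises an error whenever the dash follows another punctuation mark. With the sample text changed to 爱。——, the error is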

! Missing number, treated as zero.
<to be read again> 
                   \c__xeCJK_xeCJK/SourceHanSerifSC(0)/m/n/10.53937/quanjiao...
l.26 爱。—
        —
? x

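A temporary workaround is to insert an empty box, but I do not know whether the improved version still has the problem (I have not tried it yet).

@Stone-Zeng xeCJK really does run deep...

I ran the 爱。—— example with \tracingall and found that xeCJK tries to fetch the glue on the left of U+2014: \c__xeCJK_xeCJK/SourceHanSerifSC-Regular.otf(0)/m/n/10.53937/quanjiao/dim/glue/left/—/tl. This does not exist; only the glue on the right of U+2014 has been computed (for centered punctuation like this, the glue on both sides is the same, so there is no need to compute it twice).

So xeCJK has to be tricked into really treating the PoZheHao class as FullRight. \@@_punct_if_right:N, which tests whether a punctuation mark is a full-width right punctuation mark, needs one more patch: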

\prg_set_conditional:Npnn \__xeCJK_punct_if_right:N #1 { p , T , F , TF }
  {
    \if_int_compare:w \xeCJK_token_value_class:N #1 =
                      \xeCJK_class_num:n { FullRight }
      \prg_return_true:
    \else:
      \if_int_compare:w \xeCJK_token_value_class:N #1 =
                        \xeCJK_class_num:n { PoZheHao }
        \prg_return_true:
      \else:
        \prg_return_false:
      \fi:
    \fi:
  }

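This at least no longer raises an error, but the kerning between the full stop and the dash is too large (argh!).

In fact, the whole algorithm for the middle kern, the padding at both ends, and the glue against other punctuation seems not to suit the Source Han fonts at all... Adding the PoZheHao class treats the symptom, not the cause. It works only because the left and right side bearings of U+2014 in the Western part of Source Han are mostly equal; when they differ, the dash is off by 0.5/1000 of an em, which is invisible to the naked eye. The bound_kern, bound_rule, rule and glue values computed from the Western glyph are actually all wrong, and fixing that properly... I dare not imagine the workload...

xeCJK's current handling of the dash is indeed rather crude: it only ensures that no gap appears inside the dash, and considers neither the two-character width nor the ligature feature of newer fonts. From a quick look, the approach of the new algorithm above is certainly right; only some implementation details remain to be worked out.

Type is Beautiful's recent article 《不离不弃的破折号》 analyzes the various dash problems in detail. What caught my interest is a roundabout fix it mentions. In a blog post, the Japanese typesetting expert Mr. 大石 suggests using "a U+2015 stretched horizontally to twice its width" as the dash, because Japanese foundries generally do not draw U+2015 all the way to the edges of its box. Mainland Chinese foundries generally do not draw U+2014 all the way to the edges either, so a similar approach could be considered:

When two consecutive U+2014 are read, xeCJK could stretch the first U+2014 to two character widths and then swallow the second.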
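
The idea can be tried by hand, without touching xeCJK internals, by scaling a single U+2014 horizontally with graphicx. This is only a sketch and not how xeCJK works today: `\stretcheddash` is a hypothetical macro name, the scaling stretches the side bearings along with the ink, and the copy-and-paste question is left open.

```latex
% !TeX program = XeLaTeX
\documentclass{ctexart}
\usepackage{graphicx}
% Hypothetical helper: one U+2014 scaled horizontally to two character
% widths. \height keeps the natural height, so only the width changes.
\newcommand*\stretcheddash{\resizebox{2\ccwd}{\height}{—}}
\begin{document}
你的生日\stretcheddash 四月十八日\stretcheddash 每年我总记得。
\end{document}
```

Because only the horizontal direction is scaled, the stroke thickness stays the same while the stroke ends are stretched, which may be visible in fonts with rounded or tapered dash terminals.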

The PDF copy-and-paste problem would still need some thought, though...

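@stone-zeng I finally figured out how to use the locl feature correctly. In principle it is enabled by default; I later learned that Script and Language have to be specified as needed. Note that in the Source Han family not only the dashes but also the digits have substitute glyphs (the official readme barely hints at this; in a Chinese context the digits are as tall as the capitals):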

\documentclass{article}
\usepackage{xcolor}
\usepackage{fontspec}
\setmainfont{SourceHanSansSC-Regular.otf}
\newcommand\test{\char"8FD4 \char"2014 E567F\char"2E3A }
\begin{document}
\test\llap{\color{red}\rule[0.734em]{6em}{0.05pt}}\par
\addfontfeatures{Script=CJK Ideographic,Language=Chinese Simplified}
\test\llap{\color{red}\rule[0.734em]{6em}{0.05pt}}
\end{document}

For copy and paste, it seems that \XeTeXgenerateactualtext=1 still has to be set under XeLaTeX. See https://tex.stackexchange.com/q/488619
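
A minimal sketch of where that setting would go, reusing the font setup from the earlier examples in this thread. Whether the copied text then comes out as U+2014 depends on the PDF viewer, and that has not been checked here.

```latex
% !TeX program = XeLaTeX
\documentclass[fontset=none]{ctexart}
\setCJKmainfont{SourceHanSerifSC-Regular.otf}[RawFeature=+fwid]
% XeTeX primitive: ask XeTeX to record the original input text
% as /ActualText in the PDF.
\XeTeXgenerateactualtext=1
\begin{document}
你的生日——四月十八日——每年我总记得。
\end{document}
```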

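Is there any advice on when to use which Script and Language? The official documentation does not seem very clear about it.

Some Founder (方正) fonts, such as 方正兰亭圆简体, have the same problem. Is there a uniform way to handle them? SimSun (中易宋体) appears to have been taken care of: its dash used to be broken.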


\documentclass[a4paper,fontset=none]{article}
\usepackage{ctex}
\usepackage{graphicx}
\usepackage[inner=2cm,outer=2cm,top=2.5cm,bottom=2.25cm]{geometry}
\usepackage{indentfirst}
\setlength{\parindent}{2em}

\setCJKmainfont{宋体}[BoldFont=Noto Serif CJK SC Bold]
\setCJKsansfont{Noto Sans CJK SC}
\setCJKmonofont{Noto Sans Mono CJK SC}
\newCJKfontfamily\songti{宋体}[BoldFont=Noto Serif CJK SC Bold]
\newCJKfontfamily\heiti{Noto Sans CJK SC}
\newCJKfontfamily\kaishu{楷体}
\newCJKfontfamily\fangsong{仿宋}
\newCJKfontfamily\lishu{方正隶书简体}
\newCJKfontfamily\yahei{Noto Sans CJK SC}
\newCJKfontfamily\youyuan{方正兰亭圆简体}

\begin{document}\Large

\section*{Font notes}

This is a font set I configured myself. A preview follows:

\begin{itemize}
    \item[1.] \songti{我能吞下玻璃而不伤身体。，、——：；‘“songti = 宋体”’！？}

    \item[2.] \heiti{我能吞下玻璃而不伤身体。，、——：；‘“heiti = Noto Sans CJK SC”’！？}

    \item[3.] \kaishu{我能吞下玻璃而不伤身体。，、——：；‘“kaishu = 楷体”’！？}

    \item[4.] \fangsong{我能吞下玻璃而不伤身体。，、——：；‘“fangsong = 仿宋”’！？}

    \item[5.] \lishu{我能吞下玻璃而不伤身体。，、——：；‘“lishu = 方正隶书简体”’！？}

    \item[6.] \yahei{我能吞下玻璃而不伤身体。，、——：；‘“yahei = Noto Sans CJK SC”’！？}

    \item[7.] \youyuan{我能吞下玻璃而不伤身体。，、——：；‘“youyuan = 方正兰亭圆简体”’！？}
\end{itemize}

Current Issues:

Microsoft's default 幼圆 font reports ``Font "幼圆" does not contain requested Script "CJK".'', but its Chinese dash is not broken; 方正兰亭圆简体 reports no error, but its dash is broken.

\end{document}

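> Some Founder (方正) fonts, such as 方正兰亭圆简体, have the same problem

As far as I know, only Source Han provides U+2E3A. These Founder fonts use the traditional approach, which is simply two U+2014. xeCJK compresses the spacing to join the two glyphs, but that obviously does not work for a rounded typeface, whose stroke ends are rounded off.

Where can the compression ratio be set? For the rounded font I would just make it compress more...

Also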

\usepackage{fontspec}
\setmainfont{Noto Sans CJK SC}
\setsansfont{Noto Sans CJK SC}
\setmonofont{JetBrains Mono}
\usepackage[fontset=none]{ctex}
\setCJKmainfont{宋体}[BoldFont=Noto Serif CJK SC Bold]
\setCJKsansfont{Noto Sans CJK SC}
\setCJKmonofont{Noto Sans Mono CJK SC}
\newCJKfontfamily\songti{宋体}[BoldFont=Noto Serif CJK SC Bold]
\newCJKfontfamily\heiti{Noto Sans CJK SC}
\newCJKfontfamily\kaishu{楷体}
\newCJKfontfamily\fangsong{仿宋}
\newCJKfontfamily\lishu{方正隶书简体}
\newCJKfontfamily\yahei{Noto Sans CJK SC}
\newCJKfontfamily\youyuan{方正兰亭圆简体}

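Here Noto still uses two U+2014, which shows up as a dash that sits too low. Is this a difference between Noto and Source Han?

I looked at #444, but I cannot use it that way because other fonts are involved; for example, the 宋体 line raises an error (Illegal unit of measure).

But it really does use U+2E3A, because it renders the same as 我能吞下玻璃而不伤身体\symbol{"2E3A}。，、：；‘“heiti = Noto Sans CJK SC”’！？. I have reported this to Noto CJK.

Unicode makes no distinction between Western and CJK dashes, so U+2014 and U+2E3A each correspond to several glyphs (the original comment showed them in a figure).

Typing the character directly gives the Western glyph, which is why its height is wrong. (Quotation marks happen to avoid the shared code point problem because LaTeX itself uses ``...''.) To get the CJK glyph, the relevant OpenType feature has to be enabled, such as the fwid or locl mentioned above.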

\documentclass[a4paper,fontset=none]{article}
\usepackage{ctex}

\setCJKmainfont{宋体}[BoldFont=Noto Serif CJK SC Bold]
\setCJKsansfont{Noto Sans CJK SC}[RawFeature=+fwid]
\setCJKmonofont{Noto Sans Mono CJK SC}[RawFeature=+fwid]
\newCJKfontfamily\songti{宋体}[BoldFont=Noto Serif CJK SC Bold]
\newCJKfontfamily\heiti{Noto Sans CJK SC}[RawFeature=+fwid]
\newCJKfontfamily\kaishu{楷体}
\newCJKfontfamily\fangsong{仿宋}
\newCJKfontfamily\lishu{方正隶书简体}
\newCJKfontfamily\yahei{Noto Sans CJK SC}[RawFeature=+fwid]
\newCJKfontfamily\youyuan{方正兰亭圆简体}

\begin{document}\Large

\section*{Font notes}

This is a font set I configured myself. A preview follows:

\begin{itemize}
    \item[1.] \songti{我能吞下玻璃而不伤身体。，、——：；‘“songti = 宋体”’！？——“”}

    \item[2.] \heiti{我能吞下玻璃而不伤身体。，、\symbol{"2E3A}：；‘“songti = 宋体”’！？——“”}

    \item[3.] \kaishu{我能吞下玻璃而不伤身体。，、——：；‘“kaishu = 楷体”’！？}

    \item[4.] \fangsong{我能吞下玻璃而不伤身体。，、——：；‘“fangsong = 仿宋”’！？}

    \item[5.] \lishu{我能吞下玻璃而不伤身体。，、——：；‘“lishu = 方正隶书简体”’！？}

    \item[6.] \yahei{我能吞下玻璃而不伤身体。，、\symbol{"2E3A}：；‘“songti = 宋体”’！？——“”}

    \item[7.] \youyuan{我能吞下玻璃而不伤身体。，、——：；‘“youyuan = 方正兰亭圆简体”’！？}
\end{itemize}

\noindent Current Issues:

\begin{itemize}
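    \item Microsoft's default 幼圆 font reports ``does not contain requested Script CJK.'', but its Chinese dash is not broken; 方正兰亭圆简体 reports no error, but its dash is broken (in fact xeCJK does not fully join the two strokes when merging them).
    \item With the Noto fonts, the Chinese dash has to be written as \texttt{\textbackslash symbol\{"2E3A\}} instead of ``——'' (the Western em dash); the two differ in height.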
\end{itemize}
\end{document}

This is my current solution. Thanks!
