JabRef / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

The citation key displays Chinese pinyin instead of Chinese character #9605

Open ehehela opened 1 year ago

ehehela commented 1 year ago

Here, an alternative citation key generation scheme is recommended for Chinese bibliographies: using the pinyin of the authors' names rather than Chinese characters, which are non-ASCII.

For example, WanZheng2016 or WanZ2016 is preferred over the default 万征2016.

@Article{万征2016,
  author  = {万征 and 姚仰平 and 孟达},
  journal = {力学学报},
  title   = {复杂加载下混凝土的弹塑性本构模型},
  year    = {2016},
  issn    = {0459-1879},
  month   = jun,
  number  = {05},
  pages   = {1159--1171},
  volume  = {48},
}
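A minimal sketch of the requested key scheme, assuming a hand-written lookup table in place of a real pinyin library. The class `PinyinKeySketch`, its methods, and the table are hypothetical and are not part of JabRef; the table only covers the two characters of the first author's name (U+4E07 and U+5F81), written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: a hand-written table stands in for a real pinyin library. */
final class PinyinKeySketch {

    // Hypothetical lookup table covering only the first author of the example entry.
    private static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('\u4E07', "Wan");   // U+4E07, surname of the first author
        PINYIN.put('\u5F81', "Zheng"); // U+5F81, given name of the first author
    }

    /** Replaces every character found in the table; all other characters pass through unchanged. */
    static String romanize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            sb.append(PINYIN.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }

    /** Builds a key of the form [first author][year] from a BibTeX author field. */
    static String citationKey(String authorField, String year) {
        String firstAuthor = authorField.split("\\s+and\\s+")[0].trim();
        return romanize(firstAuthor) + year;
    }
}
```

Called with the author field and year of the entry above, `citationKey` yields `WanZheng2016`. Characters missing from the table are kept as they are, so an incomplete table degrades to the current behaviour instead of producing a wrong romanization.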

clzls commented 9 months ago

I want to mention that any dumb automatic converter is unreliable, since Hanzi are used in many Asian countries and territories, in many variants. For example, the Hanzi 【出来】 may be chu lai in Chinese, de ki in Japanese, xuý lãi in Vietnamese, or cheok rae in Korean.

In my workflow, calibre uses a dumb converter that turns my Chinese items into ugly "ASCII equivalents" of whatever it believes them to be. Sometimes Chinese (Mandarin or Cantonese), sometimes Japanese...

So I suggest letting the user decide which romanization to use.
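To illustrate why automatic detection falls short: the Java standard library can report the script of a code point, but not the language of the text. The sketch below (the class name `HanScriptCheck` is made up for this example) returns true for 出来 (U+51FA U+6765) whether the entry is Chinese, Japanese, or anything else, so the choice of romanization still has to come from the user or from entry metadata.

```java
/** Sketch only: detects the Han script, which says nothing about the language. */
final class HanScriptCheck {

    /** Returns true if the text contains at least one code point in the Han script. */
    static boolean containsHan(String text) {
        return text.codePoints()
                   .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }
}
```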

koppor commented 6 months ago

Implementation note: Use https://github.com/houbb/pinyin. I found it via https://search.maven.org/search?q=pinyin4j - and it seems to be the most maintained one. If that does not work, try https://github.com/belerweb/pinyin4j.

I checked https://sourceforge.net/p/pinyin4j/news/ and think, the library needs to be configured to (in bold)

clzls commented 6 months ago

I would again state that it may not be a good idea to do this naively in the main program, since one cannot easily distinguish Chinese from other CJK languages without native speakers or AI, and it will definitely BREAK other users' experience, especially for Japanese users, who also use Hanzi (Kanji in Japanese) but with a totally different romanization. Hanzi are not only the characters of Chinese but also the characters of all CJK languages across Asia, just like the Latin characters we are using now. It is a complicated system. I suggest a low priority for this issue. Localization checks are not reliable, since multiple languages should be allowed in a single library. (I have seen this in calibre and I don't want to see it happen here, breaking all my bibliography...)

koppor commented 6 months ago

BibTeX allows for using the field "language" to indicate the language of the entry. Maybe, one could use that as input for the citation key generator.

@clzls I assume you are on UTF8 and use the non-ASCII characters also for your citation keys?

The issue is complicated with many different user "profiles". Maybe we need a preference?

ehehela commented 6 months ago

The implementation is complicated since we need to accommodate users with different languages while ensuring a smooth user experience for those accustomed to the current system.

Maybe we could offer users the option to enable this function (default off). Other preconditions are also needed, such as ensuring that romanization only occurs when a valid "language" field is specified (as mentioned by koppor).

Perhaps we can extend romanization support not only to Chinese language but also other languages (Korean, Japanese, etc.). Allow different "language" fields to use distinct romanization schemes. Therefore, a consistent interface with customization options would be beneficial, allowing users to decide which "language" fields to romanize.
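A rough sketch of what such an opt-in, per-language interface could look like. Everything here is hypothetical (the class `RomanizerRegistry`, its methods, and the idea of keying on the lower-cased "language" field are assumptions made for illustration, not existing JabRef code): romanization is off by default and only applied when the entry's "language" field matches a registered scheme.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Sketch only: opt-in romanizers selected by the entry's "language" field. */
final class RomanizerRegistry {

    private final Map<String, UnaryOperator<String>> byLanguage = new HashMap<>();
    private boolean enabled = false; // default off, as proposed above

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    /** Registers a romanization scheme for one value of the "language" field. */
    void register(String language, UnaryOperator<String> romanizer) {
        byLanguage.put(normalize(language), romanizer);
    }

    /** Romanizes only if the feature is enabled and the language has a registered scheme. */
    String apply(String languageField, String text) {
        if (!enabled || languageField == null) {
            return text;
        }
        UnaryOperator<String> romanizer = byLanguage.get(normalize(languageField));
        return romanizer == null ? text : romanizer.apply(text);
    }

    private static String normalize(String language) {
        return language.trim().toLowerCase(Locale.ROOT);
    }
}
```

Entries without a "language" field, or with a language nobody registered, keep their original text, so existing libraries would be unaffected unless a user opts in.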

Alternatively, a semi-automatic approach could suffice. We could introduce new options in the right-click menu (see figure below). We could also use “check integrity” to collect the entries with non-ASCII citation keys into one group, followed by “cleanup entries” for this group (see figure below). This would keep the feature convenient for those who need it without affecting anyone else.

[Screenshot 2024-04-01 185548] [Screenshot 2024-04-01 181912]
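The "collect entries with non-ASCII citation keys" step of the semi-automatic approach could be sketched as follows. The class `NonAsciiKeyFinder` is hypothetical and works on plain key strings; how it would be wired into JabRef's integrity check is left open.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch only: finds citation keys that contain non-ASCII characters. */
final class NonAsciiKeyFinder {

    /** Returns true if every character of the key can be encoded as US-ASCII. */
    static boolean isAscii(String key) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(key);
    }

    /** Returns the keys that would be collected into the cleanup group, in input order. */
    static List<String> nonAsciiKeys(List<String> keys) {
        return keys.stream()
                   .filter(key -> !isAscii(key))
                   .collect(Collectors.toList());
    }
}
```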

clzls commented 6 months ago

@clzls I assume you are on UTF8 and use the non-ASCII characters also for your citation keys?

Yes I do, and a bunch of my papers were written in Chinese, using tons of packages to tweak LaTeX compilers, to make them happy dealing with non-ASCII characters... (no one would write papers full of something like \symbol{28450}\symbol{35486}, I guess)

Perhaps we can extend romanization support not only to Chinese language but also other languages (Korean, Japanese, etc.). Allow different "language" fields to use distinct romanization schemes. Therefore, a consistent interface with customization options would be beneficial, allowing users to decide which "language" fields to romanize.

Looks good to me. Implemented this way, it works like an opt-in extension that can be extended to any language that needs ASCII equivalents (even European ones, such as Danish or Greek, I think). I would go even further and suggest that dynamically loadable custom formatters may be an even better solution, so that everyone would be happy...

koppor commented 6 months ago

At a LaTeX conference, I learned from the LaTeX developers that it is now also possible to use Unicode in labels with pdflatex, e.g., \label{sec:grüße}. Moreover, it is also possible for citation keys.

mlep commented 5 months ago

Does it work for BibTeX too? (like \cite{grüße})

koppor commented 5 months ago

Does it work for BibTeX too? (like \cite{grüße})

According to the LaTeX 3 team: Yes. Just ensure that you run latest TeXLive 😅

ehehela commented 5 months ago

I use the latest MiKTeX with BibTeX and pdflatex. A citation key like \cite{grüße} works correctly, but a citation key like \cite{任政2018} causes an error. The error message is:

! Undefined control sequence. \GenericError ...
! Emergency stop. \GenericError ...

koppor commented 5 months ago

@ehehela I asked LaTeX pros. It works on TeXLive. See https://chat.stackexchange.com/transcript/message/65511308#65511308

OK, it seems, some more "magic" is needed:

\documentclass{article}

\DeclareUnicodeCharacter{4EFB}{CJK Ideograph 4efb}
\DeclareUnicodeCharacter{653F}{CJK Ideograph 653f}

\begin{document}

\cite{任政2018}

\begin{thebibliography}{99}
\bibitem{任政2018} xxxx
\end{thebibliography}
\end{document}

davidcarlisle commented 5 months ago

Actually, that resolves the error, but the cite doesn't work. It doesn't need the definitions, but it does (currently) need something safe as the first token:

\documentclass{article}

\begin{document}

\cite{ 任政2018}

\cite{x任政2018}

\begin{thebibliography}{99}
\bibitem{任政2018} xxxx
\bibitem{x任政2018} xxxx
\end{thebibliography}
\end{document}

That said, the official position is that cite keys should use ASCII characters.
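On the generator side, the "something safe as the first token" workaround could be approximated by prefixing keys whose first character is not ASCII. This is only a sketch under that assumption; the class `SafeKeyPrefix` and the choice of prefix are hypothetical, and it mirrors the x任政2018 variant in the example above.

```java
/** Sketch only: makes sure a citation key starts with an ASCII character. */
final class SafeKeyPrefix {

    /** Prepends the given ASCII prefix if the key starts with a non-ASCII code point. */
    static String withAsciiFirst(String key, String prefix) {
        if (key.isEmpty() || key.codePointAt(0) < 128) {
            return key; // already safe, or nothing to fix
        }
        return prefix + key;
    }
}
```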

ehehela commented 5 months ago

@koppor and @davidcarlisle Thank you. I have tested two cases with pdflatex+bibtex+article: the first also loads the ctex package to enable Chinese support, while the second does not. The results show that the first test only works for an ASCII citation key with a bibliography entry in Chinese, whereas the second test works for a non-ASCII citation key with a bibliography entry in English. Therefore, I think pdflatex+bibtex may not fully support non-ASCII characters. xelatex+biber works.

The source code of the first test is:

\documentclass{article}
\usepackage{ctex}

\begin{filecontents*}{ref.bib}
@article{陈骁2012,
  title={afdrgwfdsa},
  author={sdfas and sadfsd and afsa and dasf},
  journal={asdfsd},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
@article{chen2012,
  title={基于电无级变速器的内燃机最优控制策略及整车能量管理},
  author={陈骁 and 黄声华 and 万山明 and 庞珽},
  journal={电工技术学报},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
\end{filecontents*}

\bibliographystyle{plain}

\begin{document}

\cite{ chen2012}  % works for article class with ctex package and Chinese bibliography with ascii citation key
%\cite{ 陈骁2012}  % works for article class and English bibliography with non-ascii citation key

\bibliography{ref}
\end{document} 

The compilation result is: [Screenshot 2024-04-17 091808]

The source code of the second test is:

\documentclass{article}
%\usepackage{ctex}

\begin{filecontents*}{ref.bib}
@article{陈骁2012,
  title={afdrgwfdsa},
  author={sdfas and sadfsd and afsa and dasf},
  journal={asdfsd},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
@article{chen2012,
  title={基于电无级变速器的内燃机最优控制策略及整车能量管理},
  author={陈骁 and 黄声华 and 万山明 and 庞珽},
  journal={电工技术学报},
  volume={27},
  number={2},
  pages={133--138},
  year={2012}
}
\end{filecontents*}

\bibliographystyle{plain}

\begin{document}

%\cite{ chen2012}  % works for article class with ctex package and Chinese bibliography with ascii citation key
\cite{ 陈骁2012}  % works for article class and English bibliography with non-ascii citation key

\bibliography{ref}
\end{document}

The compilation result is: [Screenshot 2024-04-17 092010]

clzls commented 5 months ago

Therefore, I think pdflatex+bibtex may not fully support non-ascii characters. xelatex+biber works.

FYI: My thesis uses xelatex+bibtex, and citation keys with CJK characters work fine. ctexbook and a bunch of other packages are used, as it is a production environment.

ThiloteE commented 5 months ago

As for JabRef, I am not yet sure what would be the best option to have the correct language in the entry preview, but when it comes to rendering the entry in LaTeX, there seems to be a limitation of pdflatex that can be worked around with xelatex, special commands/syntax or other packages.

Are you aware of Babel?

There is also LuaLaTeX.

koppor commented 5 months ago

@ThiloteE lualatex is the way to go :). pdflatex and xelatex should only be used if absolutely necessary :)

koppor commented 2 weeks ago

Idea: Maybe some of the Apache Lucene functionality can be used. There are these folding filters. We used some of them in our LatexAwareAnalyzer (the ASCIIFoldingFilter).
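For a feel of what folding does and does not do, here is a stand-in sketch that uses only `java.text.Normalizer` from the standard library, not Lucene. The class `AsciiFoldSketch` is hypothetical. It decomposes characters and drops the combining marks, which removes diacritics from Latin letters; Han characters have no such decomposition and pass through unchanged, so folding of this kind would not produce pinyin on its own.

```java
import java.text.Normalizer;

/** Sketch only: diacritic stripping via Unicode normalization, as a stand-in for a folding filter. */
final class AsciiFoldSketch {

    /** Decomposes the text (NFD) and removes all combining marks. */
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Letters without a canonical decomposition, such as ß, are also left as they are by this sketch.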