Open ehehela opened 1 year ago
I want to mention that any dumb auto-converter solution is not reliable, since Hanzi is used in many Asian countries and territories, in many variants. For example, Hanzi 【出来】 may be chu lai in Chinese, de ki in Japanese, xuý lãi in Vietnamese, or cheok rae in Korean.
In my workflow, calibre uses a dumb converter that turns my Chinese items into ugly "ASCII equivalents" (or what it believes them to be). Sometimes it reads them as Chinese (Mandarin or Cantonese), sometimes as Japanese...
So I suggest letting the user decide which romanization to use.
Implementation note: Use https://github.com/houbb/pinyin. I found it via https://search.maven.org/search?q=pinyin4j - and it seems to be the most maintained one. If that does not work, try https://github.com/belerweb/pinyin4j.
I checked https://sourceforge.net/p/pinyin4j/news/ and think, the library needs to be configured to (in bold)
I would again state that it may not be a good idea to do this naively in the main program: one cannot easily distinguish Chinese from the other CJK languages without native speakers or AI, and it will definitely BREAK other users' experience, especially for Japanese users, who use Hanzi (Kanji in Japanese) too but with a totally different romanization. Hanzi is not only the script of Chinese but of all CJK languages across Asia, just like the Latin characters we are using now. It is a complicated system. I suggest a low priority for this issue. Localization checks are not reliable, since multiple languages should be allowed in a single library. (I have seen this in calibre and I don't want to see it happen here, breaking all my bibliography...)
BibTeX allows for using the field "language" to indicate the language of the entry. Maybe, one could use that as input for the citation key generator.
@clzls I assume you are on UTF8 and use the non-ASCII characters also for your citation keys?
The issue is complicated with many different user "profiles". Maybe we need a preference?
The implementation is complicated since we need to accommodate users with different languages while ensuring a smooth user experience for those accustomed to the current system.
Maybe we could offer users the option to enable this function (default off). Other preconditions are also needed, such as ensuring that romanization only occurs when a valid "language" field is specified (as mentioned by koppor).
Perhaps we can extend romanization support not only to Chinese language but also other languages (Korean, Japanese, etc.). Allow different "language" fields to use distinct romanization schemes. Therefore, a consistent interface with customization options would be beneficial, allowing users to decide which "language" fields to romanize.
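To make the opt-in, per-language idea a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: `ROMANIZERS`, `romanize_key` and the two tiny lookup tables are made up for illustration and only stand in for real romanization libraries. The readings for 出来 are the ones quoted at the top of this issue.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for real romanization libraries: tiny lookup tables
# that only know the example characters from this thread.
def romanize_chinese(text: str) -> str:
    table = {"出": "chu", "来": "lai"}
    return "".join(table.get(ch, ch) for ch in text)

def romanize_japanese(text: str) -> str:
    table = {"出": "de", "来": "ki"}
    return "".join(table.get(ch, ch) for ch in text)

# One romanizer per value of the BibTeX "language" field (assumed lowercase names).
ROMANIZERS: Dict[str, Callable[[str], str]] = {
    "chinese": romanize_chinese,
    "japanese": romanize_japanese,
}

def romanize_key(key: str, language: Optional[str], enabled: bool = False) -> str:
    """Romanize a citation key only if the user opted in AND the entry
    carries a language we have a romanizer for; otherwise leave it alone."""
    if not enabled or not language:
        return key
    romanizer = ROMANIZERS.get(language.strip().lower())
    return romanizer(key) if romanizer else key

print(romanize_key("出来2018", "chinese", enabled=True))
print(romanize_key("出来2018", "japanese", enabled=True))
print(romanize_key("出来2018", None, enabled=True))  # no language field: unchanged
```

The point of the design is that the same characters give different keys depending on the "language" field, and nothing changes for users who leave the option off or do not set the field.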
Alternatively, a semi-automatic approach could suffice. We could introduce new options in the right-click menu (see figure below). We could also use “check integrity” to collect those entries with non-ASCII citation keys into one group, followed by “cleanup entries” for this group (see figure below). Thus, this method can also make it convenient for people who are in need.
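The "collect entries with non-ASCII citation keys" step of this semi-automatic approach is simple to express. A minimal Python sketch, where `non_ascii_keys` is a hypothetical helper name and not an existing JabRef function:

```python
from typing import Iterable, List

def non_ascii_keys(keys: Iterable[str]) -> List[str]:
    """Return the citation keys that contain at least one non-ASCII character,
    in their original order, so they can be put into one group for cleanup."""
    return [key for key in keys if not key.isascii()]

print(non_ascii_keys(["chen2012", "陈骁2012", "grüße"]))
```

Only the flagged group would then be offered to a cleanup action, so entries with plain ASCII keys are never touched.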
@clzls I assume you are on UTF8 and use the non-ASCII characters also for your citation keys?
Yes I do, and a bunch of my papers were written in Chinese, using tons of packages to tweak LaTeX compilers, to make them happy dealing with non-ASCII characters... (no one would write papers full of something like \symbol{28450}\symbol{35486}, I guess)
Perhaps we can extend romanization support not only to Chinese language but also other languages (Korean, Japanese, etc.). Allow different "language" fields to use distinct romanization schemes. Therefore, a consistent interface with customization options would be beneficial, allowing users to decide which "language" fields to romanize.
Looks good to me. Implemented this way, it works like an opt-in extension that can be extended to any language that needs ASCII equivalents (even Europeans may need it, for Danish or Greek, I think). I would go even further and suggest that dynamically loadable custom formatters may be an even better solution, so that everyone would be happy...
At a LaTeX conference, I learned from the LaTeX developers that it is now also possible to use Unicode with pdflatex and labels, e.g., \label{sec:grüße}. Moreover, it is also possible for citation keys.
Does it work for BibTeX too? (like \cite{grüße})
According to the LaTeX 3 team: Yes. Just ensure that you run latest TeXLive 😅
I use the latest MiKTeX, BibTeX and pdflatex. A citation key like \cite{grüße} works correctly, but a citation key like \cite{任政2018} causes an error.
The error message is:
! Undefined control sequence. \GenericError ...
! Emergency stop. \GenericError ...
@ehehela I asked LaTeX pros. It works on TeXLive. See https://chat.stackexchange.com/transcript/message/65511308#65511308
OK, it seems, some more "magic" is needed:
\documentclass{article}
\DeclareUnicodeCharacter{4EFB}{CJK Ideograph 4efb}
\DeclareUnicodeCharacter{653F}{CJK Ideograph 653f}
\begin{document}
\cite{任政2018}
\begin{thebibliography}{99}
\bibitem{任政2018} xxxx
\end{thebibliography}
\end{document}
Actually, that resolves the error, but the cite doesn't work. It doesn't need the definitions, but it does (currently) need something safe as the first token:
\documentclass{article}
\begin{document}
\cite{ 任政2018}
\cite{x任政2018}
\begin{thebibliography}{99}
\bibitem{任政2018} xxxx
\bibitem{x任政2018} xxxx
\end{thebibliography}
\end{document}
Although the official position is that cite keys should use ascii characters,
@koppor and @davidcarlisle Thank you.
I have tested two cases with pdflatex + bibtex + article: the first one also includes the ctex package to enable Chinese support, while the second one does not.
The results show that the first test only works for an ASCII citation key with a bibliography in Chinese, whereas the second test works for a non-ASCII citation key with a bibliography in English.
Therefore, I think pdflatex + bibtex may not fully support non-ASCII characters. xelatex + biber works.
The source code of the first test is:
\documentclass{article}
\usepackage{ctex}
\begin{filecontents*}{ref.bib}
@article{陈骁2012,
title={afdrgwfdsa},
author={sdfas and sadfsd and afsa and dasf},
journal={asdfsd},
volume={27},
number={2},
pages={133--138},
year={2012}
}
@article{chen2012,
title={基于电无级变速器的内燃机最优控制策略及整车能量管理},
author={陈骁 and 黄声华 and 万山明 and 庞珽},
journal={电工技术学报},
volume={27},
number={2},
pages={133--138},
year={2012}
}
\end{filecontents*}
\bibliographystyle{plain}
\begin{document}
\cite{ chen2012} % works for article class with ctex package and Chinese bibliography with ascii citation key
%\cite{ 陈骁2012} % works for article class and English bibliography with non-ascii citation key
\bibliography{ref}
\end{document}
The compilation result is:
The source code of the second test is:
\documentclass{article}
%\usepackage{ctex}
\begin{filecontents*}{ref.bib}
@article{陈骁2012,
title={afdrgwfdsa},
author={sdfas and sadfsd and afsa and dasf},
journal={asdfsd},
volume={27},
number={2},
pages={133--138},
year={2012}
}
@article{chen2012,
title={基于电无级变速器的内燃机最优控制策略及整车能量管理},
author={陈骁 and 黄声华 and 万山明 and 庞珽},
journal={电工技术学报},
volume={27},
number={2},
pages={133--138},
year={2012}
}
\end{filecontents*}
\bibliographystyle{plain}
\begin{document}
%\cite{ chen2012} % works for article class with ctex package and Chinese bibliography with ascii citation key
\cite{ 陈骁2012} % works for article class and English bibliography with non-ascii citation key
\bibliography{ref}
\end{document}
The compilation result is:
FYI: My thesis uses xelatex + bibtex, and citation keys with CJK characters work fine. ctexbook and a bunch of other packages are used, as it is a production environment.
As for JabRef, I am not yet sure what would be the best option to have the correct language in the entry preview, but when it comes to rendering the entry in LaTeX, there seems to be a limitation of pdflatex that can be worked around with xelatex, special commands/syntax or other packages.
Are you aware of Babel?
There is also LuaLaTeX.
@ThiloteE lualatex is the way to go :). pdflatex and xelatex should only be used if absolutely necessary :)
Idea: Maybe some of the Apache Lucene functionality can be used. There are these FoldingFilters. We used some of them in our LatexAwareAnalyzer (the ASCIIFoldingFilter).
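To illustrate what folding does and does not give us, here is a small Python sketch. It is not Lucene's filter: it uses plain Unicode normalization from the standard library as a stand-in for the idea, and `strip_diacritics` is a hypothetical name.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks.
    Accented Latin letters lose their accents; characters without a
    decomposition, such as CJK ideographs, pass through unchanged."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Müller2018"))
print(strip_diacritics("陈骁2012"))
```

In this sketch the Hanzi key comes out untouched, so folding of this kind could help with accented Latin keys but would still need a separate, language-aware romanization step for CJK.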
Here, an alternative citation key generation scheme is recommended for Chinese bibliographies: use the pinyin of the authors' names rather than the Chinese characters, which are non-ASCII.
For example, WanZheng2016 or WanZ2016 is preferred over the default 万征2016.
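A rough Python sketch of such a scheme, under loud assumptions: the `PINYIN` table is a hypothetical two-character stand-in for a real pinyin library, the surname is assumed to be the first character, and characters with several readings are not handled at all.

```python
# Hypothetical lookup table; a real implementation would query a pinyin library
# and would have to deal with characters that have more than one reading.
PINYIN = {"万": "wan", "征": "zheng"}

def pinyin_key(name: str, year: int, abbreviate: bool = False) -> str:
    """Build a citation key from a Chinese author name and a year.
    Assumes a single-character surname followed by the given name.
    Raises KeyError for characters missing from the lookup table."""
    surname, *given = [PINYIN[ch].capitalize() for ch in name]
    if abbreviate:
        given = [syllable[0] for syllable in given]  # keep initials only
    return surname + "".join(given) + str(year)

print(pinyin_key("万征", 2016))
print(pinyin_key("万征", 2016, abbreviate=True))
```

The two calls correspond to the two key styles mentioned above (full given name versus initial only).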