acl-org / acl-style-files

Official style files for papers submitted to venues of the Association for Computational Linguistics

Making Unicode supported LaTeX template as the default #7

Open thammegowda opened 2 years ago

thammegowda commented 2 years ago

We have been using pdfLaTeX as the default compiler/engine, but as we know, it isn't friendly to Unicode (non-Latin) text. Although the instructions suggest using XeLaTeX, the generated PDF differs from pdfLaTeX's in many ways. For example (left: pdfLaTeX, right: XeLaTeX), look at the nuances in the fonts: the section headings on the right aren't as bold as pdfLaTeX's on the left. I believe the font weight isn't exactly the same.

[Screenshot: pdfLaTeX output (left) next to XeLaTeX output (right)]

My request/suggestion: move towards a Unicode-supporting template as a way of encouraging NLP in non-Latin languages. Researchers working on non-Latin languages should also be able to paste qualitative examples as text (without resorting to non-vector images), right? So, how about making a Unicode-supporting template (i.e., XeLaTeX) the default?

If anyone is interested in testing the Unicode support of LaTeX templates, here is a file with UDHR titles in hundreds of languages: udhr-title.txt

Thanks,

mbollmann commented 2 years ago

So the proceedings template contains these lines, which are really specific to pdfLaTeX and shouldn't be used with the newer engines:

\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

If I compile that on Overleaf, download the PDF, and check the fonts that are used with pdffonts, I get this:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
WFBODD+NimbusMonL-Regu               Type 1            Custom           yes yes yes     60  0
JKDAPA+NimbusRomNo9L-ReguItal        Type 1            Custom           yes yes yes     73  0
ARYBWX+NimbusRomNo9L-Medi            Type 1            Custom           yes yes yes     56  0
HPBZPC+NimbusRomNo9L-Regu            Type 1            Custom           yes yes yes     58  0
ALVSVR+NimbusSanL-Bold               Type 1            Custom           yes yes yes     59  0

So at least Overleaf uses the "Nimbus" fonts when including the "times" package. That makes me think that with XeLaTeX or LuaLaTeX, the above lines in the template should be replaced with:

\usepackage{fontspec}
\setmainfont{Nimbus Roman}
\setsansfont{Nimbus Sans}
\setmonofont{Nimbus Mono}

(EDIT: TeX Gyre Termes is probably better, since it's supposed to be the same but with more features.)

I think it would make sense to check in the .sty file which TeX engine is used, and modify the font-related commands accordingly. @davidweichiang Would it make sense if I tried to prepare a pull request for something like this?
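
A minimal sketch of what such an engine check could look like, assuming the `iftex` package is available. This is not the actual ACL style code; the TeX Gyre Heros and TeX Gyre Cursor choices for the sans and mono families are assumptions that were not settled in this thread, and the OpenType fonts must be installed where the engine can find them by name.

```latex
% Hypothetical excerpt for the .sty file; illustration only.
\RequirePackage{iftex}
\iftutex
  % XeLaTeX or LuaLaTeX: load OpenType fonts through fontspec.
  \RequirePackage{fontspec}
  \setmainfont{TeX Gyre Termes}
  \setsansfont{TeX Gyre Heros}   % assumption: stand-in for Nimbus Sans
  \setmonofont{TeX Gyre Cursor}  % assumption: stand-in for Nimbus Mono
  \RequirePackage{latexsym}
\else
  % pdfLaTeX: keep the existing 8-bit font setup.
  \RequirePackage{times}
  \RequirePackage{latexsym}
  \RequirePackage[T1]{fontenc}
  \RequirePackage[utf8]{inputenc}
\fi
```

`\iftutex` is true under both XeTeX and LuaTeX, so the pdfLaTeX-specific packages are only loaded when the engine actually needs them.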

davidweichiang commented 2 years ago

This all sounds great. But we need to set it up so that it looks the same either way.

davidweichiang commented 2 years ago

Also related if any pub chairs are still using it: https://github.com/yz-joey/ACLPUB/issues/7

davidweichiang commented 2 years ago

XeLaTeX has a major disadvantage, which is that arXiv does not support it. So I don't think it can be made the default (yet). But I definitely agree with making it an option.

@thammegowda In your example, the one on the right is set in Computer Modern, not Times Roman. So something is wrong with the font setup.

thammegowda commented 2 years ago

The modifications I made to add some Unicode text were:

  1. enable babel

    \usepackage[english]{babel} % English as the main language
    \babelprovide[import]{hindi}
    \babelprovide[import]{arabic}
    \babelprovide[import]{kannada}
    \babelfont[*devanagari]{rm}{Lohit Devanagari}
    \babelfont[*arabic]{rm}{Noto Sans Arabic}
  2. Paste some Arabic and Hindi text

    Hindi: \foreignlanguage{hindi}{मानव अधिकारों की सार्वभौम घोषणा} Arabic: \foreignlanguage{arabic}{الإعلان العالمي لحقوق الإنسان}
  3. Switch the compiler to XeLaTeX, since pdfLaTeX could not compile it. I also had to comment out \pdfoutput=1 for XeLaTeX.

    I didn't explicitly modify the fonts for English/Latin. Is the babel import messing up the default fonts for English? Sorry, I am not a *TeX pro. Here is my Overleaf project for reference: https://www.overleaf.com/project/61d4c64cbc3e72789d2de4bc
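
Instead of commenting the line out by hand, the assignment could be guarded so that the same source compiles with either engine. This is only a sketch: it relies on the e-TeX primitive `\ifdefined`, and on the fact that XeTeX does not define `\pdfoutput`. Whether arXiv's processing still recognizes the guarded form has not been checked here.

```latex
% \pdfoutput is a pdfTeX primitive that XeTeX does not define,
% so only set it when the running engine knows about it.
\ifdefined\pdfoutput
  \pdfoutput=1
\fi
```

Under pdfLaTeX this behaves like the unguarded line; under XeLaTeX the assignment is skipped.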

mbollmann commented 2 years ago

Well, I would say arXiv has a major disadvantage in that it doesn't support XeLaTeX/LuaLaTeX, but I can see how we should make sure to support it ;)

@thammegowda The default font is Computer Modern. To get the correct font for the current *ACL template, both \usepackage{times} and \usepackage[T1]{fontenc} are important.

thammegowda commented 2 years ago

@mbollmann I agree, and I hope arXiv realizes this shortcoming and makes an update.

Also, I have these two lines

\usepackage{times}
\usepackage[T1]{fontenc}

I didn't remove these two, but is XeLaTeX using Computer Modern anyway? That's surprising!

mbollmann commented 2 years ago

@thammegowda Ah, maybe it is overwritten by something else in your preamble then. I can't access your Overleaf project; it's restricted. Maybe try moving the "times" import further down?

thammegowda commented 2 years ago

@mbollmann I think the babel package is causing the issue. If I move times, fontenc, and microtype below babel, the Latin fonts look as intended, but Arabic and Hindi stop working (the text doesn't even appear).

\usepackage[english]{babel} % English as the main language
\babelprovide[import]{hindi}
\babelprovide[import]{arabic}
\babelprovide[import]{kannada}
\babelfont[*devanagari]{rm}{Lohit Devanagari}
\babelfont[*arabic]{rm}{Noto Sans Arabic}

\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage{microtype}

Here is an Overleaf link: https://www.overleaf.com/read/vbyhzmssdkkb (it worked for me in a private/incognito window). If we could share a working example with this text, it'd be very useful.

Hindi: मानव अधिकारों की सार्वभौम घोषणा
Arabic: الإعلان العالمي لحقوق الإنسان

mbollmann commented 2 years ago

@thammegowda Not an expert with Babel, but I think as soon as you use a \babelfont, you need to define an explicit Latin font as well. I haven't found a way to get the exact same font as LaTeX's ptm family (which is what "times" uses), but if you add

\babelfont{rm}{TeX Gyre Termes}

before you load the other, language-specific fonts, you get something virtually indistinguishable from it.
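
Putting the pieces from this thread together, a complete test document might look like the sketch below. It assumes compilation with XeLaTeX or LuaLaTeX, uses the plain `article` class rather than the ACL style, and assumes that TeX Gyre Termes, Lohit Devanagari, and Noto Sans Arabic are installed where the engine can find them.

```latex
% Compile with XeLaTeX or LuaLaTeX (not pdfLaTeX).
\documentclass{article}

\usepackage[english]{babel} % English as the main language
\babelprovide[import]{hindi}
\babelprovide[import]{arabic}

% Explicit Latin font first, then the script-specific fonts.
\babelfont{rm}{TeX Gyre Termes}
\babelfont[*devanagari]{rm}{Lohit Devanagari}
\babelfont[*arabic]{rm}{Noto Sans Arabic}

\begin{document}

Hindi: \foreignlanguage{hindi}{मानव अधिकारों की सार्वभौम घोषणा}

Arabic: \foreignlanguage{arabic}{الإعلان العالمي لحقوق الإنسان}

\end{document}
```

The `times` and `fontenc` lines are deliberately left out, since they target pdfLaTeX. Correct right-to-left word order for the Arabic line may additionally need babel's `bidi` option (for example `bidi=default`); that was not discussed in this thread and is untested here.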

thammegowda commented 2 years ago

That works! Thanks.

venkatasg commented 1 year ago

I was just looking into whether there were efforts to move away from pdfLaTeX to make the ACL style files more Unicode-friendly, and I'm glad I found this issue thread. I have two suggestions, and can help with the migration in these respects:

Further decisions probably need to be made about sans-serif and monospaced fonts, but none that can't be solved with some research.