More issues with escaped unicode

blegat commented 1 day ago

Follow up from https://github.com/JuliaDocs/DocumenterCitations.jl/issues/78

With the script

using DocumenterCitations
bib = CitationBibliography("bug.bib")
DocumenterCitations.format_bibliography_reference(bib.style, bib.entries["key"])

For

@misc{key,
  author = {{\"U}nl{\"u}, {\c C}a{\u g}lar},
}

I get

ERROR: LoadError: ArgumentError: Premature end of tex string: BoundsError("{\\c", 4)
Stacktrace:
  [1] tex_to_markdown(tex_str::SubString{String}; transform_case::Function, debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:135

When I try DocumenterCitations.tex_to_markdown(raw"{\"U}nl{\"u}, {\c C}a{\u g}lar"), I get "\"Unl\"u, Çağlar", which does seem weird because \"u is not replaced by the Unicode character.

With

@inproceedings{key,
  author = {Mikolov, Tom{\'a}{\v s}},
}

I get

caused by: BoundsError: attempt to access 11-codeunit SubString{String} at index [12]
Stacktrace:
  [1] checkbounds
    @ ./strings/basic.jl:216 [inlined]
  [2] getindex
    @ ./strings/substring.jl:100 [inlined]
  [3] _collect_group(tex_str::SubString{String}, i::Int64)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:407
  [4] _process_tex(tex_str::SubString{…}; transform_case::DocumenterCitations.var"#73#75", debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:196
  [5] tex_to_markdown(tex_str::SubString{String}; transform_case::Function, debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:131
  [6] tex_to_markdown
    @ ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:125 [inlined]
  [7] _initial(name::String)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/formatting.jl:25

but when I run DocumenterCitations.tex_to_markdown(raw"Mikolov, Tom{\'a}{\v s}") directly, I correctly get "Mikolov, Tomáš".

goerz commented 1 day ago

I’ll look into this at some point.

It seems like the zotero-better-bibtex plugin has an option to keep Unicode.

You should absolutely 100% enable that. I’m actually really confused about the statement in their README

Unfortunately, for those shackled to BibTeX and who cannot (yet) move to BibLaTeX, unicode is a major PITA.

I have all my .bib files in Unicode, and I’m using plain BibTeX, not BibLaTeX. It has “just worked” for the last 15 years (maybe since pdflatex started to exist?). As far as I can tell, it’s just not a problem anymore, and nobody should use these tex escapes anymore.

blegat commented 1 day ago

Good point: if I untick the checkbox "Export unicode as plain-text...", I get rid of the errors. If I also select "in the 'url' field" (below, in the screenshot), I also get rid of the warnings complaining that there is a "urldate" without a "url", because by default "Add URLs to BibTeX export" was set to "No". So I think you can recommend that Zotero users use these settings.

I also tried BibLaTeX export but I got an error, see https://github.com/Humans-of-Julia/BibInternal.jl/issues/33

trontrytel commented 1 day ago

I got similar errors with DocumenterCitations v1.3.5 yesterday (things were fine with older versions). I was able to fix them by removing TeX syntax: https://github.com/CliMA/CloudMicrophysics.jl/pull/483

Thank you!

goerz commented 1 day ago

The reason things that worked in v1.3.4 stopped working in v1.3.5 is that the solution to #78 was to try to convert LaTeX to Unicode before obtaining the initials for first names. That means first names are now processed, while they weren't before, and if anything in a first name trips up the conversion, it breaks. I actually ran into that myself.
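To illustrate why the conversion has to happen before the initial is taken, here is a minimal sketch in plain Julia. The helper naive_initial is hypothetical and is not the DocumenterCitations implementation; it only shows what goes wrong when the first character of an unconverted first name is used as the initial.

```julia
# Hypothetical helper for illustration only (NOT DocumenterCitations code):
# take the first character of a first name and append a period.
naive_initial(name::AbstractString) = string(first(name), ".")

# With TeX escapes still in place, the "initial" is the opening brace.
println(naive_initial(raw"{\c C}a{\u g}lar"))  # prints {.

# With the name already in Unicode, the initial is what one would expect.
println(naive_initial("Çağlar"))               # prints Ç.
```

This is also why any construct the converter cannot handle now surfaces as an error for first names, where previously the raw string was never inspected.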

Ultimately, the bottom line is that DocumenterCitations requires Unicode. Any handling of LaTeX commands will always be an incomplete and heuristic fallback, and not officially supported.

goerz commented 6 hours ago

For […] author = {{\"U}nl{\"u}, {\c C}a{\u g}lar} […] I get […] Premature end of tex string: BoundsError("{\\c", 4)

This particular case seems to be a bug in Bibliography.jl: https://github.com/Humans-of-Julia/BibParser.jl/issues/39

I also think that zotero-better-bibtex isn't really using the "correct" escape sequences here. They should probably stick to the ones officially supported by BibTeX. For this example, that would be

@misc{Unlu2024,
  title = {More issues with escaped unicode},
  author = {\"{U}nl\"{u}, \c{C}a\u{g}lar},
  year = {2024},
  note = {Bug Report #85},
}

which works fine.

When I try DocumenterCitations.tex_to_markdown(raw"{\"U}nl{\"u}, {\c C}a{\u g}lar"), I get "\"Unl\"u, Çağlar", which does seem weird because \"u is not replaced by the Unicode character.

No, that's actually an issue with the raw string. Raw strings in Julia aren't quite as raw as one might think: quotes still have to be escaped, and to keep a literal backslash in front of a quote, that escape has to be escaped as well. You'd have to write that as

@test tex_to_markdown(raw"{\\\"U}nl{\\\"u}, {\c C}a{\u g}lar") == "Ünlü, Çağlar"

which works.
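The raw-string behavior can be checked in plain Julia without loading any package. The sketch below uses only Base; the variable names naive and escaped are just for illustration.

```julia
# In a raw string, \" collapses to a bare quote: the backslash is consumed.
naive = raw"{\"U}nl{\"u}"

# Three backslashes before a quote yield one literal backslash plus the quote.
escaped = raw"{\\\"U}nl{\\\"u}"

println(naive)    # prints {"U}nl{"u}
println(escaped)  # prints {\"U}nl{\"u}

# The naive version contains no backslash at all, so the TeX parser
# never sees the \" accent command.
println(occursin('\\', naive))        # prints false
println(count(==('\\'), escaped))     # prints 2
```

A backslash that is not followed by a quote, as in {\c C}, is left untouched by the raw string, which is why only the umlaut accents were affected.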

goerz commented 5 hours ago

@trontrytel

I got similar errors with DocumenterCitations v1.3.5 yesterday (things were fine with older versions). I was able to fix them by removing TeX syntax: https://github.com/CliMA/CloudMicrophysics.jl/pull/483

The only entry I can reproduce as failing is Lehtinen2007, and that's failing due to the same bug in Bibliography: https://github.com/Humans-of-Julia/BibParser.jl/issues/39#issuecomment-2480606573

Unfortunately, your "fix" of removing the braces is actually not correct: it changes the last name "Dal Maso" to "Maso" with "Dal" as a middle name. The correct way to handle this is to use the "Last, First" format.

@article{Lehtinen2007,
  title = {Estimating nucleation rates from apparent particle formation rates and vice versa: Revised formulation of the Kerminen–Kulmala equation},
  author = {Lehtinen, Kari E.J. and Dal Maso, Miikka and Kulmala, Markku and Kerminen, Veli-Matti},
  journal = {Journal of Aerosol Science},
  volume = {38},
  number = {9},
  pages = {988-994},
  year = {2007},
  doi = {10.1016/j.jaerosci.2007.06.009}
}

I strongly recommend always using that format (and making sure that any automatic exporter uses it).
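As a rough illustration of why the two formats differ, here is a deliberately simplified sketch of name splitting in plain Julia. The function split_name is hypothetical: it is not the parser used by Bibliography.jl or BibTeX, and it ignores braces, lowercase "von" particles, and "Jr" parts. It only assumes that without a comma the last word is taken as the family name.

```julia
# Hypothetical, simplified name splitter (illustration only).
# "Last, First": everything before the comma is the family name.
# "First Last":  only the final word is taken as the family name.
function split_name(name::AbstractString)
    if occursin(',', name)
        family, given = strip.(split(name, ','; limit=2))
        return (given=String(given), family=String(family))
    else
        words = split(name)
        return (given=join(words[1:end-1], " "), family=String(words[end]))
    end
end

println(split_name("Miikka Dal Maso"))   # family name comes out as "Maso"
println(split_name("Dal Maso, Miikka"))  # family name comes out as "Dal Maso"
```

With the comma form, no guessing is needed about where the family name starts, so multi-word last names survive without protective braces.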

goerz commented 4 hours ago

So this doesn't really seem actionable on my side, but I'll keep this issue open until https://github.com/Humans-of-Julia/BibParser.jl/issues/39 is resolved.

Meanwhile, there's some additional testing in b8c5de304741943a142afa3e1552d4d3995f5269.

JuliaDocs / DocumenterCitations.jl

More issues with escaped unicode #85