Add quality check and cleanup for problematic unicode characters

tobiasdiez commented 10 months ago

Is your suggestion for improvement related to a problem? Please describe.

Some Unicode characters cause problems, even with biblatex support (e.g., pdflatex still does not fully support Unicode). For example, Garcı́a gives

Package inputenc Error: Unicode character ́ (U+0301)

A few of such problematic characters are:

Describe the solution you'd like

As these characters are hard to recognize, it would be nice to have an integrity check that warns about them, and an automatic cleanup that converts them to their unproblematic equivalents (e.g. 0131 + 0301 to 00ED).
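
A minimal sketch of how such a warning could locate the hard-to-see characters, assuming it is enough to look for Unicode combining marks (general categories Mn, Mc, Me). `CombiningMarkFinder` is a hypothetical helper for illustration, not an existing JabRef class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of JabRef): reports combining marks in a field value. */
final class CombiningMarkFinder {

    /** Returns one entry per combining mark, formatted as "U+XXXX at index N". */
    static List<String> findCombiningMarks(String value) {
        List<String> findings = new ArrayList<>();
        int index = 0;
        while (index < value.length()) {
            int codePoint = value.codePointAt(index);
            int type = Character.getType(codePoint);
            // Mn = non-spacing mark, Mc = spacing combining mark, Me = enclosing mark
            if (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK) {
                findings.add(String.format(Locale.ROOT, "U+%04X at index %d", codePoint, index));
            }
            // Advance by one or two chars, depending on whether this is a surrogate pair
            index += Character.charCount(codePoint);
        }
        return findings;
    }
}
```

For the author value from the report, this would point at the U+0301 that follows the dotless i, which tells the user where in the field to look.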

Additional context

Might be helpful: https://github.com/zepinglee/citeproc-lua/blob/ab3ce712cc92073f12be26ff0b22b30eb906092d/citeproc/citeproc-latex-data.lua#L517

Siedlerchr commented 10 months ago

Have you tried converting them to LaTeX? We already have latex2unicode conversion and vice versa.

tobiasdiez commented 10 months ago

It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.

koppor commented 10 months ago

The feature will be listed in the Check integrity dialog of JabRef.

The implementation will be similar to org.jabref.logic.integrity.AmpersandChecker.

yuyan-z commented 9 months ago

Hi koppor, thank you for suggesting this issue to me! I would like to take it. I tried to reproduce the problem:

create an example test.bib with a problematic unicode character

@Article{test,
author = {Garcı́a},
title  = {Test Article},
}

import the test.bib into the library in JabRef. There's no error in this step

create an example document.tex

\documentclass[12pt]{article}

\begin{document}
    \begin{enumerate}
        \item Sample Citation: \cite{test}
    \end{enumerate}

    \bibliographystyle{apalike}
    \bibliography{test.bib}
\end{document}

build document.tex

$ pdflatex document.tex
$ bibtex document
$ pdflatex document.tex
$ pdflatex document.tex

There's the error

! LaTeX Error: Unicode character ́ (U+0301)
           not set up for use with LaTeX.

I wonder whether the goal is to automatically convert the problematic Unicode characters when importing or adding bib files in JabRef?

koppor commented 9 months ago

Perfectly reproduced! 👍

Did you see my comment https://github.com/JabRef/jabref/issues/10506#issuecomment-1783939962?

Click

Issue appears

[ ] Side TODO: Please let JabRef focus the tab where the issue occurs

I think what @tobiasdiez would like to have is a warning at a field if the field fails an integrity check:

Note that the non-ASCII check should be enabled only in bibtex mode, not in biblatex mode.

Note that the integrity checks should be turned on/off per library (maybe too much for this PR).

If one wants to get it compiling:

Try biber instead of bibtex. Or try bibtex8. The normal bibtex tool doesn't handle utf8 properly.

See https://tex.stackexchange.com/a/34136/9075 for a hint on using biber.

koppor commented 9 months ago

integrity check warning about them ... and an automatic cleanup .. It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.

@tobiasdiez

JabRef has a check for non-ASCII characters. See my screenshot at https://github.com/JabRef/jabref/issues/10506#issuecomment-1820790480. I think this fulfills your "integrity check warning" wish. Could you retry with your JabRef?
We have the unicode-to-latex conversion. We also have automatic save. Please try to activate the converter "on save".

On save, JabRef pops up "file was modified externally". Then, you even have a character diff.

Does that work for you?

@tobiasdiez @Siedlerchr I am not sure how to guide the student. I recommended putting the checkers into the entry editor (checking while typing), because it did not work there. Is this OK, or should we find yet another issue?

ThiloteE commented 9 months ago

The goal is not to automatically convert the symbols: while Unicode engines like LuaTeX and XeTeX can read the Unicode characters, older engines like pdfTeX have problems with them. We can bridge the gap by detecting these characters in JabRef and hope that pdfTeX will eventually catch up or, what is more likely, that users will stop using pdfTeX.

Given the choice, I would assume most people would prefer not having to convert an à in their text to

\`{a}

just because their font engine can't read it. They probably would prefer an engine that simply works without having to do magic conversions. By the way it was hard to cite this non-precomposed character in markdown xD

I am not sure why we would force users that already use more modern unicode engines to convert their precomposed unicode characters like à back into non-precomposed characters like

\`{a}

in their entries. Manual conversion is fine, but no need for automatic conversion I think, no?

pdfTeX is still maintained, but there are not a lot of updates to its repo. See here: https://tug.org/applications/pdftex/. PostScript fonts, which are natively supported by pdfTeX, seem outdated and are being dropped by many operating systems and applications, so at some point the reason for pdfTeX's existence will fade away and people will move to other font engines. I think we should make it hard for users to stick to the outdated pdfTeX and incentivise switching to Unicode-compatible engines.

I propose the path forward for JabRef should be as follows:

Have a (long) grace period with a warning that these characters are not supported by pdfTeX and offer converting characters to their unproblematic equivalents, but do not do so automatically, instead offer manual conversion in the cleanup dialogue. The warning should include pointing to alternative modern engines like LuaTeX or XeTeX that support unicode.
In a future version of JabRef (very far in the future), drop support for manual conversion and only offer unicode characters.

tobiasdiez commented 9 months ago

Note that the issue goes beyond the usual "bibtex is not compatible with unicode". As @ThiloteE correctly analyzed, the problem is the combination pdftex + biber (in particular the ascii check is not helpful).

The simplest solution would indeed be an automatic conversion of Unicode characters to Normal Form C, or at least combining Unicode characters if they have a single-character equivalent. So à can stay the same, but 0131 + 0301 is converted to 00ED (but not to its LaTeX code). Since by definition these Unicode representations are equivalent, lualatex/xetex will display the same character; it's just to help pdftex. Alternatively, implement it as a save action that is on by default.
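
A minimal sketch of the Normal Form C conversion using the JDK's `java.text.Normalizer`. The class name `NfcDemo` is made up for illustration. One caveat to keep in mind: the standard composition data pairs U+00ED with a regular `i` (U+0069) plus U+0301, so a dotless `ı` (U+0131) followed by U+0301 is left unchanged by plain NFC and would need separate handling.

```java
import java.text.Normalizer;

/** Illustrative only: thin wrapper around the JDK normalizer. */
final class NfcDemo {

    /** Composes base character + combining mark sequences where Unicode defines a precomposed form. */
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    /** True if the input would not be changed by toNfc. */
    static boolean isNfc(String input) {
        return Normalizer.isNormalized(input, Normalizer.Form.NFC);
    }
}
```

With this, `toNfc("a\u0300")` yields the single character à, and an already precomposed à passes through untouched, so engines that handle Unicode see no visible difference.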

koppor commented 9 months ago

Ah, I see.

Naively, this can be achieved by running unicode-to-latex and then latex-to-unicode, because our Unicode tables use Normal Form C. However, this is a lot of effort (see https://github.com/JabRef/jabref/pull/6155).

The functionality would be similar to org.jabref.logic.layout.format.ReplaceUnicodeLigaturesFormatter, but for the character "compression".

@tobiasdiez Do you propose a manual table like our org.jabref.logic.util.strings.UnicodeLigaturesMap#UnicodeLigaturesMap, but for Normal Form C? -- If yes, then this is a good first issue. Otherwise, I need to take back the assignment as good-first-issue.

tobiasdiez commented 9 months ago

@tobiasdiez Do you propose a manual table like our org.jabref.logic.util.strings.UnicodeLigaturesMap#UnicodeLigaturesMap, but for Normal Form C? -- If yes, then this is a good first issue. Otherwise, I need to take back the assignment as good-first-issue.

Yes. It also doesn't have to cover all known character compressions. The ones containing some of the problematic characters linked in the issue description should be good enough for now.
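
A rough sketch of what such a manual table could look like, assuming it is applied on top of the JDK's NFC normalization. The class `ManualCompositionTable` and its single entry are illustrative (the entry is the 0131 + 0301 to 00ED example from the issue description); it is not the actual JabRef table. A manual entry is useful here because standard NFC does not compose a dotless i with a following accent.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical table-based cleanup, modelled loosely on the idea of a ligature map. */
final class ManualCompositionTable {

    // Sequences that plain NFC leaves alone, mapped to the single character we want instead.
    private static final Map<String, String> EXTRA_COMPOSITIONS = new LinkedHashMap<>();

    static {
        // Dotless i (U+0131) + combining acute (U+0301) -> i with acute (U+00ED)
        EXTRA_COMPOSITIONS.put("\u0131\u0301", "\u00ED");
        // Further problematic sequences would be added here.
    }

    /** Applies NFC first, then the manual replacements. */
    static String compose(String input) {
        String result = Normalizer.normalize(input, Normalizer.Form.NFC);
        for (Map.Entry<String, String> entry : EXTRA_COMPOSITIONS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Keeping the table small and explicit matches the "good enough for now" scope: each entry is easy to review, and sequences not listed are simply left to NFC.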

Siedlerchr commented 9 months ago

The Latex2Unicode library also uses NFC, or at least we use it: https://github.com/JabRef/jabref/blob/4718930a6f32d94956caa352c49777864ea2b823/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java#L28-L31

https://github.com/JabRef/jabref/blob/4718930a6f32d94956caa352c49777864ea2b823/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java#L43-L46

koppor commented 8 months ago

Two things to do

New integrity check: String result = normalize-to-nfc(input); raise an error if result != input
New cleanup/FieldFormatter/...: result = normalize-to-nfc(input);

harsh1898 commented 7 months ago

Hi @koppor, as you mentioned, we need to do two things. Can you clarify which result you are referring to, or elaborate on your last comment?

I have successfully reproduced the issue and understood it with the help of the comments in this thread.

koppor commented 7 months ago

@harsh1898 Do you know Ctrl+Shift+F in IntelliJ? Here, you can search for code.

Integrity Check

The class is org.jabref.logic.integrity.IntegrityCheck. With Alt+F1 and then Enter, you can navigate to the package in the project view. There, you find other integrity checks. I browsed around and found ValueChecker. I think the implementation is as follows:

Implement UnicodeNormalFormCCheck in package org.jabref.logic.integrity. It implements interface ValueChecker.
- See https://github.com/JabRef/jabref/issues/10506#issuecomment-1820998978 for an implementation hint.
- Use Ctrl+Shift+T to generate a skeleton of a test class. You can see other test classes outlining how to implement (e.g., org.jabref.logic.integrity.BibStringCheckerTest)
Add the UnicodeNormalFormCCheck to org.jabref.logic.integrity.FieldCheckers#getAllMap (in the biblatex mode branch).
Check if it appears in the UI and test it with the example
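
The checker described in these steps could look roughly like the following sketch. The interface `ValueCheckerSketch` is a stand-in declared here so the example compiles on its own; JabRef's real `ValueChecker` interface, the message wording, and any localization are assumptions and may differ.

```java
import java.text.Normalizer;
import java.util.Optional;

/** Stand-in for JabRef's ValueChecker (assumed shape: returns a message if the value is problematic). */
interface ValueCheckerSketch {
    Optional<String> checkValue(String value);
}

/** Sketch of the proposed check: flags values that are not in Unicode Normal Form C. */
final class UnicodeNormalFormCCheckSketch implements ValueCheckerSketch {

    @Override
    public Optional<String> checkValue(String value) {
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        // Equivalent to comparing the value with its normalized form, without building a new string
        if (Normalizer.isNormalized(value, Normalizer.Form.NFC)) {
            return Optional.empty();
        }
        return Optional.of("Value is not in Unicode's Normal Form \"Canonical Composition\" (NFC)");
    }
}
```

A test class for it would feed in a decomposed value such as `i` followed by U+0301 and expect a message, and a precomposed value and expect none.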

New cleanup action

Create a new formatter NormalizeUnicodeFormatter in org.jabref.logic.formatter.bibtexfields. Also create test cases
Add it to org.jabref.logic.formatter.Formatters#getOthers.
Check if it appears in the UI and test it with the example
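
A sketch of the formatter side, under the assumption that a formatter essentially exposes a key and a `format` method. The class below does not extend JabRef's actual formatter base class, and the key `normalize_unicode` is a made-up placeholder.

```java
import java.text.Normalizer;
import java.util.Objects;

/** Sketch only: the real formatter would plug into JabRef's formatter infrastructure. */
final class NormalizeUnicodeFormatterSketch {

    /** Hypothetical identifier under which the formatter would be registered. */
    String getKey() {
        return "normalize_unicode";
    }

    /** Returns the value in Normal Form C; applying it twice gives the same result as applying it once. */
    String format(String value) {
        Objects.requireNonNull(value);
        return Normalizer.normalize(value, Normalizer.Form.NFC);
    }
}
```

Because the operation is idempotent, it is safe to run as a cleanup repeatedly or on every save: a value that is already normalized comes back unchanged.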

github-actions[bot] commented 7 months ago

As general advice for newcomers: check out Contributing for a start. Also, the guidelines for setting up a local workspace are worth having a look at.

Feel free to ask here at GitHub if you have any issue-related questions. If you have questions about how to set up your workspace, use JabRef's Gitter chat. Try to open a (draft) pull request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.

harsh1898 commented 7 months ago

Hi @koppor, following your suggestion, I have tried to fix this issue with some updates to the code.

You can review my changes in pull request #10817.

