Add integrity check for LaTeX special characters

ThiloteE commented 2 years ago

This issue was inspired by inspecting and working on issues #8673, https://github.com/plurimath/unicode2latex/issues/19, #8650, #8682 and #3644

Problem:

Special characters do special things in LaTeX (or when being compiled by Biber).
- See Page 15 (Table 1) for escapeable "special" characters in LaTeX
- See Page 15 (Table 2) for Predefined LATEX 2𝜀 Text-mode Commands\
- See Page 16 (Table 3) for "Commands Defined to Work in Both Math and Text Mode"\
Special characters are hard to convert from LaTeX to Unicode and vice versa, because context is relevant! (e.g. being in Text Mode or Math Mode or another special LaTeX environment). Using special characters may necessitate user action.

Desired solution:

Users should be able to easily find existing "special characters" in their database.
Create integrity check(s) for having the following symbols in your entries:
- From Table 1: LaTeX escapable "special" characters:
- [ ] %, $, _, # and &
- Not necessary: Detect \%, \$, \_, \# and \&, because this should be handled at text paste.
- From Table 2: Predefined LATEX 2𝜀 Text-mode Commands:
  - [ ] See Table 2. I am too lazy to copy / paste them all here.
  - Comment: I tried pasting a few of these symbols into JabRef. NOT ALL could be entered into the JabRef title field. Feel free to experiment and implement an integrity check for characters that can easily be used with JabRef. Would be nice to create a list of Symbols from Table 2 that JabRef can't easily deal with.
- Table 3: Commands Defined to Work in Both Math and Text Mode:
  - [ ] ... - PS. not yet sure about this one. Maybe the ... should be added to the Unicode2LaTeX converter instead. Please Exclude this character for now. Work on the others first!
  - All the other characters in table 3 can be converted from Unicode to LaTeX and from LaTeX to Unicode (except _ and \_), so no need to implement an integrity check for them! :-)

Additional info:

I believe an integrity check for # is already implemented. Maybe this integrity check can be used as a template.

ThiloteE commented 2 years ago

The workflow for users is as follows:

Decide if Database should be exported to LaTeX/OpenOffice/LibreOffice/MicrosoftOffice (Each of them may require entries to be formatted in a specific way (e.g. LaTeX, Unicode, HTML etc.) to be rendered correctly).
Use integrity check
Decide if characters found by integrity check are correct or not.
If not, change characters
- Either manually
- Or via "cleanup actions"

systemoperator commented 11 months ago

@Siedlerchr @koppor With this new feature, I receive now tons of LaTeX Parsing Errors, when using "Check integrity", which irritated me for a while and made me think:

Field: File Message: LaTeX Parsing Error: Command \: cannot be used in PARAG...

This results because of using the JabRef Browser Extension, which adds linked files as follows:

file = {Full Text PDF:https\://www.jair.org/index.php/jair/article/download/11675/26513:application/pdf;Snapshot:https\://www.jair.org/index.php/jair/article/view/11675:text/html;:Jauhiainen2019 - Automatic Language Identification in Texts_ a Survey.pdf:PDF},

koppor commented 11 months ago

@systemoperator Thank you for letting us know! CC @Zylence

koppor commented 11 months ago

@systemoperator A new version with a fix be available in 30min at https://github.com/JabRef/jabref/pull/10401. There should then be a comment appearing linking to the binaries.

systemoperator commented 11 months ago

@systemoperator A new version with a fix be available in 30min at #10401. There should then be a comment appearing linking to the binaries.

Awesome! I'm gonna check this out.

Zylence commented 11 months ago

This results because of using the JabRef Browser Extension

I am terribly sorry, I did not check for this usecase. ( Honestly I only tested manual input. 😐)

koppor commented 11 months ago

No worries at all - and thank @systemoperator for the quick report!

Background: This is one consequence of our "merge early" development strategy. Instead of asking users to try out PRs, we "just" merge and fix ASAP if something goes wrong. We get more velocity into JabRef - and I think, this good for the tool as whole. -- Otherwise, we had many PRs piling up, because some details were missing. Then, they got abandoned and never got revived. OK, most of the non-merged PRs were 20% solutions. These currently pile up at https://github.com/koppor/jabref/pulls

systemoperator commented 11 months ago

@koppor The mentioned error has gone now. :)

I've noticed other things for imported entries, using the JabRef Browser Extension:

For DOI fields: LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode Examples: 10.1007/0-387-34805-0_39, 10.1007/3-540-49264-X_24

For Comment, Comment-username, Abstract, Title and Journaltitle fields: LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode LaTeX Parsing Error: Undefined command \textgreater LaTeX Parsing Error: Undefined command \textbackslash LaTeX Parsing Error: Undefined command \textbar LaTeX Parsing Error: Undefined command \textless

koppor commented 11 months ago

TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.

Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?

Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?

koppor commented 11 months ago

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

koppor commented 11 months ago

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

systemoperator commented 11 months ago

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores. e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021

Zylence commented 11 months ago

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores. e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021

In that case it would be expected to fail, as _ and ^ are only allowed in latex math environments.

systemoperator commented 11 months ago

TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.

Sure. It's nothing, which breaks something, it's just informative.

Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?

Or only allowing URL characters. (maybe excluding &, ? and #?)

Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?

I guess, there would be more commands to consider, which can show up.

systemoperator commented 11 months ago

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

Concerning the title, some authors use extraordinary characters there. :thinking:

Zylence commented 11 months ago

LaTeX Parsing Error: Undefined command \textgreater LaTeX Parsing Error: Undefined command \textbackslash LaTeX Parsing Error: Undefined command \textbar LaTeX Parsing Error: Undefined command \textless

I guess, there would be more commands to consider, which can show up.

The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: https://github.com/koppor/jabref/pull/646#issue-1714138154 (takeaways section)

The latter is probably faster.

systemoperator commented 11 months ago

I have pasted URLs into the URL field, where also some parameters exist, which are concatenated with the &. Resulting in: Found 2 unescaped '&'.

The same applies to the File field, where the JabRef Browser Extension sends the Links, which can also contain "&" characters. Resulting in: Found 2 unescaped '&'.

Examples:

Zylence commented 11 months ago

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

Ideas ranked from most to least favorite:

As you said, exclude fields (at least those that are filled in automatically by for example the browser extension)
Separate the latex integrity check (but that would be bad style)
Enforce url encoding as in anything other than a url you should simply be able to escape the problematic characters. (but that's not userfriendly and may cause other similar errors)

Can not really come up with a better idea then the one proposed first. May need a little more time to think.

Zylence commented 11 months ago

Found 2 unescaped '&'

Okay, but that's the "fault" of the AmpersandChecker which was introduced earlier.

~As the problem seems to be predominantly induced by urls, url-encoding as solution now ranks higher in my opinion. We could intercept paste and additionally warn the user.~

~So for example:~ ~> https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=948091115~

~becomes:~ ~https%3A%2F%2Fen.wikipedia.org%2Fw%2Findex.php%3Ftitle%3DInformation%5fextract%0Aion%26oldid%3D948091115~

~which is manageable.~

[Edit]: Just realized that won't make much sense for that case, since the ampersands are not part of the queryparameter.

Siedlerchr commented 11 months ago

Please don't mess with url fields. Rather exclude them. As far as I know url is also a verbatim field

Julian Kirsch @.***> schrieb am Fr., 22. Sept. 2023, 01:45:

Found 2 unescaped '&'

Okay, but that's the "fault" of the AmpersandChecker which was introduced earlier https://github.com/JabRef/jabref/pull/9758.

As the problem seems to be predominantly induced by urls, url-encoding now ranks higher in my opinion. We could intercept paste and additionally warn the user.

So for example:

https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=948091115

becomes: https%3A%2F%2Fen.wikipedia.org %2Fw%2Findex.php%3Ftitle%3DInformation%5fextract%0Aion%26oldid%3D948091115

which is manageable.

— Reply to this email directly, view it on GitHub https://github.com/JabRef/jabref/issues/8712#issuecomment-1730500851, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOFZHM65TJEOFXUUIARODX3TGSFANCNFSM5UCQBSGQ . You are receiving this because you were mentioned.Message ID: @.***>

koppor commented 11 months ago

Please don't mess with url fields. Rather exclude them. As far as I know url is also a verbatim field

Then the ampersand checker needs to be updated to exclude verbatim fields. #quickwin

koppor commented 11 months ago

The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: koppor#646 (comment) (takeaways section)

As usual, I would do both ^^ -- PR to https://github.com/davemckain/snuggletex (reference is more for me)

However, I am not 100% sure about the error "Undefined command". We can never be package-complete. On the one hand, I assume the sources come from publishers, which "should" use correct commands. On the other hand, we should add the most common commands to avoid "typos" (\textsupsrcipt vs. \textsuperscript (correct)).

systemoperator commented 11 months ago

@koppor I've checked the latest build.

I would recommend, allowing underscores in DOI fields, since they are not infrequent:

In the comment-{username} field, I would recommend allowing any character.

Furthermore, in my opinion, it would make sense to adapt the messages in the integrity check so that "LaTeX Parsing Error" is only printed, if there is REALLY some LaTeX parsing error. (Initially, I thought, something is broken in my file, but actually, everything was fine.) I think, a message like e.g. "LaTeX Parsing Warning" or something different (like "LaTeX Inconsistency", or even "LaTeX Improvement") would be more appropriate, so that a user does not panic, when reading it. Basically any message in the integrity check suggests improvements in terms of inconsistencies, without anything being broken. (Probably, if something is really broken, like a wrongly inserted "}" or "{", in the bib-file, I guess, even Jabref cannot properly parse it anymore.

koppor commented 11 months ago

@systemoperator I fully agree. Then, my PR https://github.com/JabRef/jabref/pull/10436 does not work. I need to re-check. Sorry for that many iterations on this. I will also work on the text. I think, LaTeX Warning could be good.

On the one hand, we could classify the errors at https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/resources/uk/ac/ed/ph/snuggletex/core-error-messages.properties between warning and error.

On the other hand, always "Warning" feels currently right.

The issue with the missing } needs to be checked. JabRef currently drops too much entries if too much } are contained. Our parser needs to improved with that regard.

Siedlerchr commented 11 months ago

@systemoperator I was testing this again today and cannot reproduce your issue. For me, no error related to DOI is raised e..g DOI: 10.1007/978-3-642-01190-0_17

koppor commented 10 months ago

@Siedlerchr After three iterations, we managed. 😅🙈🎉🎉

systemoperator commented 10 months ago

I've checked it with my reference library, with more than 500 entries. Everything is printed properly now. :tada:

JabRef / jabref

Add integrity check for LaTeX special characters #8712