Open ThiloteE opened 2 years ago
The workflow for users is as follows:
@Siedlerchr @koppor With this new feature, I receive now tons of LaTeX Parsing Errors, when using "Check integrity", which irritated me for a while and made me think:
Field: File
Message: LaTeX Parsing Error: Command \: cannot be used in PARAG...
This results because of using the JabRef Browser Extension, which adds linked files as follows:
file = {Full Text PDF:https\://www.jair.org/index.php/jair/article/download/11675/26513:application/pdf;Snapshot:https\://www.jair.org/index.php/jair/article/view/11675:text/html;:Jauhiainen2019 - Automatic Language Identification in Texts_ a Survey.pdf:PDF},
@systemoperator Thank you for letting us know! CC @Zylence
@systemoperator A new version with a fix be available in 30min at https://github.com/JabRef/jabref/pull/10401. There should then be a comment appearing linking to the binaries.
@systemoperator A new version with a fix be available in 30min at #10401. There should then be a comment appearing linking to the binaries.
Awesome! I'm gonna check this out.
This results because of using the JabRef Browser Extension
I am terribly sorry, I did not check for this usecase. ( Honestly I only tested manual input. 😐)
No worries at all - and thank @systemoperator for the quick report!
Background: This is one consequence of our "merge early" development strategy. Instead of asking users to try out PRs, we "just" merge and fix ASAP if something goes wrong. We get more velocity into JabRef - and I think, this good for the tool as whole. -- Otherwise, we had many PRs piling up, because some details were missing. Then, they got abandoned and never got revived. OK, most of the non-merged PRs were 20% solutions. These currently pile up at https://github.com/koppor/jabref/pulls
@koppor The mentioned error has gone now. :)
I've noticed other things for imported entries, using the JabRef Browser Extension:
For DOI fields:
LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode
Examples: 10.1007/0-387-34805-0_39, 10.1007/3-540-49264-X_24
For Comment, Comment-username, Abstract, Title and Journaltitle fields:
LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode
LaTeX Parsing Error: Undefined command \textgreater
LaTeX Parsing Error: Undefined command \textbackslash
LaTeX Parsing Error: Undefined command \textbar
LaTeX Parsing Error: Undefined command \textless
TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.
Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?
Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?
Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?
We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...
Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?
In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores. e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021
Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?
In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores. e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021
In that case it would be expected to fail, as _ and ^ are only allowed in latex math environments.
TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.
Sure. It's nothing, which breaks something, it's just informative.
Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?
Or only allowing URL characters. (maybe excluding &, ? and #?)
Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?
I guess, there would be more commands to consider, which can show up.
We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...
Concerning the title, some authors use extraordinary characters there. :thinking:
LaTeX Parsing Error: Undefined command \textgreater LaTeX Parsing Error: Undefined command \textbackslash LaTeX Parsing Error: Undefined command \textbar LaTeX Parsing Error: Undefined command \textless
I guess, there would be more commands to consider, which can show up.
The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: https://github.com/koppor/jabref/pull/646#issue-1714138154 (takeaways section)
The latter is probably faster.
I have pasted URLs into the URL field, where also some parameters exist, which are concatenated with the &.
Resulting in: Found 2 unescaped '&'.
The same applies to the File field, where the JabRef Browser Extension sends the Links, which can also contain "&" characters.
Resulting in: Found 2 unescaped '&'.
Examples:
We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...
Ideas ranked from most to least favorite:
Can not really come up with a better idea then the one proposed first. May need a little more time to think.
Found 2 unescaped '&'
Okay, but that's the "fault" of the AmpersandChecker which was introduced earlier.
~As the problem seems to be predominantly induced by urls, url-encoding as solution now ranks higher in my opinion. We could intercept paste and additionally warn the user.~
~So for example:~ ~> https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=948091115~
~becomes:~ ~https%3A%2F%2Fen.wikipedia.org%2Fw%2Findex.php%3Ftitle%3DInformation%5fextract%0Aion%26oldid%3D948091115~
~which is manageable.~
[Edit]: Just realized that won't make much sense for that case, since the ampersands are not part of the queryparameter.
Please don't mess with url fields. Rather exclude them. As far as I know url is also a verbatim field
Julian Kirsch @.***> schrieb am Fr., 22. Sept. 2023, 01:45:
Found 2 unescaped '&'
Okay, but that's the "fault" of the AmpersandChecker which was introduced earlier https://github.com/JabRef/jabref/pull/9758.
As the problem seems to be predominantly induced by urls, url-encoding now ranks higher in my opinion. We could intercept paste and additionally warn the user.
So for example:
https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=948091115
becomes: https%3A%2F%2Fen.wikipedia.org %2Fw%2Findex.php%3Ftitle%3DInformation%5fextract%0Aion%26oldid%3D948091115
which is manageable.
— Reply to this email directly, view it on GitHub https://github.com/JabRef/jabref/issues/8712#issuecomment-1730500851, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACOFZHM65TJEOFXUUIARODX3TGSFANCNFSM5UCQBSGQ . You are receiving this because you were mentioned.Message ID: @.***>
Please don't mess with url fields. Rather exclude them. As far as I know url is also a verbatim field
Then the ampersand checker needs to be updated to exclude verbatim fields. #quickwin
The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: koppor#646 (comment) (takeaways section)
As usual, I would do both ^^ -- PR to https://github.com/davemckain/snuggletex (reference is more for me)
However, I am not 100% sure about the error "Undefined command". We can never be package-complete. On the one hand, I assume the sources come from publishers, which "should" use correct commands. On the other hand, we should add the most common commands to avoid "typos" (\textsupsrcipt
vs. \textsuperscript
(correct)).
@koppor I've checked the latest build.
I would recommend, allowing underscores in DOI
fields, since they are not infrequent:
In the comment-{username}
field, I would recommend allowing any character.
Furthermore, in my opinion, it would make sense to adapt the messages in the integrity check so that "LaTeX Parsing Error" is only printed, if there is REALLY some LaTeX parsing error. (Initially, I thought, something is broken in my file, but actually, everything was fine.) I think, a message like e.g. "LaTeX Parsing Warning" or something different (like "LaTeX Inconsistency", or even "LaTeX Improvement") would be more appropriate, so that a user does not panic, when reading it. Basically any message in the integrity check suggests improvements in terms of inconsistencies, without anything being broken. (Probably, if something is really broken, like a wrongly inserted "}" or "{", in the bib-file, I guess, even Jabref cannot properly parse it anymore.
@systemoperator I fully agree. Then, my PR https://github.com/JabRef/jabref/pull/10436 does not work. I need to re-check. Sorry for that many iterations on this. I will also work on the text. I think, LaTeX Warning
could be good.
On the one hand, we could classify the errors at https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/resources/uk/ac/ed/ph/snuggletex/core-error-messages.properties between warning and error.
On the other hand, always "Warning" feels currently right.
The issue with the missing }
needs to be checked. JabRef currently drops too much entries if too much }
are contained. Our parser needs to improved with that regard.
@systemoperator I was testing this again today and cannot reproduce your issue. For me, no error related to DOI is raised e..g DOI: 10.1007/978-3-642-01190-0_17
@Siedlerchr After three iterations, we managed. 😅🙈🎉🎉
I've checked it with my reference library, with more than 500 entries. Everything is printed properly now. :tada:
This issue was inspired by inspecting and working on issues #8673, https://github.com/plurimath/unicode2latex/issues/19, #8650, #8682 and #3644
Problem:
Desired solution:
Create integrity check(s) for having the following symbols in your entries:
From Table 1: LaTeX escapable "special" characters:
[ ]
%
,$
,_
,#
and&
Not necessary: Detect
\%
,\$
,\_
,\#
and\&
, because this should be handled at text paste.From Table 2: Predefined LATEX 2𝜀 Text-mode Commands:
Table 3: Commands Defined to Work in Both Math and Text Mode:
...
- PS. not yet sure about this one. Maybe the...
should be added to the Unicode2LaTeX converter instead. Please Exclude this character for now. Work on the others first!_
and\_
), so no need to implement an integrity check for them! :-)Additional info: