JabRef / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Finding files with non-alphabetical characters in title #4342

Closed tb77phd closed 3 years ago

tb77phd commented 5 years ago

JabRef version 4.3.1 on Windows 10 64-bit (I don't have administrative rights to install developer snapshots, e.g. version 5.0)

I like to use regular expression search when linking local files and use the following: **/.*[title].*\\.[extension]

As far as I know, that should search in the directory and subdirectories of the main file directory for files that contain the title and the proper extension (.pdf). Contain, right? The file name does not need to start with the title.

However, choosing an article in my bibliography and pressing F7 does nothing (the status bar simply reports that it finished and found nothing).

What does F7 (Automatically set file links) do? I've tested with many files and get the same result, even when making sure the file name contains no special characters or anything like that.

tb77phd commented 5 years ago

Okay, after a bit more playing around, it seems that a regular dash is a character that makes F7 fail. In fact, it seems characters in the title (in the file name) that are not regular letters make F7 fail.


Example 1 File: BEEMAN & PINCUS - PHYS REV 1968 - Nuclear Spin-Lattice Relaxation in Magnetic Insulators.pdf

Title in JabRef: Nuclear Spin-Lattice Relaxation in Magnetic Insulators

F7 does not find the file, unless I change the filename to BEEMAN & PINCUS - PHYS REV 1968 - Nuclear Spin Lattice Relaxation in Magnetic Insulators.pdf. (Removing the dash in "Spin-Lattice".)


Example 2 File: GUO ea - INORG CHEM COMMUN 2010 - Ferroelectric Metal Organic Framework (MOF).pdf

Title in JabRef: Ferroelectric Metal Organic Framework (MOF)

F7 does not find the file, unless I change the filename to GUO ea - INORG CHEM COMMUN 2010 - Ferroelectric Metal Organic Framework MOF.pdf. (Removing the parentheses around "MOF".)


Note: in both cases, I did not need to change the title in JabRef, only in the filename.

jonasstein commented 5 years ago

Could you check, if it was really a -? Perhaps it was one of the UTF-8 symbols similar to -.

tb77phd commented 5 years ago

I did by removing it and then retyping with the regular dash. No change. F7 only linked when I removed the dash completely. And in my second example, there is no dash in the title, but parentheses.

tb77phd commented 5 years ago

I installed the 5.0-dev--snapshot on my personal computer (also Win10-64) and repeated the process. Same thing happens. F7 can't find the file unless I remove the dash (or the parentheses in my second example).

tb77phd commented 5 years ago

Out of curiosity, I tested linking by the bibtexkey instead, i.e., using **/.*[bibtexkey].*\\.[extension] as the regular expression. I chose to include dashes in my bibtexkeys, so the first example above got the key Beeman-PR-1968, and I also renamed the local PDF file to this (Beeman-PR-1968.pdf).

No issue in finding the file this time when invoking "Automatically set file links" (F7) even though it contains a dash.

So... why can't a title search handle dashes and other characters but a bibtexkey search can?

tb77phd commented 3 years ago

An update to this two-year-old issue:

I have recently come back to JabRef and this issue still persisted in 5.1. Since I had over 1000 unlinked files, I decided to revert to an old version to try things.

Recap of issue: I use the Regular Expression **/.*[title].*\\.[extension] to link files. In the versions since September 2018, JabRef could not find files that contained non-alphabetical (and non-numerical? not sure...) characters in the title. See examples above. Even a simple dash could not be handled.

I removed 5.1 and found and installed JabRef version 3.6.

After marking all of my over 2000 entries and hitting F7 (Automatically set file links), it worked like a charm! Suddenly, my list of over 1000 unlinked files was down to fewer than 200! The remaining unlinked files were easily fixed (e.g., the title has a colon but the filename has a semicolon since Windows doesn't allow colons in filenames, or the filename was truncated due to a very long title, stuff like that).

I don't know what changed between version 3.6 and (presumably) version 4.x to hinder a regular expression search containing non-alphabetical characters in the title (it didn't have an issue with such characters in the bibtexkey then but I didn't test it now).

Now that my entire library is linked, I'll happily come back to the latest version 5.2.

I'll change the title of this issue to a better description.

koppor commented 3 years ago

May I ask whether there is test data for this somewhere?

Maybe, three BibTeX entries and three unlinked files.

tb77phd commented 3 years ago

bib_pdfs.zip I hope uploading a ZIP with PDFs and a .bib file works.

One of the files (YOUNG ea ...) has numbers in the title. I tried F7 with it and it worked, so numbers should not be an issue.

Siedlerchr commented 3 years ago

@tb77phd I could reproduce the issue, and it seems it's done on purpose. The "dash" character, additional whitespace, and other unwanted LaTeX characters are filtered out of the title here: https://github.com/JabRef/jabref/blob/e955c46850e4738830198a889c3f09870b88ab30/src/main/java/org/jabref/logic/citationkeypattern/BracketedPattern.java#L570-L573
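To illustrate why a normalized title can miss a file, here is a minimal Java sketch. It assumes, based only on the examples reported earlier in this thread, that the `[title]` expansion turns hyphens into spaces and collapses whitespace. `expandTitle` and `fileContainsTitle` are hypothetical helpers, not JabRef's actual code.

```java
class TitleExpansionDemo {
    // Hypothetical stand-in for the [title] expansion (an assumption, not JabRef code):
    // hyphens become spaces and runs of whitespace are collapsed.
    static String expandTitle(String title) {
        return title.replace('-', ' ').replaceAll("\\s+", " ").trim();
    }

    // Simplified "file name contains the expanded title and ends in .pdf" check.
    static boolean fileContainsTitle(String expandedTitle, String fileName) {
        return fileName.endsWith(".pdf") && fileName.contains(expandedTitle);
    }

    public static void main(String[] args) {
        String expanded = expandTitle("Nuclear Spin-Lattice Relaxation in Magnetic Insulators");
        String original = "BEEMAN & PINCUS - PHYS REV 1968 - Nuclear Spin-Lattice Relaxation in Magnetic Insulators.pdf";
        String renamed = "BEEMAN & PINCUS - PHYS REV 1968 - Nuclear Spin Lattice Relaxation in Magnetic Insulators.pdf";
        System.out.println(fileContainsTitle(expanded, original)); // false: the file still has the hyphen
        System.out.println(fileContainsTitle(expanded, renamed));  // true: the hyphen was removed from the file name
    }
}
```

Under that assumption, the search term no longer contains the hyphen, so only the renamed file can match.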

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

Is the issue that the title is needed unchanged? As @Siedlerchr says, [title] makes modifications to the title, [TITLE] should not, so perhaps that works? (note that it does not "resolve" latex)

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

Actually, would [TITLE:latex_to_unicode:regex(":",";")] work?
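As a rough illustration of what the `regex(":",";")` part of that pattern is meant to achieve, the sketch below swaps colons for semicolons, matching the colon/semicolon file names mentioned earlier. `applyRegexModifier` is a hypothetical stand-in that is assumed to behave like `String.replaceAll`, and the sample title is made up.

```java
class ColonReplacementDemo {
    // Hypothetical stand-in for a regex(pattern, replacement) modifier;
    // assumed here to behave like String.replaceAll.
    static String applyRegexModifier(String value, String pattern, String replacement) {
        return value.replaceAll(pattern, replacement);
    }

    public static void main(String[] args) {
        String title = "Spin waves: theory and experiment"; // made-up title
        System.out.println(applyRegexModifier(title, ":", ";"));
        // prints "Spin waves; theory and experiment"
    }
}
```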

tb77phd commented 3 years ago

Huh, I was not aware of a difference between [title] and [TITLE]. I tried right now and it worked when the issue was a dash in the filename. It did not work when the issue was parentheses. Thanks for the suggestion!

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

🤦 That does make sense, I missed that. If I am not mistaken, the expanded bracket will be interpreted as a regexp, so parentheses will be seen as a regexp capturing group, and you probably won't get any result at all if they are unmatched.

~In your current version you might be able to solve it using either "\\Q...\\E" or "\Q...\E" (match the content between literally instead of as a regexp), e.g., \\Q[TITLE:latex_to_unicode:regex(":",";")]\\E, see https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/util/regex/Pattern.html#quote for more details. Even if this workaround works (which I am not completely sure of),~ I believe this must be changed in the code. You can escape parentheses and brackets ()[] using the :regex("","") modifier, but I would not recommend it, as it will stop working as soon as this is fixed (a couple of days).
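The effect can be reproduced in plain Java. The sketch below is only an illustration of the regexp behavior, not of what JabRef does internally: `matchesRaw` splices the title into the pattern unescaped, while `matchesQuoted` wraps it with `Pattern.quote` so it is matched literally. Both helper names are hypothetical, and the `.pdf` extension is hard-coded for brevity.

```java
import java.util.regex.Pattern;

class TitleRegexDemo {
    // The title is inserted as-is, so "(MOF)" becomes a capturing group matching "MOF".
    static boolean matchesRaw(String title, String fileName) {
        return Pattern.compile(".*" + title + ".*\\.pdf").matcher(fileName).matches();
    }

    // The title is quoted, so every character in it is matched literally.
    static boolean matchesQuoted(String title, String fileName) {
        return Pattern.compile(".*" + Pattern.quote(title) + ".*\\.pdf").matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String title = "Ferroelectric Metal Organic Framework (MOF)";
        String withParens = "GUO ea - INORG CHEM COMMUN 2010 - Ferroelectric Metal Organic Framework (MOF).pdf";
        String withoutParens = "GUO ea - INORG CHEM COMMUN 2010 - Ferroelectric Metal Organic Framework MOF.pdf";

        System.out.println(matchesRaw(title, withParens));    // false
        System.out.println(matchesRaw(title, withoutParens)); // true
        System.out.println(matchesQuoted(title, withParens)); // true
    }
}
```

This matches the report: with an unescaped title, only the file name without the parentheses is found.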

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

> Huh, I was not aware of a difference between [title] and [TITLE]

I am not sure if it was supposed to be a difference, but it is. And I am trying to make it official by sneakily making people use it 🤫

koppor commented 3 years ago

> > Huh, I was not aware of a difference between [title] and [TITLE]
>
> I am not sure if it was supposed to be a difference, but it is.

This is NOT in our documentation. See https://docs.jabref.org/setup/citationkeypatterns.

> A field marker generally consists of the field name enclosed in square braces, e.g., [title]

I know at https://docs.jabref.org/setup/citationkeypatterns#bibentry-fields, we use upper case letters. IMHO this is wrong. It should be [title], [date] there, too.

The difference between title and TITLE in the implementation comes from the following feature:

https://docs.jabref.org/setup/citationkeypatterns#title-related-field-markers

> [title]: Capitalize all the significant words of the title, and concatenate them. For example, An awesome paper on JabRef becomes AnAwesomePaperonJabref
>
> [fulltitle]: The title with unchanged capitalization.

This was introduced before https://github.com/JabRef/jabref/pull/3670. Via https://github.com/JabRef/jabref/pull/3238, I could track some hint at b1a9593f1d7ae969f66dc5dde18db4d878676170.

> To satisfy the old behaviour without breaking the makeLabel code, it was necessary to introduce the '[fulltitle]' field, which leaves the title unchanged, and to change the test for '[title]' expansion, which now removes hyphens ("-").

I could not find the original commit.

Nevertheless, this behavior is really odd.

I see a mixture of "special" field names, which should not appear in normal BibTeX. However, authors is very close to author in BibTeX.

Solution options

Option A: Maybe, we should convert all field markers to modifiers. We can keep the "old" behavior for compatibility reasons.

Option B: In case we do not want to change the whole behavior, we should solve the "overlapping" behavior at title/TITLE. We should rename title to sigtitle ("significant title"), use the current title magic.

Option C: In addition to Option B, we should remove all non-significant words. IMHO AnAwesomePaperonJabRef is wrong, because the capital letters do not help here.

Side notes

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

> I know at https://docs.jabref.org/setup/citationkeypatterns#bibentry-fields, we use upper case letters. IMHO this is wrong. It should be [title], [date] there, too.

Imo, the motivation for keeping the difference between title and TITLE is that a user doesn't necessarily expect, nor can deal with, a raw bibentry field. _I_ have tripped on the difference between [authors] and [author], and given how much time I have spent in that part of the source code, that is saying something. [authors] vs [AUTHOR] on the other hand.

  • It is not clear to me whether our documentation is wrong. I think "For an entry with the title An awesome paper on JabRef, the citation key pattern Title[title:abbr] will provide the key TitleAAPoJ." is wrong; it is [TITLE:abbr], isn't it?

> I could not find the original commit.

I believe it is https://github.com/JabRef/jabref/pull/2610

koppor commented 3 years ago

Summary: Documentation is wrong. Only upper case field names are used to access the fields of the Bibentry directly. Thus, I updated the documentation accordingly:

[Screenshot: updated documentation]

Long answer:

> > I know at docs.jabref.org/setup/citationkeypatterns#bibentry-fields, we use upper case letters. IMHO this is wrong. It should be [title], [date] there, too.
>
> Imo, the motivation for keeping the difference between title and TITLE is that a user doesn't necessarily expect, nor can deal with, a raw bibentry field. I have tripped on the difference between [authors] and [author], and given how much time I have spent in that part of the source code, that is saying something. [authors] vs [AUTHOR] on the other hand.

Therefore, I raised Option A in my last comment, which I would prefer for strong typing. (This probably also raised https://github.com/JabRef/jabref/pull/2610#issuecomment-285385425).

[Screenshot]

The case sensitivity of the field name drives me mad somehow. Since I don't see another solution, and I think the lower case fields are used more often than the plain ones, it is OK to go ahead.

Question: I can also use [year] as there is a fallback on the Bibentry plain fields, isn't it?

> I believe it is #2610

Yeah, that's it.

With my doc update, the related issue https://github.com/koppor/jabref/issues/237 can be closed.

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

> Question: I can also use [year] as there is a fallback on the Bibentry plain fields, isn't it?

Yes. The "special field markers" are matched case-sensitively, while "bibentry plain fields" are matched case-insensitively. What you are seeing with upper case fields is the fallback, because it does not match a "special field marker" (yes, it is ugly).
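A minimal Java sketch of that lookup order, under stated assumptions: the only special marker modeled is `title`, its normalization (hyphen to space) is a placeholder, and `resolve` is a hypothetical helper rather than the actual `BracketedPattern` code.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class FieldMarkerLookupDemo {
    // Hypothetical resolver: special markers are compared with exact case,
    // everything else falls back to a case-insensitive plain-field lookup.
    static Optional<String> resolve(String marker, Map<String, String> entry) {
        if (marker.equals("title")) {
            // Placeholder normalization for the special marker (an assumption).
            return Optional.ofNullable(entry.get("title")).map(t -> t.replace('-', ' '));
        }
        // Fallback: entry keys are stored lower case, so lower-casing the marker
        // makes the plain-field lookup case-insensitive.
        return Optional.ofNullable(entry.get(marker.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        Map<String, String> entry = Map.of(
                "title", "Nuclear Spin-Lattice Relaxation",
                "year", "1968");
        System.out.println(resolve("title", entry).orElse("")); // Nuclear Spin Lattice Relaxation
        System.out.println(resolve("TITLE", entry).orElse("")); // Nuclear Spin-Lattice Relaxation
        System.out.println(resolve("year", entry).orElse(""));  // 1968
        System.out.println(resolve("YEAR", entry).orElse(""));  // 1968
    }
}
```

In this sketch `[year]` and `[YEAR]` behave the same because no special marker named `year` exists, while `[title]` and `[TITLE]` diverge.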

> Only upper case field names are used to access the fields of the Bibentry directly. Thus, I updated the documentation accordingly:

My bad, and thank you for the update. I guess I wanted to keep the original vocabulary that a "special field marker" is a "field marker" even if it is not a bibentry field, just based on a bibentry field.

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

> Therefore, I raised Option A in my last comment, which I would prefer for strong typing. (This probably also raised #2610 (comment)).

I don't understand "Option A". Do I interpret you correctly that all "special field markers" should be (re)implemented as a "field marker" followed by modifiers?


> > Huh, I was not aware of a difference between [title] and [TITLE]
>
> I am not sure if it was supposed to be a difference, but it is. And I am trying to make it official by sneakily making people use it 🤫

IMHO, the current behavior isn't great, but I can't think of a better one. JabRef's purpose, in my opinion, is to hide the ugly truth of BibTeX/biblatex/latex from its users. Auto-generated filenames should be using latex-free Unicode whenever it is possible, and if the user doesn't want that, they should be able to access the text-fields themselves and do whatever they want with them. Therefore I view all lower-case field markers as "nicely behaved ones" and upper-case ones as "don't use unless you have to".

Long term, I'd like to break the behavior of [title], and replace it with [TITLE:latex_to_unicode:title_case] when it is used for file names. This will break the behavior of almost all the current bracketed patterns (because they don't resolve latex). I just haven't had time to submit any PR regarding this because my priorities regarding JabRef contributions are:

  1. Overleaf
  2. Update Groups for JavaFX
  3. Update BracketedPattern

koppor commented 3 years ago

> > Therefore, I raised Option A in my last comment, which I would prefer for strong typing. (This probably also raised #2610 (comment)).
>
> I don't understand "Option A". Do I interpret you correctly that all "special field markers" should be (re)implemented as a "field marker" followed by modifiers?

That's what I meant.

> IMHO, the current behavior isn't great, but I can't think of a better one. JabRef's purpose, in my opinion, is to hide the ugly truth of BibTeX/biblatex/latex from its users. Auto-generated filenames should be using latex-free Unicode whenever it is possible, and if the user doesn't want that, they should be able to access the text-fields themselves and do whatever they want with them. Therefore I view all lower-case field markers as "nicely behaved ones" and upper-case ones as "don't use unless you have to".

We should put your text in the documentation 👍. I could only write it shorter (but deleted it after reading your text).

> Long term, I'd like to break the behavior of [title], and replace it with [TITLE:latex_to_unicode:title_case] when it is used for file names.

Random thoughts on that: Maybe it will be difficult to maintain if it behaves differently than when used for BibTeX keys. However, I think it is good that BibTeX keys are not automatically Unicode, because of bibtex. Maybe the user has to use :unicode_to_latex somehow... This somehow refs: https://github.com/JabRef/jabref/issues/160

All in all: Go ahead :)

Nevertheless, I would like to discuss title vs. camel with you. As a programmer, I find it strange that "filler words" are just appended. If they were deleted, that would make sense. But just appended? Why not remove them in [title]? 😇

> 1. Overleaf

This is nearly done; I had almost finished it. Please investigate https://github.com/koppor/jabref/pull/445. The main thing I was working on is this comment: https://github.com/JabRef/jabref/pull/2866#issuecomment-388264343. We can surely have a chat about that (gitter, skype, ..., ?)

Siedlerchr commented 3 years ago

Count me in for ShareLaTeX/Overleaf, I can help as well. I wrote most of the code back in the day.

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

Regarding the bracketed patterns/citation keys etc., my take-away is that when I can find the time for it (not anytime soon), I'll open up a PR expanding on what I think is reasonable, why, and what I'd like to do about it. I don't think there's much more to say that does not lead to needless details that would be better suited to the context of an actual PR.

> Nevertheless, I would like to discuss title vs. camel with you. As a programmer, I find it strange that "filler words" are just appended. If they were deleted, that would make sense. But just appended? Why not remove them in [title]? 😇

I am not sure of what you are referring to. I believe both [title] and [camel] only remove things. Could you give an example or context?

> We should put your text in the documentation 👍. I could only write it shorter (but deleted it after reading your text).

Regarding default/ADVANCED field usage? In my opinion, shorter text tends to be better. 😛

koppor commented 3 years ago

> > Nevertheless, I would like to discuss title vs. camel with you. As a programmer, I find it strange that "filler words" are just appended. If they were deleted, that would make sense. But just appended? Why not remove them in [title]? 😇
>
> I am not sure of what you are referring to. I believe both [title] and [camel] only remove things. Could you give an example or context?

Sure. Do you know the term paperon?

> [camel]: Capitalize and concatenate all the words of the title. For example, An awesome paper on JabRef becomes AnAwesomePaperOnJabref
>
> [title]: Capitalize all the significant words of the title, and concatenate them. For example, An awesome paper on JabRef becomes AnAwesomePaperonJabref

I find AnAwesomePaperOnJabref much more readable. For me, AwesomePaperJabref would also be OK. But not AnAwesomePaperonJabref. What kind of word is Paperon?
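The three variants can be compared side by side with a small Java sketch. It reproduces the documented example outputs, but the word list in `SMALL_WORDS` and the rule that the first word is always capitalized are assumptions made for illustration; the method names are hypothetical and this is not JabRef's implementation.

```java
import java.util.Locale;
import java.util.Set;

class TitleMarkerDemo {
    // Assumed list of "non-significant" words; the real list may differ.
    static final Set<String> SMALL_WORDS = Set.of("a", "an", "the", "on", "of", "in", "and");

    static String capitalize(String word) {
        if (word.isEmpty()) {
            return word;
        }
        return word.substring(0, 1).toUpperCase(Locale.ROOT)
                + word.substring(1).toLowerCase(Locale.ROOT);
    }

    // [camel]-like: capitalize every word and concatenate.
    static String camel(String title) {
        StringBuilder sb = new StringBuilder();
        for (String word : title.trim().split("\\s+")) {
            sb.append(capitalize(word));
        }
        return sb.toString();
    }

    // [title]-like: capitalize the first word and significant words, keep the rest lower case.
    static String significantCapitalized(String title) {
        StringBuilder sb = new StringBuilder();
        String[] words = title.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            String lower = words[i].toLowerCase(Locale.ROOT);
            boolean significant = i == 0 || !SMALL_WORDS.contains(lower);
            sb.append(significant ? capitalize(words[i]) : lower);
        }
        return sb.toString();
    }

    // Alternative discussed in this thread: drop the non-significant words entirely.
    static String significantOnly(String title) {
        StringBuilder sb = new StringBuilder();
        for (String word : title.trim().split("\\s+")) {
            if (!SMALL_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                sb.append(capitalize(word));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String title = "An awesome paper on JabRef";
        System.out.println(camel(title));                  // AnAwesomePaperOnJabref
        System.out.println(significantCapitalized(title)); // AnAwesomePaperonJabref
        System.out.println(significantOnly(title));        // AwesomePaperJabref
    }
}
```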

k3KAW8Pnf7mkmdSMPHz27 commented 3 years ago

> What kind of word is Paperon?

Fair enough. We can look at this now or when/if I get time to address the bracketed pattern class. I don't really have a preference regarding it 😛

> For me, AwesomePaperJabref would also be OK

Perhaps one could change the documentation/default to use [shorttitle] for citation keys? Perhaps implement [shorttitleN] where N is the number of words, and the default is 3 to maintain compatibility with the current use?

> Nevertheless, I would like to discuss title vs. camel with you.

I am not very well versed in camel case/proper case/title case and their ilk. I think in this instance "proper case" would make us both happy-ish. It would become An Awesome Paper On Jabref when used to generate a file name, and since all spaces are removed for citation keys, it would automatically be converted to camel case. The issue would be honoring protective brackets, {JabRef}.