JabRef / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

BOM now missing at beginning of bibliography file -- causes JabRef to not recognize existing library #9496

Open andrewhw opened 1 year ago

andrewhw commented 1 year ago

JabRef version

5.8 (latest release)

Operating system

macOS

Details on version and operating system

Darwin daphne.local 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64

Checked with the latest development build

Steps to reproduce the behaviour

  1. Begin with an existing bibliography file
  2. Update to the newest JabRef
  3. Save the database (no edits required)
  4. You will likely get a warning that "the library has been modified by another program". This is not actually true. Dismiss changes.
  5. Examine the bibliography file using a text editor. The BOM (the bytes at the beginning of the file forming the Byte Order Mark) is now missing.
  6. Reopening the file using JabRef will now cause a "no content in table" error after opening.

Note that if you reestablish the BOM using an external editor and then open the file again using JabRef, all is well until the bibliography is saved again.

Note that this may be apparent only on my machine because it has an ARM processor, so the error may not be reproducible on an older Mac with an Intel processor.

The underlying problem is simply that the BOM is no longer written when the file is saved. Writing the BOM again (as older JabRef versions did) will fix the problem.

Siedlerchr commented 1 year ago

Thanks for reporting. Does the bib file include a header line of the form "% Encoding: <encoding>"? In general, JabRef tries to detect the encoding when reading and will write plain UTF-8 if no header line is present. Additionally, could you please provide the bib file so that we can debug this? You can also send it privately to web@jabref.org

andrewhw commented 1 year ago

Yes, the bib file does include a % Encoding line. It now reads "% Encoding: UTF-16BE"; however, at my previous update (when the BOM was still written) the line read "% Encoding: UTF-16" (that is, without the "BE").
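One possible explanation for the header change going hand in hand with the missing BOM is how Java's standard charsets behave: the generic "UTF-16" encoder writes a big-endian BOM, while the explicit "UTF-16BE" encoder writes none. The sketch below only demonstrates that standard-library behaviour. Whether JabRef selects its writer charset this way is an assumption, and the class name `Utf16BomDemo` is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Formats bytes as space-separated upper-case hex, e.g. "FE FF 00 25".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "%";
        // Java's generic UTF-16 encoder emits a big-endian BOM before the text.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 25
        // The explicit UTF-16BE encoder emits the same code units without a BOM.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 25
    }
}
```

If a writer is opened with the charset named in the header, switching that name from "UTF-16" to "UTF-16BE" would be enough to drop the two leading bytes.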

I have attached two bib files. The first, "tiny-1-withBOM.bib", works fine and can be read successfully by JabRef. If, however, you read this file and save it, the result will match "tiny-2-noBOM.bib". The only difference between the files is that the second one is missing the two BOM bytes (0xFE 0xFF) that precede the '%' beginning the header proper.

Thanks for looking into this.

tiny-bib-example.zip

andrewhw commented 1 year ago

I just looked up what "UTF-16BE" means: the "BE" part flags that the file is "big endian".

The problem with this, in this context, is that the endianness of the file is required in order to correctly parse the 16-bit characters of the file, so without the BOM the "first" character (the "%" sign) will get loaded as character 0x2500 ("Box drawings light horizontal") rather than as 0x0025 ("percent").

The "% Encoding" strategy works well for UTF-8 as it is an orderless encoding (one byte processed at a time), but UTF-16 requires the byte order to be known before any characters are parsed at all.
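To make the byte-order point concrete, here is a minimal sketch that decodes the same two bytes (0x00 0x25) with both byte orders using Java's standard charsets. The class and method names are hypothetical and chosen only for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class EndiannessDemo {
    // Decodes the bytes as UTF-16 with the given byte order
    // and returns the first code point of the result.
    public static int firstCodePoint(byte[] bytes, boolean bigEndian) {
        String decoded = new String(
                bytes,
                bigEndian ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // The first character of a big-endian UTF-16 bib file: '%' stored as 00 25.
        byte[] percent = {0x00, 0x25};
        System.out.printf("BE: U+%04X%n", firstCodePoint(percent, true));  // BE: U+0025
        System.out.printf("LE: U+%04X%n", firstCodePoint(percent, false)); // LE: U+2500
    }
}
```

Read with the wrong byte order, the "% Encoding" header never appears as text at all, so it cannot be used to work out the byte order in the first place.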

Not sure if this helps, or if this is already obvious to everyone. Sorry if I am over-explaining.

Siedlerchr commented 1 year ago

Thanks for the additional information. For reference, we have been down that rabbit hole in https://github.com/JabRef/jabref/pull/8947 and https://github.com/unicode-org/icu/pull/2127

andrewhw commented 1 year ago

Great – thanks for letting me know!

andrewhw commented 1 year ago

In light of the examples in the linked threads, it may be helpful to show the raw byte encodings in the files. I have shown them here with hexdump(1) and od(1) ("octal dump"); both are command-line tools available under Linux and macOS.

(Attached image: byte-encodings-UTF-16-big-endian)

Note the two bytes forming the BOM (0xFE 0xFF) shown prior to the two-byte sequence (<nul>-'%') forming the first readable Unicode character of the file.
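The same check can be done programmatically. The following is a minimal sketch that inspects the first bytes of a file for a UTF-8 or UTF-16 BOM. The class name `BomSniffer` and its return labels are made up for this example, and UTF-32 BOMs are deliberately not handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomSniffer {
    // Returns a label for the BOM found at the start of 'head', or "none".
    // Note: FF FE 00 00 (the UTF-32LE BOM) is reported as "UTF-16LE" here.
    public static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomSniffer <file>");
            return;
        }
        // Only the first three bytes are needed to recognise these BOMs.
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(detect(in.readNBytes(3)));
        }
    }
}
```

Run against the two attached files, this should distinguish the file that starts with FE FF from the one that starts directly with 00 25.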

koppor commented 1 year ago

Could you try the latest development version?

I think this is a duplicate of https://github.com/JabRef/jabref/issues/9926, which was fixed recently.

andrewhw commented 1 year ago

Hi Oliver,

As a side note, I tried getting the latest development version for MacOSX using the .dmg file and the resulting application as installed was corrupted. I installed the .gz version and it is fine.

Having installed the .gz version, I think that the issue is fixed?

The current behaviour seems to be that it reads UTF16BE files if they have a BOM, but the ones that it previously created without the BOM (that I would argue can be seen as invalid) are broken.

This will orphan anyone who used the previous version with UTF16BE files before the last major release, and they will need to update their files externally -- as long as everyone understands that, I think we are all on the same page.
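For anyone in that situation, one way to update a file externally is to prepend the two BOM bytes. The sketch below assumes the input really is big-endian UTF-16 without a BOM; it does not verify that, and the class name `PrependBom` is hypothetical. It writes to a separate output path so the original file is left untouched.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PrependBom {
    // Returns 'content' with the big-endian BOM (FE FF) prepended,
    // or the same array unchanged if it already starts with that BOM.
    public static byte[] withBigEndianBom(byte[] content) {
        boolean hasBom = content.length >= 2
                && (content[0] & 0xFF) == 0xFE
                && (content[1] & 0xFF) == 0xFF;
        if (hasBom) {
            return content;
        }
        byte[] out = new byte[content.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(content, 0, out, 2, content.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PrependBom <input.bib> <output.bib>");
            return;
        }
        byte[] original = Files.readAllBytes(Path.of(args[0]));
        Files.write(Path.of(args[1]), withBigEndianBom(original));
    }
}
```

Keeping a backup of the library before replacing it with the repaired copy is advisable, since the byte order is assumed rather than detected.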

Thanks for getting this fixed.

Andrew

koppor commented 1 year ago

Regarding the Mac OS X bug, there is a workaround: https://github.com/JabRef/jabref/issues/9553

koppor commented 1 year ago

Note to us: There was a fix on May 20 (https://github.com/JabRef/jabref/pull/9927), but the comment from June says that some files can be broken. We need

andrewhw commented 1 year ago

If you are referring to the files I uploaded in the tiny-bib-example.zip file on Dec 24, 2022 above, then the test cases are simply this:

Expected behaviour (as far as I understand it):

Is that what you need?

andrewhw commented 1 year ago

If it is helpful, here are the "tiny" files in both big and little endian formats, with and without BOM markers.

tiny-bib-example-endian-and-BOM-combinations.zip