[bug] Fails to read Bioformats units with micro (µ) symbol

Antyos commented 1 year ago

Component PyDIP v3.4.0 with Bioformats installed.

Describe the bug When I try to open an .nd2 (part of Bioformats), I get the following error:

>>> import diplib as dip
>>> dip.ImageRead(R"N:\Data\Andrew\2023-07-14_Uniaxial-A-3T3\StaticGel-Timelapse.nd2")
Ill-formed Units string
in function: Units (D:\a\diplib\diplib\src\library\physical_dimensions.cpp at line number 261) while converting `┬╡m'
Ill-formed Units string
in function: Units (D:\a\diplib\diplib\src\library\physical_dimensions.cpp at line number 261) while converting `┬╡m'
Ill-formed Units string
in function: Units (D:\a\diplib\diplib\src\library\physical_dimensions.cpp at line number 261) while converting `┬╡m'
Exception in thread "main" java.lang.IllegalArgumentException: capacity < 0: (-2012217344 < 0)
        at java.base/java.nio.Buffer.createCapacityException(Buffer.java:279)
        at java.base/java.nio.Buffer.<init>(Buffer.java:242)
        at java.base/java.nio.ByteBuffer.<init>(ByteBuffer.java:288)
        at java.base/java.nio.ByteBuffer.<init>(ByteBuffer.java:296)
        at java.base/java.nio.MappedByteBuffer.<init>(MappedByteBuffer.java:112)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:170)
        at org.diplib.Image.Origin(Native Method)
        at org.diplib.Image.Origin(Image.java:135)
        at org.diplib.BioFormatsInterface.Read(BioFormatsInterface.java:115)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Reading JavaIO file: error calling interface class's Read method
in function: ImageReadJavaIO (D:\a\diplib\diplib\javaio\src\javaio.cpp at line number 140)
in function: ImageRead (D:\a\diplib\diplib\include\diplib/simple_file_io.h at line number 103)

It seems to me that this is an issue with encoding. As noted in this Stackoverflow answer,

print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')

Here, the string is a byte string. Because the encoding of your source file is UTF-8, the bytes are

'abcd k\xce\xa9 \xe2\x98\xa0 \xc2\xb0C \xe2\x88\x9aHz \xc2\xb5F \xc3\xbc \xe2\x98\x83 \xe2\x99\xa5'

The print statement writes these bytes to the console as-is. But the Windows console interprets byte strings as being encoded in the "OEM" code page, which in the US is 437. So the string you actually see on your screen is

abcd k╬⌐ Γÿá ┬░C ΓêÜHz ┬╡F ├╝ Γÿâ ΓÖÑ

Where the characters ┬╡ correspond to µ. Using Code Page 437, ┬╡ translate to C2B5 in hex, which leads me to believe that it is specifically the "micro" symbol (U+00B5) rather than the Greek mu, (U+03BC), despite the two often using the same glyph in certain fonts. Note that both glyphs should be recognized as the "micro" prefix..

I do not know enough about Bioformats to know if/how they specify encoding, or if all supported file types use UTF-8.

Also, why do all the JavaIO paths point to D:\a\diplib\...? I don't even have a D: drive connected.

To Reproduce Hopefully the above description is adequate. I will try to find a sample .nd2 file that I can share here for testing (that file in particular happens to be about 50gb)

System information:

OS: Windows 10
Python version: 3.11.0

crisluengo commented 1 year ago

Is this with the official build on PyPI, or did you build this yourself?

The Windows build on PyPI has Unicode disabled because Windows doesn’t play nice with UTF-8. I’m wondering why it’s trying to use a UTF-8 encoded string.

The file names in the error messages refer to the source code file names as they were when the library was built. If you downloaded binaries, they won’t match file names on your system. You can ignore that.

Antyos commented 1 year ago

Is this with the official build on PyPI, or did you build this yourself?

This was with the official build on PyPI.

The Windows build on PyPI has Unicode disabled because Windows doesn’t play nice with UTF-8. I’m wondering why it’s trying to use a UTF-8 encoded string.

Do you mean why the the library or the .nd2 file is trying to use UTF-8?

The file names in the error messages refer to the source code file names as they were when the library was built. If you downloaded binaries, they won’t match file names on your system. You can ignore that.

That clears up my confusion, thanks.

crisluengo commented 1 year ago

@wcaarls It seems that BioFormats produces the UTF-8 encoded string with the micro symbol in it. This will work correctly on Linux and macOS, but under Windows is not parsed correctly because we disable the Unicode parsing (which used to break with MSVC when I tested it, maybe newer versions work?).

So I guess javaio needs to somehow catch this case in the Java component (because we can’t do it in C++ on Windows), and replace it with a 'u'. Can this be done in Java? Does Java handle UTF-8 properly on Windows?

Antyos commented 1 year ago

I will defer to your judgement about any bugs regarding C++, MSVC, and UTF-8. Though for whatever it may be worth, I ran across this StackOverflow answer which notes at the end that u8"" strings which were buggy in MSVC 19.14 work in MSVC 19.22.27905.

If Java can't do UTF-8 either, you could always hard code the character replacement of the false decoding of the two characters into a 'u', though I imagine that is far from an ideal solution.

wcaarls commented 1 year ago

I think it should be possible to do, say in PhysicalQuantity.java, but I'm not sure that is the breaking bug here. If the units are unknown, they should be set to the default. I'm suspecting the image is very large, and there is a signed 32-bit overflow somewhere. Perhaps in BioFormatsInterface.java:65 (int should be long)?

@Antyos Does the loading still crash for a smaller image?

Antyos commented 1 year ago

I am able to open up some .nd2 files, but I was able to repeat the issue with this 5mb one.

Representative Image - 100um.zip

wcaarls commented 1 year ago

With Python 3.11 and the latest diplib from PyPI, I can open the attached file without crashing. The units are wrong, but apart from that the file loads correctly. Does it throw a RuntimeError for you?

Antyos commented 1 year ago

I updated to Python 3.11.4 and it still displays the "Ill-formed Units string" warning, but it does load that sample image otherwise. However, trying to open a much larger file (~52Gb) still raises the same RuntimeError. I suspect you are right that there is likely a signed 32-bit int overflow in the file size.

It looks like there are actually two issues here. Should we split this issue thread?

crisluengo commented 1 year ago

@wcaarls Have you been able to figure out anything else about this 32-bit overflow issue?

I was just looking at the bio-formats documentation, and saw that there's a lot of int in function declarations, for example:

public int getSizeX()

This issue might just be a limitation of bio-formats, with little we can do to solve it?

BioFormatsInterface.java:65 loops from 0 to 5 (the max 5 image dimensions), I don't think there's any overflow there. :) There's also an int on line 117, which loops over reader.getImageCount(). That is also not going to overflow (and getImageCount returns an int also).

Oh. It seems that Java uses int for array sizes and indexing, so arrays cannot hold more than 2^31-1 elements. So this would be a limitation in the language, and not a problem specific to bio-formats (the only thing I blame the authors for is choosing to use Java to build their awesome library).

I guess I'll have to add a warning to the documentation about the maximum image size that can be read through bio-formats.

wcaarls commented 1 year ago

I haven't looked at it yet. In line 65, oo is fine, but stride can become very large.

crisluengo commented 1 year ago

At least for TIFF files, bio-formats has a test to ensure the image sizes fit in an int:

import diplib as dip
data = dip.Image((2**31 + 1000, 1), dt="UINT8")
dip.ImageWriteTIFF(data, "test.tif")
dip.ImageRead("test.tif", "bioformats")

produces:

Exception in thread "main" loci.formats.FormatException: Sorry, ImageWidth > 2147483647 is not supported.
    at loci.formats.tiff.IFD.getImageWidth(IFD.java:539)
    at loci.formats.in.MinimalTiffReader.initFile(MinimalTiffReader.java:490)
    at loci.formats.in.BaseTiffReader.initFile(BaseTiffReader.java:609)
    at loci.formats.FormatReader.setId(FormatReader.java:1443)
    at loci.formats.in.TiffDelegateReader.setId(TiffDelegateReader.java:92)
    at loci.formats.ImageReader.setId(ImageReader.java:849)
    at org.diplib.BioFormatsInterface.Read(BioFormatsInterface.java:45)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/diplib/__init__.py", line 200, in ImageRead
    return javaio_imageread(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Reading JavaIO file: error calling interface class's Read method
in function: dip::FileInformation dip::javaio::ImageReadJavaIO(dip::Image&, const dip::String&, const dip::String&) (/Users/cris/src/diplib/javaio/src/javaio.cpp at line number 140)
in function: dip::FileInformation dip::ImageRead(Image&, const String&, String) (/Users/cris/src/diplib/include/diplib/simple_file_io.h at line number 120)

I don't know how to create a large nd2 file.

crisluengo commented 1 year ago

Hummm... With an ICS file you get the error message

Exception in thread "main" java.lang.IllegalArgumentException: 0 must be non-null and strictly positive.

That's... not very informative. I guess each file format in bio-formats is a separate bit of code needing to do its own checks. And not all of them do that properly.

wcaarls commented 1 year ago

The 32-bit limit could perhaps be circumvented by creating multiple ByteByffers, with different offsets from Image.Origin, but that requires some more work...

wcaarls commented 1 year ago

@crisluengo I committed a change that should allow loading large files. However, you need to make sure a single plane is not too large, otherwise Java runs out of heap space. Please test!

crisluengo commented 1 year ago

@wcaarls I tried this:

import diplib as dip
data = dip.Image((2**14, 2**14, 10), dt="UINT8")
dip.ImageWriteICS(data, "test.ics")
dip.ImageRead("test.ics", "bioformats")

Before your commit it raised:

Exception in thread "main" java.lang.IllegalArgumentException: capacity < 0: (-1610612736 < 0)

After the commit it succeeds.

Next, I repeated the same, but with random values in data, and verified that the image read back is identical. It is!

import diplib as dip
data = dip.Image((2**14, 2**14, 10), dt="UINT8")
data = dip.UniformNoise(data, lowerBound=0, upperBound=255)
dip.ImageWriteICS(data, "test.ics")
data2 = dip.ImageRead("test.ics", "bioformats")
print(dip.All(data == data2)[0][0])

crisluengo commented 1 year ago

So I'm guessing we now require that the image has each dimension < 2³¹-1, and additionally each x-y plane smaller than that number as well. I will update the documentation to reflect this.

Thank you!

crisluengo commented 1 year ago

Let's not lose track of the µ issue on Windows.

If Java uses Unicode strings (I presume it does), then this is just a matter of replacing all occurrences of "µ" to "u" in the units string. I guess PhysicalQuantity.java:30 is the easy place to do this.

What's Java's function for character replacement?

DIPlib / diplib

[bug] Fails to read Bioformats units with micro (µ) symbol #133