Closed Antyos closed 1 year ago
Is this with the official build on PyPI, or did you build this yourself?
The Windows build on PyPI has Unicode disabled because Windows doesn’t play nice with UTF-8. I’m wondering why it’s trying to use a UTF-8 encoded string.
The file names in the error messages refer to the source code file names as they were when the library was built. If you downloaded binaries, they won’t match file names on your system. You can ignore that.
Is this with the official build on PyPI, or did you build this yourself?
This was with the official build on PyPI.
The Windows build on PyPI has Unicode disabled because Windows doesn’t play nice with UTF-8. I’m wondering why it’s trying to use a UTF-8 encoded string.
Do you mean why the the library or the .nd2
file is trying to use UTF-8?
The file names in the error messages refer to the source code file names as they were when the library was built. If you downloaded binaries, they won’t match file names on your system. You can ignore that.
That clears up my confusion, thanks.
@wcaarls It seems that BioFormats produces the UTF-8 encoded string with the micro symbol in it. This will work correctly on Linux and macOS, but under Windows is not parsed correctly because we disable the Unicode parsing (which used to break with MSVC when I tested it, maybe newer versions work?).
So I guess javaio needs to somehow catch this case in the Java component (because we can’t do it in C++ on Windows), and replace it with a 'u'. Can this be done in Java? Does Java handle UTF-8 properly on Windows?
I will defer to your judgement about any bugs regarding C++, MSVC, and UTF-8. Though for whatever it may be worth, I ran across this StackOverflow answer which notes at the end that u8""
strings which were buggy in MSVC 19.14 work in MSVC 19.22.27905.
If Java can't do UTF-8 either, you could always hard code the character replacement of the false decoding of the two characters into a 'u'
, though I imagine that is far from an ideal solution.
I think it should be possible to do, say in PhysicalQuantity.java
, but I'm not sure that is the breaking bug here. If the units are unknown, they should be set to the default. I'm suspecting the image is very large, and there is a signed 32-bit overflow somewhere. Perhaps in BioFormatsInterface.java:65
(int should be long)?
@Antyos Does the loading still crash for a smaller image?
I am able to open up some .nd2
files, but I was able to repeat the issue with this 5mb one.
With Python 3.11 and the latest diplib from PyPI, I can open the attached file without crashing. The units are wrong, but apart from that the file loads correctly. Does it throw a RuntimeError
for you?
I updated to Python 3.11.4 and it still displays the "Ill-formed Units string" warning, but it does load that sample image otherwise. However, trying to open a much larger file (~52Gb) still raises the same RuntimeError
. I suspect you are right that there is likely a signed 32-bit int overflow in the file size.
It looks like there are actually two issues here. Should we split this issue thread?
@wcaarls Have you been able to figure out anything else about this 32-bit overflow issue?
I was just looking at the bio-formats documentation, and saw that there's a lot of int
in function declarations, for example:
public int getSizeX()
This issue might just be a limitation of bio-formats, with little we can do to solve it?
BioFormatsInterface.java:65
loops from 0 to 5 (the max 5 image dimensions), I don't think there's any overflow there. :)
There's also an int
on line 117, which loops over reader.getImageCount()
. That is also not going to overflow (and getImageCount
returns an int
also).
Oh. It seems that Java uses int
for array sizes and indexing, so arrays cannot hold more than 2^31-1 elements. So this would be a limitation in the language, and not a problem specific to bio-formats (the only thing I blame the authors for is choosing to use Java to build their awesome library).
I guess I'll have to add a warning to the documentation about the maximum image size that can be read through bio-formats.
I haven't looked at it yet. In line 65, oo
is fine, but stride
can become very large.
At least for TIFF files, bio-formats has a test to ensure the image sizes fit in an int
:
import diplib as dip
data = dip.Image((2**31 + 1000, 1), dt="UINT8")
dip.ImageWriteTIFF(data, "test.tif")
dip.ImageRead("test.tif", "bioformats")
produces:
Exception in thread "main" loci.formats.FormatException: Sorry, ImageWidth > 2147483647 is not supported.
at loci.formats.tiff.IFD.getImageWidth(IFD.java:539)
at loci.formats.in.MinimalTiffReader.initFile(MinimalTiffReader.java:490)
at loci.formats.in.BaseTiffReader.initFile(BaseTiffReader.java:609)
at loci.formats.FormatReader.setId(FormatReader.java:1443)
at loci.formats.in.TiffDelegateReader.setId(TiffDelegateReader.java:92)
at loci.formats.ImageReader.setId(ImageReader.java:849)
at org.diplib.BioFormatsInterface.Read(BioFormatsInterface.java:45)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/homebrew/lib/python3.11/site-packages/diplib/__init__.py", line 200, in ImageRead
return javaio_imageread(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Reading JavaIO file: error calling interface class's Read method
in function: dip::FileInformation dip::javaio::ImageReadJavaIO(dip::Image&, const dip::String&, const dip::String&) (/Users/cris/src/diplib/javaio/src/javaio.cpp at line number 140)
in function: dip::FileInformation dip::ImageRead(Image&, const String&, String) (/Users/cris/src/diplib/include/diplib/simple_file_io.h at line number 120)
I don't know how to create a large nd2 file.
Hummm... With an ICS file you get the error message
Exception in thread "main" java.lang.IllegalArgumentException: 0 must be non-null and strictly positive.
That's... not very informative. I guess each file format in bio-formats is a separate bit of code needing to do its own checks. And not all of them do that properly.
The 32-bit limit could perhaps be circumvented by creating multiple ByteByffer
s, with different offsets from Image.Origin
, but that requires some more work...
@crisluengo I committed a change that should allow loading large files. However, you need to make sure a single plane is not too large, otherwise Java runs out of heap space. Please test!
@wcaarls I tried this:
import diplib as dip
data = dip.Image((2**14, 2**14, 10), dt="UINT8")
dip.ImageWriteICS(data, "test.ics")
dip.ImageRead("test.ics", "bioformats")
Before your commit it raised:
Exception in thread "main" java.lang.IllegalArgumentException: capacity < 0: (-1610612736 < 0)
After the commit it succeeds.
Next, I repeated the same, but with random values in data
, and verified that the image read back is identical. It is!
import diplib as dip
data = dip.Image((2**14, 2**14, 10), dt="UINT8")
data = dip.UniformNoise(data, lowerBound=0, upperBound=255)
dip.ImageWriteICS(data, "test.ics")
data2 = dip.ImageRead("test.ics", "bioformats")
print(dip.All(data == data2)[0][0])
So I'm guessing we now require that the image has each dimension < 231-1, and additionally each x-y plane smaller than that number as well. I will update the documentation to reflect this.
Thank you!
Let's not lose track of the µ issue on Windows.
If Java uses Unicode strings (I presume it does), then this is just a matter of replacing all occurrences of "µ" to "u" in the units string. I guess PhysicalQuantity.java:30
is the easy place to do this.
What's Java's function for character replacement?
Component PyDIP v3.4.0 with Bioformats installed.
Describe the bug When I try to open an
.nd2
(part of Bioformats), I get the following error:It seems to me that this is an issue with encoding. As noted in this Stackoverflow answer,
Where the characters
┬╡
correspond toµ
. Using Code Page 437,┬╡
translate toC2B5
in hex, which leads me to believe that it is specifically the "micro" symbol (U+00B5
) rather than the Greek mu, (U+03BC
), despite the two often using the same glyph in certain fonts. Note that both glyphs should be recognized as the "micro" prefix..I do not know enough about Bioformats to know if/how they specify encoding, or if all supported file types use UTF-8.
Also, why do all the JavaIO paths point to
D:\a\diplib\...
? I don't even have aD:
drive connected.To Reproduce Hopefully the above description is adequate. I will try to find a sample
.nd2
file that I can share here for testing (that file in particular happens to be about 50gb)System information: