Algorithms: ArrayIndex out of bounds: 512x512x46 px, 32 bit images input (184 MB)

fiji / Colocalisation_Analysis

Fiji's plugin for colocalization analysis

http://imagej.net/Coloc_2

GNU General Public License v3.0

24 stars 18 forks source link

Algorithms: ArrayIndex out of bounds: 512x512x46 px, 32 bit images input (184 MB) #37

Open chalkie666 opened 9 years ago

chalkie666 commented 9 years ago

@ctrueden an opinion would be great here... then i will do as needed.

two input images - quite large: 512x512x46 px, 32 bit (184 MB)

Looks like a long isnt big enough to count all the pixels????? Should we use BigInteger?

run coloc_2, in 2D Histograms step get:

(Fiji Is Just) ImageJ 2.0.0-rc-34/1.50a; Java 1.7.0_79 [64-bit]; Linux 3.19.0-21-generic; 2132MB of 2421MB (88%)

java.lang.ArrayIndexOutOfBoundsException: 65538 at net.imglib2.img.basictypeaccess.array.LongArray.getValue(LongArray.java:63) at net.imglib2.type.numeric.integer.LongType.get(LongType.java:116) at net.imglib2.type.numeric.integer.LongType.getIntegerLong(LongType.java:133) at algorithms.Histogram2D.generateHistogramData(Histogram2D.java:207) at algorithms.LiHistogram2D.execute(LiHistogram2D.java:95) at Coloc_2.colocalise(Coloc_2.java:447) at Coloc_2.run(Coloc_2.java:160) at ij.IJ.runUserPlugIn(IJ.java:212) at ij.IJ.runPlugIn(IJ.java:176) at ij.Executer.runCommand(Executer.java:132) at ij.Executer.run(Executer.java:65) at java.lang.Thread.run(Thread.java:745)

ctrueden commented 9 years ago

Looks like a long isnt big enough to count all the pixels?????

Nope, looks like a bug in Coloc 2's Histogram2D class, right here. The getXValue and getYValue methods are probably returning invalid positions.

One way to "fix" this bug would be to ditch Coloc 2's histogram code in favor of the net.imglib2.histogram package (built in to ImgLib2 core).

chalkie666 commented 9 years ago

Thanks @ctrueden That sounds like a good plan indeed!

chalkie666 commented 9 years ago

hmmmmm, I don't get why getXValue could be returning a bad value that doesnt fit in a the long data type. Its getting the number from imglib2 :

207: long count = histogram2DCursor.get().getIntegerLong();

is getting the vlaue of the 2D histogram image we are contructing at that pair of x y histogram image coordinates... so we can increment it by 1.

java.lang.ArrayIndexOutOfBoundsException: 65538 at net.imglib2.img.basictypeaccess.array.LongArray.getValue(LongArray.java:63) ..........

65538 is suspiciously close to the 16 bit limit... but a long is +- 2 pow 63 or am i on totally the wrong track here?

I need to learn how imglib2 really does this data access.....

ctrueden commented 9 years ago

Yeah, it is not related to long data type at all.

The issue is with array indexing—trying to access an XY coordinate outside the bounds of the image.

If you need help debugging, can you make an MVCE?

chalkie666 commented 9 years ago

@tomka So this method results in, in some cases, an index to an array of yBins = 256 elements being addressed out of bounds.

/**
 * Calculates the bin width of one bin in x/ch1 direction.
 * @param container The container with images to work on
 * @return The width of one bin in x direction
 */
protected double getXBinWidth(DataContainer<T> container) {
    double ch1Max = getMaxCh1(container);
    if (ch1Max < yBins) {
        // bin widths must not exceed 1
        return 1;
    }
    // we need (ch1Max * width + 0.5) < yBins, but just so, i.e.
    // ch1Max * width + 0.5 == yBins - eps
    // width = (yBins - 0.5 - eps) / ch1Max
    return (yBins - 0.50001) / ch1Max;
}

the return statement seems intuitively wrong as the bin width will be smaller than they should be.... (yBins / ch1Max) - 0.50001 seems more likely to be right? but i dont understand the maths in the comment and i dont get why we need to shift by a half. What is meant by eps?

width here seems to be the inverse of what might naturally be described as a width, or a range. width would naturally be ch1Max / yBins?????? or am i just confused.

tomka commented 9 years ago

EPS represents machine precision and is used to represent the rounding error in floating point numbers. You can usually only say two floating point numbers are very close, but not the same. This is why you check if a floating point number a is zero like this: isZero = abs(a) < EPS. EPS is usually something like 0.0001, but depends on the actual data type (see link above).

Well, if the the condition in the comment ((ch1Max * width + 0.5) < yBins) is true, the returned value is, too. The comment says ch1Max * width + 0.5 should be just below yBins. This means it must be equal to yBins - eps, i.e. yBins minus the precision correction. The equation is correctly re-arranged and eps is replaced by 0.0001 as mentioned above.

So the question is more if the the condition or the general bin width representation is actually correct.

tomka commented 9 years ago

In general the histogram class creates a grid of xBins * yBins bins (set to 256*256). If the maximum value of the image used (channel 1 for X and channel 2 for Y, if not swapped) is ~~larger~~ smaller than xBins and yBins, respectively, bins are not scaled but set to have a width of one (first test in getXBinWidth() and getYBinWidth()), i.e.g use all 256 bins for images with a smaller dynamic range.

If the maximum value of an input image is ~~smaller~~ larger or equal the number of bins, bins are made wider so that all bins are used. For getYBinWidth() this means we say one image data value spans (max/yBins) bins. So if the maximum for channel two would be 128, the bin width would be 128/256, i.e. 0.5 and each image value is mapped to every second bin, effectively skipping one per data value.

Now for getXBinWidth() it is strange indeed that it refers to yBins. I think it should refer to xBins. But given that they are both set to 256 this doesn't really matter and doesn't cause the issue discussed here.

Oh and the shift by a half: This is used, I think, to avoid rounding later when those values get used. You see, the value returned by getYBinWidth() is used with getYValue(). In there, the pixel value is binned into the histogram's Y dimension: (int)(ch1Val * ch1BinWidth + 0.5). Each value is multiplied by the bin width and then put into the resulting bin (well, in the code the Y binning is also mirrored by subtracting the result from yBins - 1). Adding 0.5 is a trick to have faster rounding: Instead of writing Math.round(ch1Val * ch1BinWidth), you could also cast directly to int and cut away any decimals if you add 0.5 beforehand. For instance (int)(0.7) will yield 0, but (int)(0.7 + 0.5) will yield 1. While (int)(0.4 + 0.5) will yield 0.

And this is also the reason the bin width uses this 0.50001, which is 0.5 + eps: For 256 bins and a max Y value of say 511, we would get a y bin width of (256 - 0.50001 / 511) = 0.499999. Without subtracting 0.50001 we get 0.50097). This, however,would cause a value of 511 to be mapped to the wrong bin index: (int)(511 * 0.50097 + 0.5) yields 256, while the last index is 255. We get the correct index of 255 if we subtract the 0.5 (compensated for rounding errors): (int)(511 * 0.49999 + 0.5).

However, I agree that this logic is hard to read and might introduce problems like the one discussed in this issue. I might be able to squeeze in some time at the end of this week to see what causes this out of bounds exception, if you didn't solve it until then.

chalkie666 commented 9 years ago

@tomka i think you meant: If the maximum value of the image used (channel 1 for X and channel 2 for Y, if not swapped) is SMALLER than xBins and yBins, respectively, bins are not scaled but set to have a width of one ???? that is what the code says.... if (ch1Max < yBins) {

Do we need to shift by 0.5? Do we need to roubf up? Or just always round down? Should we round 0.6 to 1, or 0? I would say 0.... its a histogram, so values from 0-0.999999 go in thee first bin, then values from 1-1.999999 go in the second bin

I wonder what the imagej2 histogram class does... which we should migrate to when time allows.... i will have a look.

Is bin width in units of intensity or 1/intensity? Should be intensity, i think.... so return (yBins - 0.50001) / ch1Max; feels weird... where return ch1Max / (yBins - 0.50001) ; seems to make more sense.... i cant think of a good word to describe 1/binWidth.... binFrequency? but binWidth should be a range of intensities, right? l eg 13.4567-23.456678

tomka commented 9 years ago

i think you meant: If the maximum value of the image used (channel 1 for X and channel 2 for Y, if not > swapped) is SMALLER than xBins and yBins, respectively, bins are not scaled but set to have a width of one ???? that is what the code says.... if (ch1Max < yBins) {

Oh well, yes that's what I meant. Thanks for the correction and sorry for the confusion. I'll fix it above as well.

Do we need to shift by 0.5?

With the current implementation, yes. See example above for why.

Do we need to roubf up? Or just always round down? Should we round 0.6 to 1, or 0? I would say 0....

Currently we round [0,0.5[ to 0 and [0.5,1[ to 1. But this happens after a potentially way larger number was mapped to the bins, so we should keep it like this.

its a histogram, so values from 0-0.999999 go in thee first bin, then values from 1-1.999999 go in the second bin

Well, with resepect to floating point precision limits (eps), 0.999999 would be considered as 1. So we have to do rounding anyway to deal with precision limits. And since we usually map much larger values into our 256 bins, rounding makes perfect sense in me.

Is bin width in units of intensity or 1/intensity? Should be intensity, i think.... so return (yBins - 0.50001) / ch1Max; feels weird... where return ch1Max / (yBins - 0.50001) ; seems to make more sense.... i cant think of a good word to describe 1/binWidth.... binFrequency? but binWidth should be a range of intensities, right? l eg 13.4567-23.456678

No, binWidth is merely a scaling factor. It will always be between zero and one. It is a used to scale larger values into our histogram range ([0,255]). One example I gave above mapped data with a max value of 511 to the 256 bin histogram. The resulting bin width (scaling) was 0.499999. So every input value has to be scaled by this to linearly map into the histogram.

And I don't see a reason why the inverse scaling factor would make more sense. Especially given that we currently can multiply it (cheap operation) with an actual input value rather than dividing by it (more expensive).

chalkie666 commented 9 years ago

OK. So I was just confused by the word width in This sense. If we change it to binScaleFactor it all makes sense.

;-) On 19 Aug 2015 19:22, "Tom Kazimiers" notifications@github.com wrote:

i think you meant: If the maximum value of the image used (channel 1 for X and channel 2 for Y, if not > swapped) is SMALLER than xBins and yBins, respectively, bins are not scaled but set to have a width of one ???? that is what the code says.... if (ch1Max < yBins) {

Oh well, yes that's what I meant. Thanks for the correction and sorry for the confusion. I'll fix it above as well.

Do we need to shift by 0.5?

With the current implementation, yes. See example above for why.

Do we need to roubf up? Or just always round down? Should we round 0.6 to 1, or 0? I would say 0....

Currently we round [0,0.5[ to 0 and [0.5,1[ to 1. But this happens after a potentially way larger number was mapped to the bins, so we should keep it like this.

its a histogram, so values from 0-0.999999 go in thee first bin, then values from 1-1.999999 go in the second bin

Well, with resepect to floating point precision limits (eps), 0.999999 would be considered as 1. So we have to do rounding anyway to deal with precision limits. And since we usually map much larger values into our 256 bins, rounding makes perfect sense in me.

Is bin width in units of intensity or 1/intensity? Should be intensity, i think.... so return (yBins - 0.50001) / ch1Max; feels weird... where return ch1Max / (yBins - 0.50001) ; seems to make more sense.... i cant think of a good word to describe 1/binWidth.... binFrequency? but binWidth should be a range of intensities, right? l eg 13.4567-23.456678

No, binWidth is merely a scaling factor. It will always be between zero and one. It is a used to scale larger values into our histogram range ( [0,255]). One example I gave above mapped data with a max value of 511 to the 256 bin histogram. The resulting bin width (scaling) was 0.499999. So every input value has to be scaled by this to linearly map into the histogram.

And I don't see a reason why the inverse scaling factor would make more sense. Especially given that we currently can multiply it (cheap operation) with an actual input value rather than dividing by it (more expensive).

— Reply to this email directly or view it on GitHub https://github.com/fiji/Colocalisation_Analysis/issues/37#issuecomment-132701546 .