Cached value of CvMat.fullSize and *Buffer fields might get incorrect

GoogleCodeExporter commented 9 years ago

Consider the following code:

  CvMat mat = CvMat.createHeader(10, 10, CV_8U);
  cvGetSubRect(largerImage, mat, cvRect(100, 100, 10, 10));

Now mat.getByteBuffer() returns a buffer with limit=100, while it should be 
close to largerImage.fullSize, as mat now refers to the data of a larger image.

To work the problem around call reset() before calling getByteBuffer().

I use JavaCV 0.5.

I think you should not cache fullSize inside of CvMat at all, as you never 
know, when it changes. Various oprations reallocate data, if the target mat 
does not fit the expected dimensions/type, which in turn changes the fullSize.

Original issue reported on code.google.com by viliam.d...@gmail.com on 13 May 2013 at 11:25

GoogleCodeExporter commented 9 years ago

Thanks for reporting this issue! I hadn't really noticed that problem before, 
but yes, you are right, this is a problem with C++ functions.

Actually, the same thing can happen with the *Buffer fields. I'm not sure how 
to handle this efficiently. Any bright ideas?

Original comment by samuel.a...@gmail.com on 26 May 2013 at 11:34

Changed title: *Cached value of CvMat.fullSize and Buffer fields might get incorrect**
Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I would remove all the fields, because they can't be updated reliably. The 
getXxxBuffer() methods should always return fresh Buffer instance, with freshly 
calculated capacity. We could encourage users to reuse *Buffer instance. I 
would deprecate fullSize() and reset() methods.

By the way, CvMat.size() calculation is also incorrect in case of subMat. It 
should be:

    public int size() {
        int rows = rows();
        return cols()*elemSize()*channels() + (rows > 1 ? step()*(rows-1) : 0);
    }

For example consider:

    CvMat mat = CvMat.create(10, 10, CV_8U);
    CvMat matSub = CvMat.createHeader(5, 5, CV_8U);
    cvGetSubRect(mat, matSub, cvRect(5, 5, 5, 5));

The size of data of mat is 100 bytes, step is 10 (let's ignore alignment). 
matSub's data point to mat.data + 55, that is the same data, offset by upper 
left corner of the rectangle. Current version of size() method will return 50 
for matSub, which will go beyond the mat.data length, because 50+55 > 100. This 
problem is minor, as it does not affect correctly implemented programs, it only 
allows buffer overflow errors, that are otherwise prevented.

Original comment by viliam.d...@gmail.com on 28 May 2013 at 4:52

GoogleCodeExporter commented 9 years ago

Creating a Buffer is pretty expensive, and isn't that great for the garbage 
collector either, so creating one for each call to CvMat.put() isn't an option. 
That's why I'm doing this caching in the first place. It wasn't a problem with 
the simpler (IMHO better) C API... Any thoughts?

Thanks for the the size() fix! I'll put that in :)

Original comment by samuel.a...@gmail.com on 2 Jun 2013 at 1:04

GoogleCodeExporter commented 9 years ago

I would deprecate all get() and put() methods, as they are slow and might get 
incorrect without reset().

When I started to use JavaCV, I began to use CvMat.get and .put methods, but 
soon stopped to, as it is much slower when compared to using the Buffer 
directly. JNI calls are expensive too, and the put(int i, double v) makes one 
call and put(int i, int j, double v) makes three more. Next, there is 
unnecessary conversion from double and back, which is not free (maybe it is 
optimized out by JIT, I did not test).

Compare these snippets. They all do one get, multiply and put per pixel. The 
first uses the (i, j) version, second uses the (i) version, the third uses the 
Buffer directly and the fourth copies the buffer to Java array and back:

public static void testIJ() {
    long start = System.nanoTime();
    CvMat mat = CvMat.create(1000, 1000, CV_8U);
    for (int i=0; i<1000; i++)
        for (int j=0; j<1000; j++) {
            byte val = (byte) mat.get(i, j);
            val = (byte) (2*val);
            mat.put(i, j, val);
        }
    System.out.println((System.nanoTime() - start) / 1000000 + " ms");
}

public static void testI() {
    long start = System.nanoTime();
    CvMat mat = CvMat.create(1000, 1000, CV_8U);
    int step = mat.step() / mat.elemSize();
    int channels = mat.nChannels();

    for (int i=0; i<1000; i++)
        for (int j=0; j<1000; j++) {
            byte val = (byte) mat.get(step + j*channels);
            val = (byte) (2*val);
            mat.put(step + j*channels, val);
        }
    System.out.println((System.nanoTime() - start) / 1000000 + " ms");
}

public static void testDirect() {
    long start = System.nanoTime();
    CvMat mat = CvMat.create(1000, 1000, CV_8U);
    int step = mat.step() / mat.elemSize();
    int channels = mat.nChannels();
    ByteBuffer buffer = mat.getByteBuffer();

    for (int i=0; i<1000; i++)
        for (int j=0; j<1000; j++) {
            byte val = buffer.get(step + j*channels);
            val = (byte) (2*val);
            buffer.put(step + j*channels, val);
        }
    System.out.println((System.nanoTime() - start) / 1000 + " us");
}

public static void testJavaArray() {
    long start = System.nanoTime();
    CvMat mat = CvMat.create(1000, 1000, CV_8U);
    int step = mat.step() / mat.elemSize();
    int channels = mat.nChannels();
    ByteBuffer buffer = mat.getByteBuffer();
    byte[] jbuffer = new byte[mat.size()];
    // I assume the buffer is continuous, as the mat.isContinuous() always 
    // returns false
    buffer.get(jbuffer);

    for (int i=0; i<1000; i++)
        for (int j=0; j<1000; j++) {
            byte val = jbuffer[step + j*channels];
            val = (byte) (2*val);
            jbuffer[step + j*channels] = val;
        }

    buffer.position(0);
    buffer.put(jbuffer);

    System.out.println((System.nanoTime() - start) / 1000 + " us");
}

Resulting times (after 10 iterations):

TEST    HOTSPOT VM 1.6, 3GHZ ATHLON X2       1GHZ 1CORE ANDROID 2.3.5 (no JIT)
testIJ                           280ms                  9800ms
testI                             65ms                  4500ms
testDirect                       3.5ms                  1450ms
testJavaArray                    5.1ms                    60ms

Surprising is the Android result, which I added for information. Probably due 
to the lack of JIT, even Buffer access is very slow. On desktop VM the 
JavaArray version is a bit slower than via the Buffer, due to unnecessary 
copying of the buffer data.

If you want to simplify pixel access for the users, you could add interface say 
"PixelAccessor", which will do the expensive operations of creating Buffer and 
calculating size upon initialization and users will be asked to reuse the 
instance. I attach BytePixelAccessor as an example, similar class should be 
created for other types. You can add createBytePixelAccessor() method to CvMat 
and IplImage (not getBytePixelAccessor(), so the name implies something is 
created upon every invocation).

Viliam

Original comment by viliam.d...@gmail.com on 3 Jun 2013 at 5:30

Attachments:

BytePixelAccessor.java

GoogleCodeExporter commented 9 years ago

They are not slow with floating-point matrices (native function call isn't slow 
BTW), but I agree they are not ideal, and shouldn't be used for integer 
matrices...

Anyway, thanks for all the recommendations! I'm not sure how this is going to 
turn out. The thing is OpenCV itself switched to "cv::Mat", and I'm currently 
working on something to help me wrap this kind of thing more easily (not just 
for OpenCV), without having to create all the .java files by hand... When and 
if I succeed, then I would start looking into better ways of accessing these 
structures in Java, i.e.: your accessor class.

But it might take much time before I come up with anything :(

What do you think about all this?

Original comment by samuel.a...@gmail.com on 8 Jun 2013 at 1:15

GoogleCodeExporter commented 9 years ago

I don't understand your statement that floating-point matrices aren't slow. I 
did a test with 32F matrices, and here are the results:

TEST    HOTSPOT VM 1.6, 3GHZ ATHLON X2       1GHZ 1CORE ANDROID 2.3.5 (no JIT)
testIJ_8u                        231ms                  9715ms
testI_8u                          59ms                  4467ms
testDirect_8u                    3.4ms                  1460ms
testJavaArray_8u                13.0ms                   120ms
testIJ_32f                       233ms                 10267ms
testI_32f                         65ms                  5050ms
testDirect_32f                   2.8ms                  2199ms
testJavaArray_32f               13.0ms                  1143ms

I did also test 64f matrices, and the times were roughly the same as float. The 
testJavaArray_32f is much slower than testJavaArray_8u on android, because 
probably some memory limit was approached (I saw grow heap operation after 
every test run during 32f, but not during 8u). Each test was run 10times, the 
result was taken by averaging the last five runs.

There *is* JNI call overhead. If you call, say, five operations on Mat, then 
the overhead is negligible, but if you call CvMat.get(i, j) 5-million times, 
which itself makes 5 JNI calls, then it is not. I first noticed this issue, 
when processing the HoughLines result. I iterated the returned collection of 
lines many times, and just by copying the lines to java array instead of 
calling CvPoint.x() and .y() it run noticeably faster.

I can contribute the PixelAccessor enhancement as a patch, if you wish.

Original comment by viliam.d...@gmail.com on 10 Jun 2013 at 5:03

GoogleCodeExporter commented 9 years ago

FYI, here is the benchmark: http://pastebin.com/DtGBkzyk

Original comment by viliam.d...@gmail.com on 10 Jun 2013 at 5:06

GoogleCodeExporter commented 9 years ago

Well, slow is relative, not as slow as creating a Buffer object :) But of 
course it's still not optimal. And yes Android/Dalvik does suck however one 
looks at it.

BTW, your "accessor" solution reminds me of this article:

Java Programming for High-Performance Numerical Computing
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.1554 

Actually, I feel this is something that should be included in JavaCPP. The way 
I see it we'd have a collection of classes, e.g.: ByteAccessor, ShortAccessor, 
etc. with constructors that accept a Buffer and the dimensions, and a 
collection of appropriate get() and put() methods for 2D and 3D access. 

Then, we could include factory methods in the corresponding Pointer classes, as 
well as others like CvMat, CvPoint, etc.

What do you think? Is this something you'd like to contribute? Sounds really 
useful in any case :)

And to cope with Android, instead of a Buffer, the accessors could also accept 
a Pointer, slurping and dumping native memory on user request only.

Original comment by samuel.a...@gmail.com on 13 Jun 2013 at 7:24

GoogleCodeExporter commented 9 years ago

I'd like to contribute. Lets summarize what has to be done:
- deprecate getXxxBuffer and fullSize methods on CvMat
- add XxxPixelAccessor classes
- add factory methods for PixelAccessors to CvMat and IplImage
- add CopyingPixelAccessor for Android (and others without JIT)

We should consider the PixelAccessor classes to be convenience classes for 
developer. The direct Buffer access is still fastest, even only slightly. This 
is about avoiding the complexity for the developer and avoid possible error 
when calculating the indexes. There is no point in adding it to the Pointer 
class, as it has no internal structure.

Regarding the CvPoint class, there are bulk put and get methods to and from 
java (get and put) which could be rewritten using buffers too:

    public final CvPoint put(int[] pts, int offset, int length) {
        asByteBuffer().asIntBuffer().put(javapoints, offset, length);
    }

    public CvPoint get(int[] pts, int offset, int length) {
        points.asByteBuffer().asIntBuffer().get(javapoints, offset, length);
    }

These are approximately 10 times faster than the current versions (tested on 
array of 10000 points on desktop java). Also, PointAccessor could be written, 
that could use Buffer access. It is about 30 times faster than using x() and 
y() methods. See the benchmark: http://pastebin.com/bUC6u5R7

Actually, such class could be written for any c++ class, and, as you see, it 
strongly depends on the internal structure of the class. These files should be 
generated...

I read a JNI performance article, and one of the recommendations was to keep 
the number of needed JNI calls low, because there is overhead. As from my 
experience, such algorithms should be written in C(++), and call them from JNI. 
Java should be used to call complex operations only (for example operations on 
Mats), and not to do pixel processing. Or for prototyping.

By the way, we are rewriting our entire algorithm to C++, as we use lot of 
pixel processing and point processing, and on android it will be a lot faster. 
Further, we plan iOS version too, so the code can be reused.

Original comment by viliam.d...@gmail.com on 14 Jun 2013 at 5:55

GoogleCodeExporter commented 9 years ago

I understand there are a lot of little details to take care of, but first 
things first: We need those "accessor" classes. Now, we could have a 
BytePixelAccessor, a BytePointerAccessor, etc. but I feel that having one 
ByteAccessor class usable on *any* structure would be more useful. Do you not 
feel this would be better?

Original comment by samuel.a...@gmail.com on 14 Jun 2013 at 1:22

GoogleCodeExporter commented 9 years ago

Let me try to explain this a bit better. Assume we have some `CvMat mat` and 
`CvPoint pts`, then I think it would be nice to be able to do something like 
this:

IntAccessor matAccessor = mat.getIntAccessor(); // or create ...
IntAccessor ptsAccessor = pts.getIntAccessor(); 
matAccessor.put(1, 1, ptsAccessor.get(1, 1));

Of course we'd have FloatAccessor and what not, and maybe the names could be 
different too, but what I want to emphasize is that the reuse of the common 
class, used by both CvMat and CvPoint in this case. Do you not feel this would 
be a good thing?

Original comment by samuel.a...@gmail.com on 16 Jun 2013 at 1:51

GoogleCodeExporter commented 9 years ago

The accessors are about structure. CvMat.data do have structure, CvPoint does 
too, but Pointer does not. They enable to write 

  matAccessor.get(i, j)

instead of 

  mat.getByteBuffer().get(i*mat.step() + j*mat.channels());

PixelAccessor saves the JNI calls and buffer creation (lets assume 
getByteBuffer does not return cached instance). PixelAccessor offers no more 
functionality than the current get() methods on CvMat, only it's lifetime is 
separate from that of CvMat. 

Try to write Accessor for Pointer. I think there would be no more functionality 
than in java.nio.XxxBuffer, so I think is needless.

Original comment by viliam.d...@gmail.com on 17 Jun 2013 at 4:03

GoogleCodeExporter commented 9 years ago

"Multidimensional arrays" in C/C++, the ones accessed like this `mat[i][j]`, 
are actually allocated in one contiguous area of memory: Check it up! Java's 
Buffer provides no way to access those. I think it would be very useful to 
provide an interface for native multidimensional arrays. Why do you feel 
differently?

Original comment by samuel.a...@gmail.com on 18 Jun 2013 at 12:30

GoogleCodeExporter commented 9 years ago

Mat is not multidimensional array, neither is Pointer or Mat.data. Actually, i 
have implemented the accessors already, but have not done any tests. I will 
send patch then, and then try to implement Accessor for Pointer yourself, 
you'll see there's no point in it.

Original comment by viliam.d...@gmail.com on 18 Jun 2013 at 1:06

GoogleCodeExporter commented 9 years ago

Sure, please explain how you would access the pixel at (100, 100) of a 
dc1394video_frame_t.image :
http://code.google.com/p/javacv/source/browse/src/main/java/com/googlecode/javac
v/cpp/dc1394.java#834

Original comment by samuel.a...@gmail.com on 19 Jun 2013 at 2:00

GoogleCodeExporter commented 9 years ago

I don't know, it don't know it's structure. If it's simple multidimensional 
array, then it's image.capacity(rows*cols).asByteBuffer().get(100*cols+100). I 
would cache the buffer for access for fast access to other pixels.

Original comment by viliam.d...@gmail.com on 19 Jun 2013 at 3:57

GoogleCodeExporter commented 9 years ago

Here is the patch. Have a look at it and let me know.

Original comment by viliam.d...@gmail.com on 19 Jun 2013 at 4:32

Attachments:

0001-pixel-accessor-API.patch

GoogleCodeExporter commented 9 years ago

I deprecated the getXxxBuffer() methods in CvMat, but found out that I still 
need them to create android bitmap:

  bitmap.copyPixelsFromBuffer(cvImage.getByteBuffer());

For backwards compatibility, I would still deprecate getXxxBuffer methods and 
create new createXxxBuffer() methods, that will always return new buffer. What 
do you think?

Original comment by viliam.d...@gmail.com on 19 Jun 2013 at 9:49

GoogleCodeExporter commented 9 years ago

Let's go back to comment #16 first. Wouldn't it be nice instead to be able to 
do something like this:
    ByteAccessor a = new ByteAccessor(image, rows, cols);
    byte b = a.get(100, 100);
?

Original comment by samuel.a...@gmail.com on 19 Jun 2013 at 12:11

GoogleCodeExporter commented 9 years ago

Sorry for not responding, I'm quite busy now... 

Now I got it: the CvMat's pixelaccessor could be generalized to 
multidimensional array accessor. But it not only has to know each dimension's 
length, but the step too.

Now, I would put these to JavaCPP project, not to JavaCV. In JavaCV, there 
could be factory methods in CvMat to simplify it's creation. What do you think?

Original comment by viliam.d...@gmail.com on 26 Jun 2013 at 8:02

GoogleCodeExporter commented 9 years ago

Yes!!

We don't need to name it "step", even though for images obviously that's what 
is going to be used for the "width". We don't need to know the actual width of 
an image to access one pixel in memory. And to generalize more easily, we could 
keep the width, height, step or whatever in one "int[] dimensions" array, and 
it would even possible to have get/put methods with varargs, but I'm not sure 
how that would play with specialized 2D and 3D methods... ?

Anyway, as names I think ByteAccessor, IntAccessor, etc. sound fine, but what 
do you think?

When you have something post it here:
http://code.google.com/p/javacpp/issues/
Thanks!

Original comment by samuel.a...@gmail.com on 27 Jun 2013 at 12:47

GoogleCodeExporter commented 9 years ago

Bump! It would be really nice to have those classes in JavaCPP. Please let me 
know if you are still interested in working on this, thanks!

Original comment by samuel.a...@gmail.com on 22 Sep 2013 at 12:46

GoogleCodeExporter commented 9 years ago

Yes, I definitely plan to do it, but we postponed our CV project due to other 
tasks, but if nothing changes, I will get back in two-three weeks.

Original comment by viliam.d...@gmail.com on 25 Sep 2013 at 11:08

GoogleCodeExporter commented 9 years ago

Hello,

I started to write the API. After writing it, I realized, that the code without 
this API is so simple that I hesitate whether new API is really needed. 

Example: Let's say we write a function to calculate sum of all pixels of 
1-channel image. Using the new pixel accessor API it would be:

public long imgSum(CvMat mat) {
  ByteArrayAccessor matacc = mat.createBytePixelAccessor(false);
  // cache to avoid repetitive JNI call
  int cols = mat.cols();
  int rows = mat.rows();
  long sum = 0;
  for (int i=0; i<rows; i++)
    for (int j=0; j<cols; j++)
      sum += matacc.get(i, j) & 0xff; // 0xff to convert to unsigned

  return sum;
}

By directly accessing the Buffer, the code would be:

public long imgSum(CvMat mat) {
  ByteBuffer buf = mat.createDataByteBuffer();
  int step = mat.step();
  int rows = mat.rows();
  int cols = mat.cols();
  long sum = 0;
  for (int i=0; i<rows; i++) 
    for (int j=0; j<cols; j++) 
      sum += buf.get(i*step + j) & 0xff;

  return sum;
}

The only saving is the formula i*step + j, that is hidden. Is this worth a new 
API?

Smarter programmer could even write it little faster like this:

public long imgSum(CvMat mat) {
  ByteBuffer buf = mat.createDataByteBuffer();
  int step = mat.step(), rows = mat.rows(), cols = mat.cols();
  long sum = 0;

  for (int i=0; i<rows; i++) {
    int pos = i*step;
    for (int j=0; j<cols; j++, pos++) 
      sum += buf.get(pos) & 0xff;
  }

  return sum;
}

This avoids i*step multiplication at every pixel, the trick I've learned by 
reading OpenCV's code... This trick is not possible with pixel accessor.

If I have time tomorrow, I'll benchmark all three versions.

Please have a look at the patches before I continue with other primitive types 
and let me know what do you think.

Viliam

Original comment by viliam.d...@gmail.com on 12 Nov 2013 at 10:15

Attachments:

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I just remembered, that step is always in bytes, it should be divided with 
elemSize(). It's long I did not work with OpenCV. I will rewrite it later.

BTW, our computer vision project is still postponed, maybe for very long. I'm 
sad of that, because it was very interesting for me and it would surely be very 
good selling mobile application. And finally, we plan to use neither javacpp 
nor javacv in the final product, due to performance. We do lot of pixel 
processing, and comparable function in OpenCV written in C runs 10 times faster 
than ours very similar in Java. That's on android, on desktop, the difference 
is not so big (probably due to JIT). But they were very good for us for initial 
prototypes, so I feel I owe to contribute...

Original comment by viliam.d...@gmail.com on 12 Nov 2013 at 10:28

GoogleCodeExporter commented 9 years ago

It's fine to use Buffer directly, but it does get complicated when accessing 
many elements of a matrix. It's a lot easier if we can specify the indices as 
we would in MATLAB for example. 

Your API looks fine, thanks! Still, I would shorten the names to ByteAccessor, 
etc instead of ByteArray..., BytePixel..., etc. because it's shorter and it 
works for direct NIO Buffer objects as well, not just arrays or pixels, so it 
removes potential confusion. And com.googlecode.javacpp.accessors (should this 
be singular or plural?) is good as package name I think. Putting it in a 
separate package also indicates that they may be used independently from the 
classes in the main package, so good idea, thanks!

BTW, we can still use JavaCPP to call custom C/C++ functions. It's much nicer 
than using raw JNI with the NDK, while imposing almost no additional overhead.

Original comment by samuel.a...@gmail.com on 17 Nov 2013 at 2:08

GoogleCodeExporter commented 9 years ago

Okay, so I'll finish the rest. I prefer "accessors" (plural). I also prefer 
ByteArrayAccessor, instead of ByteAccessor, because it improves readability. 
IDEs are to complete names. Anyway, I'll write "BAA" in eclipse, and I'll 
probably get better match than with "BA". I even thought of 
ByteNdArrayAccessor, but did not like this. But I'll do what you say, let me 
know.

I'd like to prepare an article, where I will summarize performance 
characteristics of each way, maybe on wiki. Then, I'd like to refer to that 
from the javadocs. There I could summarize individual ways of accessing pixels 
along with their benchmarks.

Original comment by viliam.d...@gmail.com on 18 Nov 2013 at 6:03

GoogleCodeExporter commented 9 years ago

Regarding the usage of JavaCpp in our project, we only have two JNI methods. I 
never used JNI before, so it was a way to learn it and to be able to better 
understand javacpp, because, due to the lack of documentation and no experience 
with JNI it seemed complicated to me...

Original comment by viliam.d...@gmail.com on 18 Nov 2013 at 6:06

GoogleCodeExporter commented 9 years ago

I'm not sure it's a good idea to name "ByteArrayAccessor" something that 
doesn't access a byte array (it's technically accessing a buffer in the 
non-copied case at least)... What do you think?

And something I've just noticed from your code, but `ByteBuffer.put(byte[])` is 
very very slow on Android, that's why I added `BytePointer.put(byte[])`:
    https://code.google.com/p/javacpp/issues/detail?id=11
So, let's think a bit more about that, and your `copyBack()` method, before 
going further.

I've added you as contributor, so please feel free to add a wiki page! Maybe 
you'd prefer to add one to the JavaCPP site though?

If you have only two JNI methods, maybe JavaCPP isn't so useful yes.

Original comment by samuel.a...@gmail.com on 24 Nov 2013 at 2:01

GoogleCodeExporter commented 9 years ago

Regarding the issue 11, it seems that it is some improper buffer implementation 
in android. That's why I even created the otherwise odd Copying version. Maybe 
if I use pointer.get(int) method, that would be faster and make Copying version 
useless. Have to do benchmark.

Regarding the name, what better name you have? We started with 
BytePixelAccessor for CvMat, but then generalized it to all arrays, so 
ByteArrayAccessors. Alternatives that come to my mind are:
  ByteNdArrayAccesor
  ByteNdElementAccessor
  ByteNdArrayElementAccessor
  ByteAccessor
  ByteUtils
  BytePointerUtil
  BytePointerAccessor
  ByteMdArrayAccessor
  ByteMultiDimArrayAccessor
  ByteElementAccessor
  ElementAccessorByte
  ArrayElemAccessorByte
  ByteArrayElemAccessor

I think, the only self-describing is the longest. Any other would be explained 
in javadoc.

Original comment by viliam.d...@gmail.com on 25 Nov 2013 at 3:17

GoogleCodeExporter commented 9 years ago

JNI calls are not as efficient as accessing array elements, so the copying 
accessor makes sense. But we still need JNI to copy the whole array, something 
like BytePointer.get(byte[])/put(byte[]), and we might as well use that.

So, the names could be ByteBufferAccessor and BytePointerAccessor. I think 
that's pretty self-describing? Maybe some other name than "Accessor" could 
clarify the meaning. Although I wouldn't want to lengthen the name further, so 
I'm not sure what...

Original comment by samuel.a...@gmail.com on 26 Nov 2013 at 6:25

GoogleCodeExporter commented 9 years ago

Maybe we could name them "Indexer" since it's a concept already used in C# for 
multidimensional arrays, but not in Java?
http://www.javacamp.org/javavscsharp/indexer.html
http://msdn.microsoft.com/en-US/library/2yd9wwz4%28v=vs.80%29.aspx

That single word carries both the meaning of Array and Accessor, so we could 
shorten the names, i.e.: ByteIndexer...

Original comment by samuel.a...@gmail.com on 30 Nov 2013 at 4:57

GoogleCodeExporter commented 9 years ago

Okay, so here are complete patches with Indexer API. Apply both patches to 
javacpp (the first adds "accessors", the second deletes it and adds "indexers", 
I was lazy to do a clean clone...)

I liked the "Indexer" name most, so I completed it using this. I also did 
JUnit-like test, but it is not included in patches, as you don't have JUnit set 
up in your project.

I also did performance test, with interesting results: 

Test configurations
1. jdk 1.6.0_41-64bit, athlon x2 64, 2-core 3GHz, run with -server switch
2. jdk 1.7.0_17-64bit, athlon x2 64, 2-core 3GHz, run with -server switch
3. Android 2.3.5, Qualcomm MSM8255 1-core 1GHz

                   (1)       (2)        (3)
sumBuffer         3507      3432    2 835 870 
sumPointer       93962     93992    1 683 551 
sumBufferCopy     5176      5275      132 824 
sumPointerCopy    3824      4112      125 451 
sumIndexer        3342     17884    3 329 522 
sumIndexerCopy    4234     20491      742 822 
cvSum             1885      1866       16 809 
addBuffer         7688      7375    6 068 920 
addPointer      207170    238679    3 636 462 
addBufferCopy     9399      9423      137 023 
addPointerCopy    6683      6488      135 833 
addIndexer        7277     20948    7 160 247 
addIndexerCopy    7628      8221    1 317 675 
cvAddS            1986      1974       33 465 

Test sources are in attached Test.java. It tests two operations (intensive to 
pixel access). "Sum" adds all pixel values (only does reading), "add" modifies 
pixels by incrementing by one. As you see, there are huge differences. 

The "never do" version is accessing value through pointer element by element 
(sumPointer and addPointer). Android's "never do" is to directly access native 
pixels, but rather copy them. I did not reproduce the android slowness with 
ByteBuffer.get(int[]) (compare sumBufferCopy and sumPointerCopy)

Indexer access is a bit slower than direct work with buffer or with copied 
array on android, but it's code is the simplest. This might be due to 
unnecessary method calls, and due to I cannot use premultiplied value the way I 
do it when accessing buffer. Strange is, that it is lot slower on JDK 1.7...

Not surprisingly, C-version is always fastest (1.7-3.5 times on HotSpot and 4-7 
times on Android. And it is not dependent on JIT/GC mood :)

I have not a device with newer android handy, can you do a test with that? I 
expect, this problem will disappear (or maybe disappeared) one day.

My final recommendations is to always prefer C. Use JavaCV only if:
- you don't need to access pixels, if you only call OpenCV's bulk operations 
(such as a sequence of various image transformations)
- you need portability (but you will need native libraries anyway)
- you are prototyping (i.e. less segfaults in java)
- your algorithm is fast enough

If you decide to use pixel access, use these method:
Android:
1. access copied data manually
2. use copying indexer for less code and slower speed (6-10 times)
On JDK
1. access buffer manually
2. use direct indexer for less code and slower speed (may not be slower, but 
may be up to 5 times slower). Run on a good JRE :).
3. the more times you access the same pixel, the more prefer copying the array
General:
Always create a copy using Pointer.get(), not Buffer.get().

---
The above could be a basis for a wiki page I recommend to write. We could link 
to it in the sources.

Please review the patches, and, if you can, try to review the benchmark too, if 
I have not done something wrong.

Viliam

Original comment by viliam.d...@gmail.com on 11 Dec 2013 at 9:51

Attachments:

GoogleCodeExporter commented 9 years ago

I accidentaly got an Nexus 10 with Android 4.4 handy with ARM Cortex A15 2-core 
1.7 GHz, here are the results:

sumBuffer        965 584
sumPointer     5 105 642
sumBufferCopy     34 331
sumPointerCopy    34 406
sumIndexer     1 151 433
sumIndexerCopy   165 345
cvSum              3 335
addBuffer      2 128 167
addPointer    11 186 691
addBufferCopy     35 940
addPointerCopy    35 791
addIndexer     2 574 975
addIndexerCopy   410 906
cvAddS             8 068

Still no change, buffer element access is slow, buffer bulk access is as fast 
as pointer's, and C is 10 times faster (in this case opencv cvSum or cvAddS 
might even used multiple cores, did not check the code). Still, copying is lot 
faster.

Original comment by viliam.d...@gmail.com on 11 Dec 2013 at 10:47

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Great, thanks! It's going to take me a while to review that. It's actually 
important functionality I think, so I want to make sure we get the API right. I 
have a couple of comments for now though:
1. Please do provide patches that apply cleanly to the HEAD of master branches 
(no need right now though)
2. Are you OK with licensing that under the GPLv2 with Classpath exception? Or 
would you prefer something else?
3. Maven supports unit testing if we place files in the right directories, so 
feel free to put stuff there
4. You seem to provide a couple of unrelated fixes for opencv_core.java. Are 
all the changes that are in there indeed fixes?
5. Have you tried to write data back to native memory? Direct NIO buffers on 
Android should be pretty slow in that case.

Thanks for everything!

Original comment by samuel.a...@gmail.com on 15 Dec 2013 at 7:49

GoogleCodeExporter commented 9 years ago

1. Patches should apply to the HEAD. There are just two patches, because I did 
it in two commits
2. Yes
3. I'll try, hopefully next week.
4. There is only one unrelated fix for CvMat.isContinuous(). It always returned 
false, because it used type() instead of raw_type(), and type() has some bits 
cleared, including the continuous bit.
5. The "sum" benchmark does not write back, it only reads pixels. The "add" 
benchmark does (it modifies pixels).

Original comment by viliam.d...@gmail.com on 15 Dec 2013 at 11:16

GoogleCodeExporter commented 9 years ago

5. Does this mean both Buffer.put(array)/get(array) work efficiently on 
Android? I wonder why Adam found different: 
https://code.google.com/p/javacpp/issues/detail?id=11

Original comment by samuel.a...@gmail.com on 15 Dec 2013 at 1:04

GoogleCodeExporter commented 9 years ago

I reproduced issue 11. I actually wonder, why it does not reproduce in my test 
case. Maybe because I've tested only ByteBuffer, but in issue 11 they have 
problem with floats or ints, I will have to give it a try... 

Anyway, I use Pointer's bulk methods for copying in the Copying indexers.

Original comment by viliam.d...@gmail.com on 15 Dec 2013 at 1:26

GoogleCodeExporter commented 9 years ago

Ok, so let's take a step back. "NIO" stands for "New I/O", and the intention 
was that we should be able to do everything and do it efficiently using only 
*Buffer objects, and nothing else, but on Android this isn't the case, yet. So 
we're left to hack things up, for now. And your original proposal was optimal 
in that sense.

What would you think of the following then. We have only Indexer per type 
(ByteIndexer, ShortIndexer, etc) and they all accept a Buffer object of the 
corresponding type. But we also provide a flag to indicate whether we want to 
use it in "copying" mode. (BTW, it's possible to create a Pointer from a 
Buffer. That's what I'm doing in FFmpegFrameRecorder.java for example.) And 
eventually when Android gets better, this flag should become obsolete, but at 
least we won't be stuck with a hackish API.

How does that sound?

Original comment by samuel.a...@gmail.com on 16 Dec 2013 at 3:56

Changed state: Started

GoogleCodeExporter commented 9 years ago

Sounds good. But the normal use case for copying version would be the following:

  CopyingByteIndexer ix = (CopyingByteIndexer) ByteIndexer.createXxx(..., true);
  // doing work...
  // ix.copyBack();

If we would want to silently ignore the copying flag, we would not return 
CopyingByteIndexer instance anymore. So I recommend to make Copying- and 
Direct- subclasses package-visible (i.e. default visibility) and put the 
copyBack() to the abstract method, which would be no-op for Direct- version. It 
will be cleanly stated in the documentation, when we need to call copyBack().

I did tests with IntPointer/IntBuffers, here are the results:
                   (1)     (2)      (3)    
sumBuffer         1673    1781  1055676    
sumPointer       23049   22984   452716    
sumBufferCopy     4172    4168  1081054    
sumPointerCopy    3026    3026    40960    
sumIndexer        1612    5785  1180523    
sumIndexerCopy    4137    6935   302331    
cvSum             1399    1364     9222    
addBuffer         2325    2289  2119384    
addPointer       55452   48051   865838    
addBufferCopy     7648    7529  1084564    
addPointerCopy    4990    4744    45794    
addIndexer        2230    6889  2490893    
addIndexerCopy    7215    7943   448577    
cvAddS            1778    1870    11267    

Now the slowness of bulk buffer get is reproduced (the issue 11 problem). And 
pointer copy takes 2/3 of the time of buffer copy even on JDK, so it's always 
faster.

What's interesting, with bytes, the addPointerCopy was faster than addBuffer. 
With ints, it's the opposite. Now it's very tough to give a general rule of 
what's faster. I can think of these:
- on Android, always work on a copy. On JDK, do your own test
- work manually with a buffer or a copy for better speed
- use indexer for shorter but slower code

Tomorrow, I'll rewrite the code to always use buffer parameter. And what do you 
think about hiding the copying and direct subclasses? Anyway, I don't think we 
will be able to remove copying version anytime sooner than in 15-20 years :)

Viliam

Original comment by viliam.d...@gmail.com on 16 Dec 2013 at 5:54

GoogleCodeExporter commented 9 years ago

Well, I don't know about the subclasses either. In the case of a direct NIO 
buffer, we can copy its content to a non-direct NIO buffers. We would basically 
use the two-class pattern at the Buffer level. No need to duplicate it a level 
higher, if it's as efficient that way. I'm pretty sure that would be as 
efficient as an array, even on Android, but we should test this out.

BTW, when comparing with optimized functions like cvSum(), be sure to optimize 
the code in Java as well, by unrolling loops, etc:
http://stackoverflow.com/questions/10784951/do-any-jvms-jit-compilers-generate-c
ode-that-uses-vectorized-floating-point-ins/10809123#10809123

Original comment by samuel.a...@gmail.com on 16 Dec 2013 at 6:12

GoogleCodeExporter commented 9 years ago

Not sure about your 15~20 years estimate either 
http://www.extremetech.com/computing/170677-android-art-google-finally-moves-to-
replace-dalvik-to-boost-performance-and-battery-life
:)

Original comment by samuel.a...@gmail.com on 16 Dec 2013 at 6:19

GoogleCodeExporter commented 9 years ago

Here are results for Android with unrolled loop, compared to C

(1) android 2.2 1-core 1GHz with unrolled loops
(2) same as above, without unrolling

                   (1)      (2)
sumBuffer       981770  1055676    
sumPointerCopy   31024    40960    
cvSum             9002        -

As you see, a bit better, but not a lot. Maybe the Android's JIT does automatic 
unrolling, but still not comparable to C. So still no reason to use java on 
android if you need performance. I checked the source of cvSum, they explicitly 
unroll 4 times in case of 1-channel images. In others, they just do simple loop 
(and the compiler might do automatic unroll).

I also tried the the unrolled version on HotSpot, the difference is negligible:

(1) with unrolled loops, jdk 1.6.0_41-64bit, athlon x2 64, 2-core 3GHz, run 
with -server switch
(2) same as above, without unrolling

                   (1)      (2)
sumBuffer         1696     1673
sumPointerCopy    2950     3026    
cvSum             1453        -

Note, that I measure the the times as an average of 5 runs, after 15 "heat up" 
runs, with the -server JVM switch (-server causes lower threshold for the JIT 
to take place). Try to do this with your example at 
http://stackoverflow.com/a/10809123/952135, probably openjdk's JIT does not 
unroll automatically.

Copying still helps a lot, even on android 4.4 (see comment 35). I will test 
the ART and will let you know, maybe then we can disable copying version in 10 
years, when non-art is obsolete...

Tomorrow, I'll rewrite indexers to your previous notes.

Original comment by viliam.d...@gmail.com on 18 Dec 2013 at 7:44

GoogleCodeExporter commented 9 years ago

Well in any case, I think we should compare with the same loop in C++ instead 
of calling cvSum(), just to be fair.

To be clear, I see the constructor of ByteIndexer doing something like this:
ByteIndexer(Buffer buffer, boolean cached) {
    if (cached) {
        this.buffer = this.cache = ByteBuffer.allocate(buffer.size());
        this.pointer = new Pointer(buffer);
    } else {
        this.buffer = buffer;
    }
}

With a couple of helper methods:
void fillCache() {
    pointer.get(cache.array());
}
void flushCache() {
    pointer.put(cache.array());
}

Seems like that would be the best things to do...

Original comment by samuel.a...@gmail.com on 18 Dec 2013 at 11:52

GoogleCodeExporter commented 9 years ago

Tried to implement suggestions from comment 46, but failed due to slower 
performance:

1. Using heap buffer instead of array is slower:

(1) Copying version with buffer
(2) Copying version with array

                   (1)     (2)
sumIndexerCopy    6348    4137    
addIndexerCopy   10934    7215    

Tested only on HotSpot 1.6, but I have not doubt it will be slower on Android

2. Using buffer to initialize indexers is slower in case of many small 
matrices. It does unnecessary conversion from pointer to buffer and back to 
pointer. We can do about 1000 such conversions per ms on my 3GHz Athlon x2. On 
android, it will be definitely longer.

CvMat (and probably all JavaCPP-mapped classes) return their arrays as 
Pointers. I don't think the API from latest patch is "hackish". It is part of 
JavaCPP and supposed to simplify iterating multidimensional arrays, and it is 
created from pointer, that is commonly used by JavaCPP.

3. I renamed copyBack() to flushCache(), and added fillCache() method, moved 
that to the base class and renamed the "copying" flag to "cached". Made 
subclasses package-visible.

4. I added JUnit test.

Original comment by viliam.d...@gmail.com on 19 Dec 2013 at 8:17

Attachments:

patches.zip

GoogleCodeExporter commented 9 years ago

Could you provide the code that gave you these results? (BTW, no need to update 
all your classes until we decide on the final API...)

Running the attached TestIndexer on my gear here (java version "1.7.0_45" 
OpenJDK fedora-2.4.3.0.fc19-x86_64 u45-b15) outputs this:

-249231677 171
-249231677 209
-249231677 219
-249231677 223
-249231677 233
-249231677 223
-249231677 167
-249231677 167

Doesn't look like there's a clear winner: Both are fast. What do you get?

Original comment by samuel.a...@gmail.com on 20 Dec 2013 at 2:38

Attachments:

TestIndexer.java

GoogleCodeExporter commented 9 years ago

Try to run more iterations, and add -server switch to the jvm (don't know if 
openjdk supports it). The aim is to JIT to take place. I dont have my code, I 
rewrote the indexer and ran test from comment 34. After seeing that it is 
slower, I reverted it.

Original comment by viliam.d...@gmail.com on 20 Dec 2013 at 5:48

GoogleCodeExporter commented 9 years ago

[deleted comment]

JenyaKirmiza / javacv

Cached value of CvMat.fullSize and *Buffer fields might get incorrect #317