caprica / picam

Unofficial Java API library for the Raspberry Pi camera.
GNU General Public License v3.0
49 stars 11 forks source link

Library occasionally hangs on `takePicture` #9

Closed n3o59hf closed 5 years ago

n3o59hf commented 5 years ago

Could be related to #2.

Camera configuration:

CameraConfiguration.cameraConfiguration()
                .width(1920)
                .height(1080)
                .automaticWhiteBalance(AutomaticWhiteBalanceMode.AUTO)
                .exposureMode(ExposureMode.AUTO)
                .exposureMeteringMode(ExposureMeteringMode.MATRIX)
                .encoding(Encoding.JPEG)
                .quality(85)

Happens rarely but more than once per day if taking photos constantly. Hardware: Raspberry Pi 2B, revision a01041, Pi Camera rev v1.3. Software: Raspbian 9.6, fresh install with all updates.

Restarting application helps. Waiting does not help (tested > 1h). Calling close does not help. Requesting one more image on different thread does not help.

Trace logs attached, around timestamp 741834 library did not return image. timelapse_logs.txt

caprica commented 5 years ago

I am not sure it's related to #2 because in your case you don't get even the first callback. During my investigations into #2 whilst it was possible to hang before the first callback, I never saw that occur once during thousands of tests. The failure was always in a later callback.

Anyway, it is not clear to me how your application is structured because you mention threads, you don't have potentially multiple threads using the camera at the same time do you?

There's no other app or process using the camera either?

A JVM memory dump might be useful here, but I doubt it.

A JVM native thread dump would probably be quite useful.

n3o59hf commented 5 years ago

Camera access is behind mutex so there shouldn't be multiple threads accessing camera at the same time. Thread that calls takePicture can be different each time. No other app or process use camera, fresh Raspbian installation with only one java app and ssh daemon running.

Will see if I can make isolated case that causes the same issue and take memory and thread dump (will take some time).

caprica commented 5 years ago

Are you absolutely sure your own threading and synchronisation is not introducing races or deadlocks?

The harsh truth is I don't encounter the problem, so I'm not going to take the default position as the problem is in the library unless I see some clear evidence to the contrary. And even then, I dunno, I spent a LOT of time on #2 to get to a solution and I can't imagine where else I could go.

n3o59hf commented 5 years ago

With simplified application:

class Timelapse { // Line 11
    companion object {
        @JvmStatic
        fun main(args: Array<String>) {
            val camera = Camera(
                CameraConfiguration.cameraConfiguration()
                    .width(1920)
                    .height(1080)
                    .automaticWhiteBalance(AutomaticWhiteBalanceMode.AUTO)
                    .exposureMode(ExposureMode.AUTO)
                    .exposureMeteringMode(ExposureMeteringMode.MATRIX)
                    .encoding(Encoding.JPEG)
                    .quality(85)
            )

            while (true) {
                val captureHandler = ByteArrayPictureCaptureHandler()
                camera.takePicture(captureHandler)
            }
        }
    }
}

hang after ~ 14 hours of constant running.

Log appears to be the same:

51823317 [MMALCallback] [TRACE] uk.co.caprica.picam.EncoderBufferCallback - nextBuffer=MMAL_BUFFER_HEADER_T(native@0x65068ea8) (56 bytes) {
  MMAL_BUFFER_HEADER_T$ByReference next@0=null
  Pointer priv@4=native@0x65068f08
  int cmd@8=0
  Pointer data@c=native@0x6506a0e0
  int alloc_size@10=14000
  int length@14=0
  int offset@18=0
  int flags@1c=0
  long pts@20=8000000000000000
  long dts@28=8000000000000000
  MMAL_BUFFER_HEADER_TYPE_SPECIFIC_T$ByReference type@30=MMAL_BUFFER_HEADER_TYPE_SPECIFIC_T$ByReference(native@0x65068ee0) (40 bytes) {...}
  Pointer user_data@34=null
}
51823317 [MMALCallback] [DEBUG] uk.co.caprica.picam.EncoderBufferCallback - result=0
51823317 [MMALCallback] [DEBUG] uk.co.caprica.picam.EncoderBufferCallback - finished=true
51823317 [MMALCallback] [DEBUG] uk.co.caprica.picam.EncoderBufferCallback - signal capture complete
51823318 [main] [INFO] uk.co.caprica.picam.Camera - Capture completed
51823318 [main] [INFO] uk.co.caprica.picam.Camera - <<< End Take Picture <<<
51823318 [main] [INFO] uk.co.caprica.picam.Camera - >>> Begin Take Picture >>>
51823318 [main] [DEBUG] uk.co.caprica.picam.Camera - takePicture()
51823318 [main] [INFO] uk.co.caprica.picam.Camera - Preparing to capture...
51823318 [main] [DEBUG] uk.co.caprica.picam.Camera - delay=5
51823323 [main] [DEBUG] uk.co.caprica.picam.Camera - startCapture()
51823323 [main] [DEBUG] uk.co.caprica.picam.MmalParameterUtils - mmal_port_parameter_set_boolean(id=65553,enable=1)
51823324 [main] [TRACE] uk.co.caprica.picam.MmalParameterUtils - param=MMAL_PARAMETER_BOOLEAN_T(auto-allocated@0x65086f88 (12 bytes)) {
  MMAL_PARAMETER_HEADER_T hdr@0=MMAL_PARAMETER_HEADER_T(allocated@0x65086f88 (8 bytes) (shared from auto-allocated@0x65086f88 (12 bytes))) {
    int id@0=10011
    int size@4=c
  }
  int enable@8=1
}
51823326 [main] [DEBUG] uk.co.caprica.picam.Camera - result=0
51823326 [main] [INFO] uk.co.caprica.picam.Camera - Capture started
51823326 [main] [DEBUG] uk.co.caprica.picam.Camera - wait for capture to complete

Here's thread dump:

2019-01-14 17:44:58
Full thread dump Java HotSpot(TM) Client VM (25.65-b01 mixed mode):

"Attach Listener" #10 daemon prio=9 os_prio=0 tid=0x65517800 nid=0x7b1f runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"MMALCallback" #9 daemon prio=5 os_prio=0 tid=0x65500800 nid=0x753a runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Service Thread" #6 daemon prio=9 os_prio=0 tid=0x76183000 nid=0x7535 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"C1 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x7617fc00 nid=0x7534 waiting on condition [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x7617e400 nid=0x7533 runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x76162000 nid=0x7532 in Object.wait() [0x65ba7000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
    - locked <0x6ada0aa0> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

   Locked ownable synchronizers:
    - None

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x76160800 nid=0x7531 in Object.wait() [0x65bf7000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:157)
    - locked <0x6ada0c40> (a java.lang.ref.Reference$Lock)

   Locked ownable synchronizers:
    - None

"main" #1 prio=5 os_prio=0 tid=0x76107400 nid=0x752f waiting on condition [0x762e0000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x6add6c50> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
    at uk.co.caprica.picam.EncoderBufferCallback.waitForCaptureToFinish(EncoderBufferCallback.java:56)
    at uk.co.caprica.picam.Camera.takePicture(Camera.java:158)
    at lv.n3o.rpicamera.Timelapse$Companion.main(Timelapse.kt:28)
    at lv.n3o.rpicamera.Timelapse.main(Timelapse.kt)

   Locked ownable synchronizers:
    - None

"VM Thread" os_prio=0 tid=0x7615d400 nid=0x7530 runnable 

"VM Periodic Task Thread" os_prio=0 tid=0x76185000 nid=0x7536 waiting on condition 

JNI global references: 273

heap_dump.zip

caprica commented 5 years ago

Presumably this is still a problem?

caprica commented 5 years ago

This call is what is supposed to trigger the capture:

private void startCapture() {
    logger.debug("startCapture()");

    int result = mmal_port_parameter_set_boolean(cameraCapturePort, MMAL_PARAMETER_CAPTURE, 1);
     logger.debug("result={}", result);

    if (result != MMAL_SUCCESS) {
        throw new RuntimeException("Failed to start capture");
    }

    logger.info("Capture started");
}

From the posted log, that startCapture() call succeeds:

51823323 [main] [DEBUG] uk.co.caprica.picam.Camera - startCapture()
51823323 [main] [DEBUG] uk.co.caprica.picam.MmalParameterUtils - mmal_port_parameter_set_boolean(id=65553,enable=1)
51823326 [main] [DEBUG] uk.co.caprica.picam.Camera - result=0
51823326 [main] [INFO] uk.co.caprica.picam.Camera - Capture started

The next thing that happens, in the same thread, is this:

try {
    logger.debug("wait for capture to complete");
    encoderBufferCallback.waitForCaptureToFinish();
    logger.info("Capture completed");
}

Again, from the posted log, we see...

51823326 [main] [DEBUG] uk.co.caprica.picam.Camera - wait for capture to complete

The method call waitForCaptureToFInish acquires a semaphore used to synchronise completion of the capture, this is shown in the thread dump:

"main" #1 prio=5 os_prio=0 tid=0x76107400 nid=0x752f waiting on condition [0x762e0000]
   java.lang.Thread.State: WAITING (parking)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
    at uk.co.caprica.picam.EncoderBufferCallback.waitForCaptureToFinish(EncoderBufferCallback.java:56)
    at uk.co.caprica.picam.Camera.takePicture(Camera.java:158)

The problem here is that the initial native call to trigger the capture, which returns a success indicator, does not appear to invoke the capture callback at all, not even once.

This is the thread that handles the native callback, it is RUNNABLE - meaning it is not blocked, it is simply not being triggered by native code after out request.

"MMALCallback" #9 daemon prio=5 os_prio=0 tid=0x65500800 nid=0x753a runnable [0x00000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

Why might this happen?

  1. I suppose it's possible that the Java callback object somehow went out of scope and got garbage collected, but this would ordinarily result in a fatal JVM crash that would kill the entire application, this is not what is happening
  2. something is going wrong in the MMAL layer - the native trigger to start the capture returns success, so something has gone wrong after that initial trigger (outside of the Java code, we're in the native MMAL code now)
  3. something has gone wrong with the camera/driver

For points 2 and 3, it might be possible to get some clues from the native MMAL logging.

As far as this Java library is concerned, rather than having the semaphore acquire wait indefinitely, it could be implemented with a timeout, and then throw a failure, hoping that the next capture call will work.

This might be worth doing, but I really don't like adding defensive code like this - defensive code often hides bugs that should otherwise be fixed.

Now, even if I add that timeout on the semaphore acquire (which I guess I would make configurable), this would stop the application hanging, but wouldn't solve the underlying issue.

But I really don't see where else I can go in the Java code - surely this problem is somewhere else.

caprica commented 5 years ago

Closing pending any follow-up, I really don't know what more I can do from the Java side.

If more information comes to light, this can be reopened.

jcondon1 commented 5 years ago

I'm also seeing a similar problem. I have a Java program running on Pi 3 B+ rel. 2018-11-13, that constantly takes pictures via call to camera.takePicture() using the ByteArrayPictureCaptureHandler. The program takes about 3 pictures per second and runs 7 hours a day. I close and re-open the camera object every 10 minutes and take a picture with 2 second delay to encourage exposure calibration. The application runs fine some days but occasionally hangs in the call to takePicture(). Sorry I don't have more information for you but I am seeing a similar problem as n3o59hf. Note: prior to using your camera library I used the openCV camera routine for a couple years which never hung but didn't take very good quality pictures. I like the picture quality I get from your library and the raspistill program but the raspistill program is way to slow for my use.

caprica commented 5 years ago

Need native MMAL logs, frankly.

jcondon1 commented 5 years ago

I'm not a very sophisticated user so not sure how to get what you asked for but after killing the application and then doing "sudo vcdbg log msg", the tail of the output shows the following, not sure if this is of any help:

2579380.937: mmal: mmal_port_disable_internal: ril.camera:out:2(OPQV) waiting for 0 buffers left in transit 2579380.991: mmal: mmal_port_disable_internal: ril.camera:out:2(OPQV) has no buffers left in transit 2579383.057: mmal: mmal_port_disable_internal: ril.image_encode:in:0(OPQV) waiting for 0 buffers left in transit 2579383.100: mmal: mmal_port_disable_internal: ril.image_encode:in:0(OPQV) has no buffers left in transit 2579389.080: mmalsrv: send_buffer_to_host: tx failed:size 292 st -1 2579389.134: mmal: mmal_port_disable_internal: ril.image_encode:out:0(JPEG) waiting for 0 buffers left in transit 2579389.185: mmal: mmal_port_disable_internal: ril.image_encode:out:0(JPEG) has no buffers left in transit 2579389.239: mmal: mmal_port_disconnect: ril.image_encode:out:0(JPEG)(3fbaec40) is not connected 2579389.471: mmal: mmal_port_disable_internal: ril.camera:ctr:0 waiting for 0 buffers left in transit 2579389.522: mmal: mmal_port_disable_internal: ril.camera:ctr:0 has no buffers left in transit 2579389.573: mmal: mmal_port_disconnect: ril.camera:out:0(3f7349a0) is not connected 2579389.617: mmal: mmal_port_disconnect: ril.camera:out:1(3f735900) is not connected 2579389.665: mmal: mmal_port_disconnect: ril.camera:out:2(OPQV)(3f736860) is not connected 2579389.707: mmal: mmal_port_disconnect: ril.camera:clk:0(3fbb1860) is not connected 2579440.772: gpioman: gpioman_get_pin_num: pin CAMERA_LED not defined

caprica commented 5 years ago

I appreciate the trouble you went to to get that log.

However, the essential problem is to find out what happens here:

int result = mmal_port_parameter_set_boolean(cameraCapturePort, MMAL_PARAMETER_CAPTURE, 1);

This returns a success indicator, but the native callback that provides the picture data never triggers in the failing scenario.

So the useful part of the MMAL log is to see what happens around that particular call.

The part of the log posted looks like cleanup (mmal_port_disable), which is probably too late, although 2579389.080: mmalsrv: send_buffer_to_host: tx failed:size 292 st -1 looks suspicious.

caprica commented 5 years ago

I still don't know what I can do about the underlying problem, but I went ahead and added some defensive code at b6f7426eeb86acb9b164d2fbce9afb0f40f48ce2.

The takePicture method will now throw a CaptureFailedException (with a root-cause exception if there is one) if you set a capture timeout on the camera configuration and that timeout expires while waiting for the capture to complete.

The default timeout value is -1, which means just wait forever (which was the previous behaviour).

If the timeout value is > -1, the thread will wait for the capture to complete or for the timeout to expire.

The timeout is a number of milliseconds.

jcondon1 commented 5 years ago

Thanks for all the effort, really appreciate it. I'm currently using picam-0.1.1.jar I downloaded from https://jar-download.com/artifacts/uk.co.caprica/picam . Would it be possible to provide a JAR file with the added defensive code you mentioned? If so I will install it and test. Sorry if I'm being a pain.

caprica commented 5 years ago

Well... git clone and mvn install will get you your jar.

But here's a snapshot.

I haven't tested this in anger yet.

picam-0.1.4-SNAPSHOT.zip

Rename the file to .jar rather than .zip when you download it (Github doesn't allow .jar files to be attached to posts)

caprica commented 5 years ago

Remember this doesn't really fix anything, it just allows you to hopefully detect the failure case and throw the camera instance away.

jcondon1 commented 5 years ago

Understand, the failure happens so rarely, losing one picture will not be a problem, Thanks!

jcondon1 commented 5 years ago

Have installed the new JAR and it appears to be working OK. For some reason I now HAVE to specify a stereoscopicMode and shutterSpeed in the camera config statement where before I was allowing those options to default but not a problem once I figured out the issues. I also verified I get an ABException e = java.lang.NullPointerException when I specify a real short timeout. May take a day or two before I get a real hang. Thanks again.

caprica commented 5 years ago

I'll make stereoscopic mode and shutterspeed optional in the next version.

The thing about the exception you mention, a short timeout obviously means the timeout is more likely to occur. And if the timeout occurs but the camera has not hung, then other code in other threads may still be executing. So I'm not so troubled by an exception in such cases.

But if you could post the exception stack trace we can make sure it's not some other problem.

jcondon1 commented 5 years ago

Will do.

jcondon1 commented 5 years ago

Have been running my app. for 3 days, 7 hours per day, approx. 75k pictures per day. I am taking pictures at resolution 1280 by 720 with a 3 second timeout. After 3 1/2 days, the timeout finally occurred, see below stack trace.

uk.co.caprica.picam.CaptureFailedException: uk.co.caprica.picam.CaptureTimeoutException at uk.co.caprica.picam.Camera.takePicture(Camera.java:182) at picamera.PiCamera.takePicture(PiCamera.java:829) at picamera.PiCamera.(PiCamera.java:278) at picamera.PiCamera.main(PiCamera.java:227) Caused by: uk.co.caprica.picam.CaptureTimeoutException at uk.co.caprica.picam.EncoderBufferCallback.waitForCaptureToFinish(EncoderBufferCallback.java:60) at uk.co.caprica.picam.Camera.takePicture(Camera.java:178) ... 3 more

My app. received the exception, closed the Camera object and re-opened it without any problems. Don't know if the stack trace is of any value to you but I'm pretty happy that my app. no longer hangs when this error occurs and I can easily recover. Let me know if there is anything else I can do. Thanks for providing the exception code.

caprica commented 5 years ago

I was actually more interested in the exceptions you mentioned when you have a short timeout.

However, congrats on your 3.5 day test run!

If you can detect this condition and easily recover from it, then I am very happy because that was precisely the intent of the recent changes.

Anyway, thanks a lot for testing and reporting back.

We still have no real clue what the underlying issue is, but at least here there is something you can work with.

AlexanderBykin commented 5 years ago

Well... git clone and mvn install will get you your jar.

But here's a snapshot.

I haven't tested this in anger yet.

picam-0.1.4-SNAPSHOT.zip

Rename the file to .jar rather than .zip when you download it (Github doesn't allow .jar files to be attached to posts)

hi, try to use JitPack it's really nice Free service that build and publish your project's JAR. i'm personally already use them.

caprica commented 5 years ago

Thanks for the tip, but that maven suggestion was only because it was a development snapshot - I usually release everything through Maven Central, but I don't publish snapshots there.

However, your message did remind me to push a new release out :)

AlexanderBykin commented 5 years ago

@caprica JitPack have nice features like: at any time, i can take a dependency on any of your commit or release without any of your additional actions

caprica commented 5 years ago

0.1.4 released to Maven Central today: https://search.maven.org/search?q=picam

caprica commented 5 years ago

@jcondon1 is your application single-threaded, or do you have multiple threads sharing the camera?

jcondon1 commented 5 years ago

Single threaded, camera is not being shared. Actually the app can send emails via a separate thread but the Camera is ONLY accessed by the main app thread.

caprica commented 5 years ago

So I ran a test, it wasn't 14 hours or 2.5 days or whatever, but it took 100,000 continuous pictures without any problem.

Honestly, that's good enough for me.

If someone can find some new information that helps track down the underlying problem here we can open a new issue and I'll take a look at it.

IMO there's a higher likelihood of the problem being somewhere in the native code, drivers, even hardware, there's just not enough to go on to take this any further.

AlexanderBykin commented 5 years ago

@caprica can you please create Wiki page with explanation how to collect needed debug information that next will be sent to you?

caprica commented 5 years ago

So a footnote...

I'm into the 14th hour of my test using the simplified test case posted above. I have nearly 300,000 (yes, three hundred thousand) successful captures.

And it's still going. No sign of any problem whatsoever.

jcondon1 commented 5 years ago

I just checked and my app got a hang about 1 hour ago but it recovered ok so no problem.

I’m surprised you are not able to reproduce a hang. I’m using the same hardware, V2 camera and same Raspbian OS I have used for many months with the opencv java library to take pictures and it has never hung, however the picture quality was not very good.

Wonder if the problem may be parameter sensitive, the config I am using is:

CameraConfiguration config = CameraConfiguration.cameraConfiguration() .captureTimeout(3000) // 3 second timeout .width(1280) .height(720) .delay(1000) // changed to 20 after first exposure .encoding(Encoding.JPEG) .quality(85);

After I create the camera object I am looping on: BufferedImage bImage = takePicture(camera, pictureCaptureHandler);

/-------------/ / takePicture / /-------------/ private BufferedImage takePicture(Camera camera, PictureCaptureHandler pictureCaptureHandler) { BufferedImage bImage = null;

try { camera.takePicture(pictureCaptureHandler); ByteArrayOutputStream ba = (ByteArrayOutputStream)pictureCaptureHandler.result(); byte[] data = ba.toByteArray(); ByteArrayInputStream bis = new ByteArrayInputStream(data); bImage = ImageIO.read(bis); } catch(Exception e) { System.out.println("Exception e = " + e); e.printStackTrace(); } return bImage; }

caprica commented 5 years ago

Now over 14.5 hours of continuous captures with no problem. 310,000+ captures. I'll run 30 more minutes and then I'll stop.

Honestly though I'm done with this issue, I've spent many hours on it, unless and until someone supplies clear evidence of a problem in this library.

caprica commented 5 years ago

And I got a 2.3gb log file to prove it :)

caprica commented 5 years ago

An interesting addendum to this issue...

I now have a pure JNI implementation, again it follows Raspistill pretty closely, and I actually witnessed a similar hang as has been described here at just over 3 hours of constant picture captures.

The only thing in the log around the time of the failure that looked wrong was this line:

2425270.732: mmalsrv: send_buffer_to_host: tx failed:size 292 st -1

...which appears earlier in this discussion too.

This suggests a problem in the underlying native code to me, it would certainly be remarkable although admittedly not impossible for two separate implementations in two different languages to end up with a similar issue.

caprica commented 5 years ago

Maybe that mmalsrv log message is a red-herring as during a test this morning I have that in my log with no camera hang. I may be mistaken about this.

caprica commented 5 years ago

I am going to open a new issue to track this, see #14.