OpenSkyProject / OpenSkyImager

OpenSkyImager is a capture program written for Astronomy camera operation
GNU General Public License v3.0

Segmentation fault during start of recording #17

Open nikiosna opened 7 years ago

nikiosna commented 7 years ago

I get a segmentation fault when I try to start recording or focus capture. Starting the program and connecting to my QHY5L-II color camera works fine. The error occurs only on my laptop, not on my PC (both run Mint 18.1). The only output is:

    niklas@niklas-W230SD ~ $ /usr/local/bin/OpenSkyImager/gtkImager
    Detected 4 cpu cores
    GTK M: 3, m: 18, u: 9
    Speicherzugriffsfehler

"Speicherzugriffsfehler" means "segmentation fault" or "memory access error". Is there a way to get more detailed debugging information?

Possibly a log from the compilation would help.

nikiosna commented 7 years ago

I created a log with valgrind, which may be helpful: valgrind.txt (https://github.com/OpenSkyProject/OpenSkyImager/files/783379/valgrind.txt). It says:

    Process terminating with default action of signal 11 (SIGSEGV)
    Access not within mapped region at address 0x0

OpenSkyProject commented 7 years ago

Hello nikiosna, sorry for the long delay. I'm having a big load of chores lately, both family and work wise.

My first reaction on reading your report is that if the same program and camera behave differently on the two machines, there may be some setting on one of them that prevents clean operation. The first thing that comes to mind: make sure that your user is in the "video" group. I'm assuming you're using the udev rules file supplied with the program. If in doubt, try running the program with sudo and see if it makes any difference when reading image data from the camera.

The QHY5L-II is quite a tricky camera to deal with. I own one, so I know. I'm not sure if it's because of the rolling shutter or some "feature" buried in the QHY firmware, but it has been a nightmare for whoever developed code against it. I talked about this with a few people at the time of development.

Since this is a QHY5L-II, I suggest that you start focus mode using the slowest speed settings, that is "slow download speed" and "usb speed 255".

If you already pursued the above path without success, please accept my apologies.

Looking at your log (great work that you did, thank you!):

==2890== Invalid write of size 8
==2890==    at 0x4609AC: list_del (libusbi.h:130)
==2890==    by 0x4609AC: usbi_handle_transfer_completion (io.c:1574)
==2890==    by 0x4662A7: handle_bulk_completion (linux_usbfs.c:2433)
==2890==    by 0x4662A7: reap_for_handle (linux_usbfs.c:2652)
==2890==    by 0x4662A7: op_handle_events (linux_usbfs.c:2704)
==2890==    by 0x460723: handle_events (io.c:2089)
==2890==    by 0x4612CE: libusb_handle_events_timeout_completed (io.c:2174)
==2890==    by 0x46140F: libusb_handle_events_completed (io.c:2273)
==2890==    by 0x461B90: sync_transfer_wait_for_completion (sync.c:50)
==2890==    by 0x461C65: do_sync_bulk_transfer (sync.c:179)
==2890==    by 0x461FEE: libusb_bulk_transfer (sync.c:256)
==2890==    by 0x438EFE: qhy_getImgData_align (in /usr/local/bin/OpenSkyImager/gtkImager)
==2890==    by 0x41BC8E: imgcam_readout_ext (in /usr/local/bin/OpenSkyImager/gtkImager)
==2890==    by 0x41BBA7: imgcam_readout (in /usr/local/bin/OpenSkyImager/gtkImager)
==2890==    by 0x43EFFB: thd_capture_run (in /usr/local/bin/OpenSkyImager/gtkImager)
==2890==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

I see that "thd_capture_run", "imgcam_readout", "imgcam_readout_ext" and "qhy_getImgData_align" are part of my code; all the other calls look like libusb functions.
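As a side note on reading the top frame of the trace: an "Invalid write of size 8" at address 0x0 inside a list-removal helper is the pattern one would expect when a list node is unlinked while its neighbour pointers are already NULL, for example if the same node were removed twice. Whether that is what happens inside libusb here is not established by the trace. The sketch below is a simplified, hypothetical intrusive list (the names `node`, `list_add` and `list_del_checked` are made up for illustration and are not libusb code) that shows the failure mode and a guarded variant.

```c
#include <stddef.h>

/* Minimal intrusive doubly linked list node (hypothetical, for illustration). */
struct node {
    struct node *prev;
    struct node *next;
};

/* Make an empty circular list: the head points at itself. */
static void list_init(struct node *head)
{
    head->prev = head;
    head->next = head;
}

/* Link entry directly after head. */
static void list_add(struct node *entry, struct node *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Unlink entry and clear its pointers.
 * Returns 0 on success, -1 if the entry is not currently linked.
 * An unchecked version would run "entry->next->prev = ..." with
 * entry->next == NULL on a second call: a pointer-sized write at
 * (or near) address 0, which is what a tool like valgrind reports.
 */
static int list_del_checked(struct node *entry)
{
    if (entry->next == NULL || entry->prev == NULL)
        return -1;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;
    entry->prev = NULL;
    return 0;
}
```

Calling `list_del_checked()` twice on the same node returns 0 the first time and -1 the second, instead of crashing.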

The "qhy_getImgData_align" function is a special readout for the QHY5L-II that was needed to sync frame transfer and avoid data corruption and discarded frames. It is possibly needed because of the rolling shutter that this camera has and the old QHY5 doesn't. Please try adding a few printf debug calls and see if you can spot anything wrong with it. I inherited that code from lin_guider after long testing showed it to be stable and effective. However, as you know, things sometimes slip through even the most careful test phase. The function is in "core/qhycore.c". If you're positive the offending code is in there but you still can't find where, you may even try not using it at all (as versions before 0.9.5 did).

Functions "imgcam_readout" and "imgcam_readout_ext" can be found in "core/ingcamio.c". Function "thd_capture_run" can be found in "gtk/imgWFuncs.c".

Since the above block from your log contains a lot of libusb calls, you may also want (as a last resort) to test whether using the vanilla libusb that you have on your machine makes any difference.

I had to add this custom version of libusb in version 0.8.13 because of what is described in Versions.md.

Please let me know your mileage with this.

Best Ciao Giampiero

nikiosna commented 7 years ago

A short status report: I checked that I'm in the "video" group and tested with root permissions, with no success. Then I added debug output to all four functions ("thd_capture_run", "imgcam_readout", "imgcam_readout_ext", "qhy_getImgData_align") and saw nothing. None of the outputs was reached. I tried to find the point where the code starts when I press the "Start" button, but I'm not sure I found it. Is it "cmd_run_click()" in "./gtk/imgWCallbacks.c"? I added a printf right at the beginning of this function and it wasn't shown either. The only point where I got any output was before the button click, in "cmd_run_build()" in "gtk/imgWindow.c".

OK, I found out why printf() didn't work: I have to use fprintf(stderr, "I will be printed immediately");.
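The behaviour described above is typical of stdio buffering: stdout is usually line-buffered on a terminal and fully buffered otherwise, and whatever is still sitting in the buffer is lost when the process dies from SIGSEGV, while stderr is not fully buffered. A minimal sketch of two ways to get crash-safe trace output follows; the helper names `debug_trace` and `make_stdout_unbuffered` are hypothetical and not part of OpenSkyImager.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Switch stdout to unbuffered mode so plain printf() output appears at once.
 * Call it once, before anything has been written to stdout.
 * Returns 0 on success.
 */
static int make_stdout_unbuffered(void)
{
    return setvbuf(stdout, NULL, _IONBF, 0);
}

/*
 * printf-style trace that goes to stderr, so it is visible even if the
 * program crashes right afterwards.
 * Returns the number of characters written, or a negative value on error.
 */
static int debug_trace(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vfprintf(stderr, fmt, ap);
    va_end(ap);
    return n;
}
```

For example, `debug_trace("pos: %d\n", pos);` placed just before the suspect call would still be printed if that call segfaults.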

nikiosna commented 7 years ago

I located the line where the bug is. It's in the lin_guider code, in "qhy_getImgData_align()", core/qhycore.c, line 641:

    ret = libusb_bulk_transfer( hDevice, endp.bulk, databuffer + pos, to_read, &transferred, ((to_read > 15000000) ? 60000: 40000));

I also tried the newest libusb. With it the program doesn't crash, but it shows no image ("Bad data received").

OpenSkyProject commented 7 years ago

Hello nikiosna,

> Then I added debug output in all four functions ( "thd_capture_run", "imgcam_readout", "imgcam_readout_ext", " qhy_getImgData_align") and saw nothing.

If you add a "printf" debug output, please be sure to end the string you want to print with "\n", or it may not show on the command line until the program ends.

> I located the line where the bug is. It's in the lin_guider code "qhy_getImgData_align() : core/qhycore.c : line 641"

Well, that's a call to the libusb function that gathers a chunk of data from the camera. You can see it's in a loop. Please try to print the values of the following before each bulk_read call:

  • pos
  • try_cnt
  • to_read

Then also print these after the bulk_read call:

  • transferred
  • ret
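The instrumentation described above might look roughly like the sketch below. It is not the code from core/qhycore.c: the loop shape, the `bulk_fn` callback type, `read_all_traced` and the two fake transfer functions are all hypothetical, introduced so the tracing can be tried without a camera attached. In the real function the callback would be replaced by the existing libusb_bulk_transfer() call.

```c
#include <stdio.h>

/* Hypothetical stand-in for the transfer call: 0 on success, negative on error. */
typedef int (*bulk_fn)(unsigned char *buf, int len, int *transferred);

/*
 * Read `total` bytes into buf in a retry loop, tracing every attempt to stderr.
 * Returns 0 when everything was read, otherwise the last error code
 * (or -1 if the retries ran out without an error code).
 */
static int read_all_traced(bulk_fn xfer, unsigned char *buf, int total, int max_tries)
{
    int pos = 0;
    int try_cnt = 0;
    int ret = 0;

    while (pos < total && try_cnt < max_tries) {
        int to_read = total - pos;
        int transferred = 0;

        fprintf(stderr, "before: pos=%d try_cnt=%d to_read=%d\n", pos, try_cnt, to_read);
        ret = xfer(buf + pos, to_read, &transferred);
        fprintf(stderr, "after:  transferred=%d ret=%d\n", transferred, ret);

        if (transferred > 0)
            pos += transferred;
        if (ret != 0 || transferred == 0)
            try_cnt++;   /* count only attempts that failed or made no progress */
    }
    if (pos == total)
        return 0;
    return (ret != 0) ? ret : -1;
}

/* Fake transfer for dry runs: delivers at most 4 bytes per call, always succeeds. */
static int fake_xfer_ok(unsigned char *buf, int len, int *transferred)
{
    int n = (len < 4) ? len : 4;
    for (int i = 0; i < n; i++)
        buf[i] = 0xAB;
    *transferred = n;
    return 0;
}

/* Fake transfer that always fails with -4 and moves no data. */
static int fake_xfer_fail(unsigned char *buf, int len, int *transferred)
{
    (void)buf;
    (void)len;
    *transferred = 0;
    return -4;
}
```

Writing the trace to stderr rather than stdout keeps the "before" line visible even if the transfer call itself crashes.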

The ret value is expected to be a number, one of the "libusb_error" codes listed here: http://libusb.sourceforge.net/api-1.0/group__misc.html If all is good, ret should always be 0; any other value is a sign of an error condition.

> I tried the newest libusb with these the programm don't crash but show no image

If you're not getting a segfault with the latest libusb, chances are that something has been improved to handle unexpected conditions. But since you're not getting an image, there's definitely something unusual or unexpected going on with the USB data transfer.
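For quick reading of numeric return values in a log, the codes can be turned into names. libusb itself provides libusb_error_name() for this; the sketch below is only a local mirror of the libusb-1.0 `libusb_error` values so that it compiles without the libusb headers. The table and function names `usb_errors` and `usb_err_name` are hypothetical.

```c
#include <stddef.h>

/* Local mirror of the libusb-1.0 error codes (see enum libusb_error in libusb.h). */
static const struct {
    int code;
    const char *name;
} usb_errors[] = {
    {   0, "LIBUSB_SUCCESS" },
    {  -1, "LIBUSB_ERROR_IO" },
    {  -2, "LIBUSB_ERROR_INVALID_PARAM" },
    {  -3, "LIBUSB_ERROR_ACCESS" },
    {  -4, "LIBUSB_ERROR_NO_DEVICE" },
    {  -5, "LIBUSB_ERROR_NOT_FOUND" },
    {  -6, "LIBUSB_ERROR_BUSY" },
    {  -7, "LIBUSB_ERROR_TIMEOUT" },
    {  -8, "LIBUSB_ERROR_OVERFLOW" },
    {  -9, "LIBUSB_ERROR_PIPE" },
    { -10, "LIBUSB_ERROR_INTERRUPTED" },
    { -11, "LIBUSB_ERROR_NO_MEM" },
    { -12, "LIBUSB_ERROR_NOT_SUPPORTED" },
    { -99, "LIBUSB_ERROR_OTHER" },
};

/* Return the symbolic name for a libusb return code, or "unknown". */
static const char *usb_err_name(int code)
{
    for (size_t i = 0; i < sizeof usb_errors / sizeof usb_errors[0]; i++) {
        if (usb_errors[i].code == code)
            return usb_errors[i].name;
    }
    return "unknown";
}
```

When linking against libusb anyway, calling libusb_error_name(ret) directly is the simpler choice.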

Good luck with your debug and please keep me posted.

Best Ciao Giampiero

nikiosna commented 7 years ago

I added this debug code:

    printf("pos: %d\n", pos);
    printf("try_cnt: %d\n", try_cnt);
    printf("to_read: %d\n", to_read);
    ret = libusb_bulk_transfer( hDevice, endp.bulk, databuffer + pos, to_read, &transferred, ((to_read > 15000000) ? 60000: 40000));
    printf("transferred: %d\n", transferred);
    printf("ret: %d\n", ret);
    printf("ret_name %s\n",libusb_error_name(ret));

With the custom libusb the output is:

    Detected 4 cpu cores
    GTK M: 3, m: 18, u: 9
    pos: 0
    try_cnt: 0
    to_read: 1228805
    Speicherzugriffsfehler

The newest libusb returns more:

    Detected 4 cpu cores
    GTK M: 3, m: 18, u: 9
    pos: 0
    try_cnt: 0
    to_read: 1228805
    transferred: 0
    ret: -8
    ret_name LIBUSB_ERROR_OVERFLOW
    pos: 0
    try_cnt: 1
    to_read: 1228805
    transferred: 0
    ret: -4
    ret_name LIBUSB_ERROR_NO_DEVICE
    pos: 0
    try_cnt: 2
    to_read: 1228805
    transferred: 0
    ret: -4
    ret_name LIBUSB_ERROR_NO_DEVICE
    pos: 0
    try_cnt: 3
    to_read: 1228805
    transferred: 0
    ret: -4
    ret_name LIBUSB_ERROR_NO_DEVICE

OpenSkyProject commented 7 years ago

Hello Niklas, if I read things right, it looks like the developers of the newer libusb managed to trap some sort of unexpected condition better. The bulk read does not segfault, but still no data can be gathered from the device. Also, after the first overflow (which really looks like a "trapped segmentation fault") the device appears to be no longer available. This may be "normal", as these cameras tend to stall if the data is not all read once the download has started.
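One thing that may be worth checking in connection with the overflow: the libusb documentation on packets and overflows states that an overflow cannot occur when the size of the receive buffer is a multiple of the endpoint's maximum packet size. The requested length in the log, 1228805, is 2400 * 512 + 5, so it is not a multiple of 512. Whether this has anything to do with the failure on this particular laptop is not established in this thread. The sketch below only shows the rounding; `round_up_to_packet` is a hypothetical helper, and 512 is an assumption (the maximum packet size of a USB 2.0 high-speed bulk endpoint), with the real value coming from the endpoint descriptor's wMaxPacketSize.

```c
#include <stddef.h>

/*
 * Round len up to the next multiple of packet_size.
 * Returns 0 if packet_size is 0 (invalid input).
 */
static size_t round_up_to_packet(size_t len, size_t packet_size)
{
    if (packet_size == 0)
        return 0;
    return ((len + packet_size - 1) / packet_size) * packet_size;
}
```

With the values from the log, `round_up_to_packet(1228805, 512)` gives 1229312. The destination buffer would have to be allocated at that rounded-up size as well, and the bytes beyond the expected frame length ignored.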

I will try and see if the patch to allow "big bulk read" is still applicable to the newest libusb or if developers addressed the "transfer limit" problem some other way. But still this would not solve your problem.

The whole thing brings us back to something weird / unusual / unexpected that is happening on the USB bus.

I hate to say so, but I think the best place to ask for help with this is the libusb support forum.

Please keep me posted with news, if you have any.

Ciao Giampiero

nikiosna commented 7 years ago

Strange: I tried reinstalling my whole laptop and tested lin_guider and INDI, and nothing can handle the camera on my laptop. But with INDI I can use the camera remotely through my Raspberry Pi.

OpenSkyProject commented 7 years ago

Hello Niklas, not sure I get everything right.

You mean you can get images when the camera is connected to your RasPi and you read the data from the laptop using INDI? If that's the case, then the image is read out on the RasPi, so you're not using your laptop's USB bus.

I guess that lin_guider and OSI will also run successfully on your RasPi if it has a full-blown distro with a GUI.

If you still need to use the camera on your laptop, I guess you have to investigate the issue with the libusb people. Maybe they can help you work out what's wrong by reading some logging (http://www.cs.unm.edu/~hjelmn/libusb_hotplug_api/index.html, see "Debug message logging").
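One way to get that logging without changing any libusb calls is the LIBUSB_DEBUG environment variable, which libusb reads when libusb_init() is called, provided the library was built with logging enabled. The variable can be set in the shell before starting the program, or from the program itself as in the sketch below. The helper name `request_libusb_logging` is hypothetical, and whether the libusb copy bundled with OpenSkyImager has logging compiled in is an assumption that would need checking.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict -std=c99/c11 */
#endif
#include <stdio.h>
#include <stdlib.h>

/*
 * Ask libusb for log output by setting LIBUSB_DEBUG for this process.
 * level: 0 = none, 1 = errors, 2 = warnings, 3 = info, 4 = debug.
 * Must be called before libusb_init() to take effect.
 * Returns 0 on success, -1 on an invalid level or if setenv() fails.
 */
static int request_libusb_logging(int level)
{
    char value[2];

    if (level < 0 || level > 4)
        return -1;
    snprintf(value, sizeof value, "%d", level);
    return setenv("LIBUSB_DEBUG", value, 1);
}
```

The log messages go to stderr, so they can be captured together with the program's own trace output.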

I'm really sorry I can't help more.

Ciao Giampiero
