OpenLightingProject / ola

The Open Lighting Architecture - The Travel Adaptor for the Lighting Industry
https://www.openlighting.org/ola/
Other
644 stars 204 forks source link

Occasional hangs when doing full discovery via the ENTTEC USBDMX Pro #341

Closed Drachenkaetzchen closed 9 years ago

Drachenkaetzchen commented 10 years ago

Hi,

I get occasional hangs when doing full discovery via the ENTTEC USBDMX Pro. The log shows the following:

user.info: Feb 19 16:00:38 olad: EnttecUsbProWidget.cpp:280: Full discovery triggered user.info: Feb 19 16:00:38 olad: EnttecUsbProWidget.cpp:324: Un-muting all devices user.info: Feb 19 16:00:38 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0000:00000000 - ffff:ffffffff

After that, it hangs forever. I see several "break too short" messages using the logic sniffer. I need to restart the OLA daemon in order to use OLA again. The issue might be caused by noise on the line or a timing issue with one of my RDM responders, however, OLA should handle this gracefully and not simply hang forever.

nomis52 commented 10 years ago

What version is this running ? Can you please try with the git version.

Drachenkaetzchen commented 10 years ago

This is the latest git version as of today.

Drachenkaetzchen commented 10 years ago

Another hang occured today:

user.info: Feb 24 12:09:07 olad: Universe.cpp:499: Full RDM discovery triggered for universe 0
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:280: Full discovery triggered
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:324: Un-muting all devices
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0000:00000000 - ffff:ffffffff
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:396:  collision, spliting into: 0000:00000000 - 7fff:ffffffff , 8000:00000000 - ffff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 8000:00000000 - ffff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0000:00000000 - 7fff:ffffffff
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:396:  collision, spliting into: 0000:00000000 - 3fff:ffffffff , 4000:00000000 - 7fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 4000:00000000 - 7fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0000:00000000 - 3fff:ffffffff
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:396:  collision, spliting into: 0000:00000000 - 1fff:ffffffff , 2000:00000000 - 3fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 2000:00000000 - 3fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0000:00000000 - 1fff:ffffffff
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:396:  collision, spliting into: 0000:00000000 - 0fff:ffffffff , 1000:00000000 - 1fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 1000:00000000 - 1fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0000:00000000 - 0fff:ffffffff
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:396:  collision, spliting into: 0000:00000000 - 07ff:ffffffff , 0800:00000000 - 0fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0800:00000000 - 0fff:ffffffff
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:261: preamble 4 fc
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:396:  collision, spliting into: 0800:00000000 - 0bff:ffffffff , 0c00:00000000 - 0fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0c00:00000000 - 0fff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0800:00000000 - 0bff:ffffffff
user.info: Feb 24 12:09:07 olad: DiscoveryAgent.cpp:396:  collision, spliting into: 0800:00000000 - 09ff:ffffffff , 0a00:00000000 - 0bff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0a00:00000000 - 0bff:ffffffff
user.info: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0800:00000000 - 09ff:ffffffff
user.warn: Feb 24 12:09:07 olad: EnttecUsbProWidget.cpp:476: multiple discovery responses received, ignoring all but the first.

strace on the ola daemon reveals the following lines, repeating:

select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {1, 999832}) = 1 (in [15], left {0, 830559})
ioctl(15, FIONREAD, [22])               = 0
read(15, "\22\0\0\20", 4)               = 4
read(15, "\10\1\20H\32\nGetPlugins\"\0", 18) = 18
write(18, "\r\1\0\20\10\2\20H\"\206\2\n\v\10\1\22\5Dummy\30\1\n\f\10\2\22\6Ar"..., 273) = 273
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {0, 830200}) = 1 (in [15], left {0, 830196})
ioctl(15, FIONREAD, [27])               = 0
read(15, "\27\0\0\20", 4)               = 4
read(15, "\10\1\20I\32\17GetUniverseInfo\"\0", 23) = 23
write(18, "\36\0\0\20\10\2\20I\"\30\n\26\10\0\22\nUniverse 0\30\2 \0(\1"..., 34) = 34
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {0, 830022}) = 0 (Timeout)
sendto(25, "\1\0\274\256\305\n\\l\0\5\5\1\2\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 237, 0, {sa_family=AF_INET, sin_port=htons(37895), sin_addr=inet_addr("237.1.1.1")}, 16) = 237
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {1, 922649}) = 0 (Timeout)
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {0, 74948}) = 0 (Timeout)
sendto(25, "\1\0\274\256\305\n\\l\0\5\5\1\2\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 237, 0, {sa_family=AF_INET, sin_port=htons(37895), sin_addr=inet_addr("237.1.1.1")}, 16) = 237
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {1, 955768}) = 0 (Timeout)
sendto(29, "\355\1\2\0\0\1\0\0\0\0\0\0(S\330\354\357\377\355\377\3\2\0\f(S\330\354\254\26u\214"..., 36, 0, {sa_family=AF_INET, sin_port=htons(3792), sin_addr=inet_addr("239.255.237.2")}, 16) = 36
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {0, 41614}) = 0 (Timeout)
sendto(25, "\1\0\274\256\305\n\\l\0\5\5\1\2\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 237, 0, {sa_family=AF_INET, sin_port=htons(37895), sin_addr=inet_addr("237.1.1.1")}, 16) = 237
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {1, 999849}) = 1 (in [15], left {0, 836314})
ioctl(15, FIONREAD, [22])               = 0
read(15, "\22\0\0\20", 4)               = 4
read(15, "\10\1\20J\32\nGetPlugins\"\0", 18) = 18
write(18, "\r\1\0\20\10\2\20J\"\206\2\n\v\10\1\22\5Dummy\30\1\n\f\10\2\22\6Ar"..., 273) = 273
select(32, [5 11 15 22 23 24 25 26 27 28 29 31], [], NULL, {0, 835909}) = 1 (in [15], left {0, 835904})
Drachenkaetzchen commented 10 years ago

I've disabled all plugins, and closed the web frontend. Discovery still hangs sometimes, and strace shows the following:

select(23, [5 11 15 22], [], NULL, {9, 999994}) = 0 (Timeout)
select(23, [5 11 15 22], [], NULL, {9, 999993}) = 0 (Timeout)
select(23, [5 11 15 22], [], NULL, {9, 999993}) = 0 (Timeout)
select(23, [5 11 15 22], [], NULL, {9, 999994}) = 0 (Timeout)
select(23, [5 11 15 22], [], NULL, {9, 999994}) = 0 (Timeout)
select(23, [5 11 15 22], [], NULL, {9, 999993}) = 0 (Timeout)
select(23, [5 11 15 22], [], NULL, {9, 999992}) = 0 (Timeout)
nomis52 commented 10 years ago

We tracked this down to the Enttec widget not returning a '12' message after a DUB. It seems to happen with a hub, when a device is on each arm of the hub.

See http://paste.debian.net/83857/, in particular the last few lines:

user.info: Feb 24 18:20:35 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 2000:00000000 - 3fff:ffffffff user.info: Feb 24 18:20:35 olad: EnttecUsbProWidget.cpp:828: Sending message with label 11, size 38 user.info: Feb 24 18:20:35 olad: EnttecUsbProWidget.cpp:858: received message from widget, label 12, size 0 user.info: Feb 24 18:20:35 olad: EnttecUsbProWidget.cpp:343: Sending DUB packet: 0000:00000000 - 1fff:ffffffff user.info: Feb 24 18:20:35 olad: EnttecUsbProWidget.cpp:828: Sending message with label 11, size 38 user.info: Feb 24 18:20:35 olad: EnttecUsbProWidget.cpp:858: received message from widget, label 5, size 18

alvangee commented 9 years ago

Just updated to version 0.9.5 from git and got ENTTEC DMX USB PRO Mk2 hanging. It happens not only on Full Discovery, but occasionally when requesting parameters in Web UI. Couldn't run RDM tests as it hangs at the very first one. Before upgrading to 0.9.5 everything was working fine, including RDM tests. I use VMWare on Win8 64bit Is there a way to downgrade back to 0.9.4 or resolve this issue in 0.9.5?

nomis52 commented 9 years ago

@alvangee do you have the logs ? Please run with debug mode (-l 4) .

alvangee commented 9 years ago

@nomis52 Well, sorry for bothering. Looks like it was problem in the bus so controller was thinking there's a collision, thus constantly trying to get correct response hanging in Discovery forever. Now it works fine, including RDM tests.

nomis52 commented 9 years ago

Can you reproduce it at all? I'm convinced we have a bug here so we need to understand it to get it fixed.

Simon

alvangee commented 9 years ago

I'll try to get it hanging again. BTW, not very familiar with Linux and it's file system and all that. After running OLA with -l 4 where do I get log file? And should I show it here or send it via e-mail to you?

peternewman commented 9 years ago

Isn't this similar to the problems Arjan was having when he wasn't returning replies as the Enttec unit?

nomis52 commented 9 years ago

@alvangee : if you're running the git version, you can do:

olad -l 3 2>&1 | tee /tmp/olad.log

That will log the /tmp/olad.log as well as printing everything on the screen. Once you've triggered the issue you can use Control-C to stop olad. The log file will be in /tmp/olad.log, you can either upload it somewhere and paste the link or email it to me directly.

If you have questions jump into the IRC chat room and there should be someone around to help you.

alvangee commented 9 years ago

Apparently, a very easy way to hang PRO Mk2 in Discovery is to connect responders' RS485 A+ and B- lines together. Our responder has two parallel RS485 inputs which are connected, so you can have multiple devices on a bus. Now, I connect one pair of inputs to PRO Mk2, connect two wires to another pair of inputs and when this wires touch each other - here we go.

The log here shows first succesful Discovery, when lines were not in contact, then I connect them and run Discovery again - this time it hangs forever. If I disconnect RS485 lines during this hanging it's no use, OLA stays in Discovery until you stop OLA with Control-C or close terminal.

Here's the log: http://paste.debian.net/163301

nightrune commented 9 years ago

@alvangee Can you make a quick sketch of the wiring in the system?

nomis52 commented 9 years ago

This is very helpful thanks, it's really starting to sound like a bug in the Enttec unit.

Can you capture what's happening on the RS485 line? Do you have a logic analyser ?

I've pushed an update to my repo: https://github.com/nomis52/ola.git which adds logging for received messages. Do you mind building that and running the failure case again, possibly with the logic capture.

nomis52 commented 9 years ago

The change is in the main repo now.

alvangee commented 9 years ago

I don't have a logic analyser, but I have two Pro Mk2s, so could use the second with Enttec RDM Sniffer - maybe this will show something. Due to time zones difference (I'm in Moscow, Russia), I'll be able to do this only tomorrow, in about 14-15 hours from now. I'll provide a sketch of wiring too.

nomis52 commented 9 years ago

Thanks. Once I can reproduce it we can file a bug report with Enttec.

alvangee commented 9 years ago

Here we go. I connected two Enttec Mk2 to responder, one used with OLA, second was used with RDM Sniffer. Log of OLA: http://paste.debian.net/163467 Log of Sniffer: http://paste.debian.net/163469

Looks like Mk2 just stops sending anything on bus, as there's nothing in Sniffer log and the LED on Mk2 doesn't blink Green (as it does when there's something on DMX line 1).

Now, how to reproduce this. Here's the PCB of our Responder: responder_pcb You can see X3 and X4 are RS485 lines. This time I connected one Mk2 to X3 (this one works with OLA) and another Mk2 to X4 (this one works with Sniffer). Then I start OLA and it finds Mk2, creates a Universe, runs Discovery and finds the Responder. Then I request some parameters and run full Discovery couple more times just to show it works fine. Then I connect X3's RS+ and RS- lines (simply short them), run Discovery and it hangs with nothing going from Mk2's lines. Disconnecting RS+ and RS- doesn't help, the OLA just hangs in Discovery forever. Pressing ESC button doesn't help either - even if it stops showing message "RDM Discovery running", the OLA can't refresh any of the parameters or run Discovery again - it shows this message again until ESC is pressed. The only way to get OLA working is to press Ctrl+C to stop it and then restart it. On CTRL+C this message shows up: image

All our responders have double RS485 lines, so I can't check if this happens with single RS input if you short it's RS+ and RS- lines, but it probably will.

Hell, I hope you understand my English and how I got it hanging :) If you have any questions, please ask.

nomis52 commented 9 years ago

Thanks, we're getting closer.

Can you run another test for me? Please pull in the changes from https://github.com/nomis52/ola.git which adds transaction number logging.

Then can you set it up as follows:

Mk2 (OLA) -----> Mk2 (sniffer ) -------> X3

Then do what you did before, run discovery, short X3 and run discovery again.

I don't have a Mk2 but I'll try to reproduce this with the USB Pro tonight.

alvangee commented 9 years ago

Sorry, we need to take a break for weekend. I'm off the work already. I'll run this test on Monday (after 10 am Moscow time).

nomis52 commented 9 years ago

That's fine. No rush.

alvangee commented 9 years ago

Sorry it took so long, second Mk2 was busy with another project.

I must say I'm not very good at using git in Linux via command line. I only used it on Win with GUI before.

On VMware's Ubuntu I use this procedure from tutorial to update OLA: git pull autoreconf -i ./configure --enable-python-libs make make check sudo make install

So I tried to pull using "git pull https://github.com/nomis52/ola.git", but then I was asked about merging, commiting and and told that I need to write commit message. As I wasn't sure what exactly I was doing, I decided to just cancel it.

So please, if you still need me to run this test using that special version of OLA, please tell me how to pull it and not to ruin anything.

Anyway, I tied two Mk2s as you said (that's different order from what I used before) and here's logs. OLA log: http://paste.debian.net/164491 Sniffer log: http://paste.debian.net/164496

There're transaction numbers in sniffer log.

nomis52 commented 9 years ago

It would have been safe to merge.

In any case, the changes are now in the mainline so you can just run 'git pull' to pick them up.

nomis52 commented 9 years ago

No worries, it would have been safe to merge but it's good to be cautious.

alvangee commented 9 years ago

Done. Here's OLA log: http://paste.debian.net/164688 Here's Sniffer log: http://paste.debian.net/164691 I did several normal Discoveries, then the last one was with shorted X3's RS+ and RS-.

One more thing. In Sniffer log you can see two last messages have RDM Errors Packet Type. It's not seen in log, but in Sniffer window this messages also had "bad timing" marks and unusual duration. I run the test three more times and always, when I short X3, this messages of last hung Discovery are similar - bad timing and long duration. I did the screenshot of last of my tests - it's not the same I got the logs with, but shows exactly the same thing: sniffer_window

nomis52 commented 9 years ago

alvangee: do you want to drop me an email so I have your address and then I can cc you on the thread with Enttec.

I can't reproduce this with my USB Pro (Mk1) so I suspect it's a Mk2 issue.

In the meantime I'll see if we can work around it in OLA.

alvangee commented 9 years ago

I made my email visible.

nomis52 commented 9 years ago

Thanks, I've started a thread with Enttec (you're cc'ed) and have https://github.com/OpenLightingProject/ola/pull/697 which tries to detect the dropped response. Once that change is merged, can you rebuild and see if you get a "Enttec watchdog fired" message in the logs.

alvangee commented 9 years ago

I ran test several times. On fifth time I finally got "Enttec watchdog fired" message on line 771: http://paste.debian.net/165200 One more test with the message ran with olad -l 3, not -l 4: http://paste.debian.net/165201

roelofsw commented 9 years ago

Greetings, I've stumbled onto this thread while playing with Ja-Rule and the same Enttec unit, and getting watchdog fired errors. Has there been any resolution from Enttec on this issue? Kind regards,

nomis52 commented 9 years ago

No, no response. I'm going to try to get access to a MK2 unit so I can reproduce the issue.

nomis52 commented 9 years ago

I've got hold of an MK2 unit, now I'm just trying to reproduce the failure.

nomis52 commented 9 years ago

Confirmed with firmware 4.10. I'll chat to Enttec today and see if we can work around it in OLA.