Deep-Symmetry / beat-link

Java library for synchronizing and communicating with Pioneer DJ Link equipment
Eclipse Public License 2.0
207 stars 24 forks source link

ConnectionManager loses db server port when CDJ disconnects, but won't gain it again #35

Closed ben-xo closed 5 years ago

ben-xo commented 5 years ago

ConnectionManager -1s the db server port when the relevant CDJ disconnects in the deviceLost() method. However, the paired deviceFound() method is only triggered the first time a device is seen (as DeviceFinder::deliverFoundAnnouncement() has a guard which only delivers if the device is new, on line 220). The upshot is that if the CDJ disconnects and then rejoins the network, its db server port is stuck at -1.

This manifests as a sudden spewing of messages from MetadataFinder (once per CdjStatus packet) all saying the same thing:

ERROR [MetadataFinder metadata request] (MetadataFinder.java:114) - Problem requesting metadata, returning null
java.lang.IllegalStateException: Player 2 does not have a db server requesting metadata
    at org.deepsymmetry.beatlink.dbserver.ConnectionManager.allocateClient(ConnectionManager.java:113)
    at org.deepsymmetry.beatlink.dbserver.ConnectionManager.invokeWithClientSession(ConnectionManager.java:192)
    at org.deepsymmetry.beatlink.data.MetadataFinder.requestMetadataInternal(MetadataFinder.java:112)
    at org.deepsymmetry.beatlink.data.MetadataFinder.access$1100(MetadataFinder.java:41)
    at org.deepsymmetry.beatlink.data.MetadataFinder$8.run(MetadataFinder.java:1978)
    at java.lang.Thread.run(Thread.java:748)

version: 0.5.2

brunchboy commented 5 years ago

Thanks for the report. I’m traveling dealing with some family issues for a bit so I probably won’t be able to work on this in the next few days (not even sure when I’ll get WiFi for my laptop, and no Pioneer gear with me) but this is a good clue towards fixing a long-standing problem!

ben-xo commented 5 years ago

Hey no worries, hope everything's okay. I'll take a look at it as well. To me it feels a bit like it might be a race condition in the lifecycles.

It's not 100% clear from the docs in what order (or even, which of) the main singletons you need to start(). They all seem to instantiate each other when needed, and then listen to each others lifecycles to tear themselves down; and I feel as if the problem is somewhere in that area.

I have a couple of ideas I'm checking out, but there are some parts of the lifecycle that could just do with a little more logging (such as DeviceFinder::expireDevice()).

I'm have a few hunches, but the main one is that perhaps DeviceFinder::expireDevice() races with deviceFound() (they're both threaded after all), and maybe sometimes fire in the wrong order?

I am reliably able to reproduce the problem if I power-cycle the CDJ; and can only then fix it by cycling my app.

ben-xo commented 5 years ago

I also have a related issue (although not identical): if i start my app first then power-on the CDJ second (or: connect the network cable to an already-on CDJ), then the first exception I get is a socket-timeout:

2019-10-05 11:44:26,520  INFO [main] (Application.java:189) - Starting VirtualCDJ…
2019-10-05 11:44:26,522  INFO [main] (VirtualCdj.java:584) - Found matching network interface en8 (en8), will use address /169.254.55.40/16 [/169.254.255.255]
2019-10-05 11:44:27,873  INFO [main] (Application.java:213) - Starting MetadataFinder…

[ first exception ]

2019-10-05 11:44:37,876  WARN [main] (ConnectionManager.java:292) - Problem requesting database server port number
java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at org.deepsymmetry.beatlink.dbserver.ConnectionManager.requestPlayerDBServerPort(ConnectionManager.java:279)
    at org.deepsymmetry.beatlink.dbserver.ConnectionManager.start(ConnectionManager.java:463)
    at org.deepsymmetry.beatlink.data.MetadataFinder.start(MetadataFinder.java:1358)
    at am.xo.cdjscrobbler.Application.startVirtualCdj(Application.java:218)
    at am.xo.cdjscrobbler.Application.start(Application.java:138)
    at am.xo.cdjscrobbler.Application.main(Application.java:82)
2019-10-05 11:44:37,879  INFO [main] (MetadataFinder.java:351) - Processing device found, number:2, name:"CDJ-2000".
2019-10-05 11:44:37,879  INFO [main] (Application.java:140) - Starting QueueProcessor…
2019-10-05 11:44:38,034  INFO [Thread-1] (MetadataFinder.java:884) - Reporting media mounted in SlotReference[player:2, slot:USB_SLOT]
2019-10-05 11:44:38,042  INFO [Thread-1] (MetadataFinder.java:884) - Reporting media mounted in SlotReference[player:2, slot:CD_SLOT]

[ second exception ]

2019-10-05 11:44:38,044 ERROR [MetadataFinder metadata request] (MetadataFinder.java:124) - Problem requesting metadata, returning null
java.lang.IllegalStateException: Player 2 does not have a db server requesting metadata
    at org.deepsymmetry.beatlink.dbserver.ConnectionManager.allocateClient(ConnectionManager.java:113)
    at org.deepsymmetry.beatlink.dbserver.ConnectionManager.invokeWithClientSession(ConnectionManager.java:192)
    at org.deepsymmetry.beatlink.data.MetadataFinder.requestMetadataInternal(MetadataFinder.java:122)
    at org.deepsymmetry.beatlink.data.MetadataFinder.access$500(MetadataFinder.java:39)
    at org.deepsymmetry.beatlink.data.MetadataFinder$7.run(MetadataFinder.java:1312)
    at java.lang.Thread.run(Thread.java:748)

[... this second exception then loops twice a second ...]
ben-xo commented 5 years ago

Update: I have found that if I drop DeviceFinder::MAXIMUM_AGE down to 2 seconds (instead of 10), then the problem happens a lot less often.

I think this is what's happening (although I base this only on reading the code and looking at log output):

Shortening the timeout for the expiry announcement isn't the proper fix for this. I have a suspicion that part of the problem may be that after one round of device disconnections, MetadataFinder is first in the announcement queue, before virtual cdj - it would usually be the other way round.

ben-xo commented 5 years ago

Created pull request #37 . You don't necessarily have to merge it as-is, as it feels like there might be solution that isn't quite so… zero-knowledge. Nevertheless, this fixes it well enough for me, although a network disconnect still incurs a roughly 30 second penalty. But at least now it recovers of its own accord!

brunchboy commented 5 years ago

There is definitely a lot more state and synchronized methods scattered around in this stuff than I like, I sometimes think about rewriting it all with a cleaner design now that we know what works and what doesn’t (maybe even in Kotlin so we can just use functions instead of singletons for some things), but on the other hand I always have more interesting things to pursue and new features to add. Thanks for the PR, and I am pretty sure the issue you are working on here is the one described at the start of https://github.com/Deep-Symmetry/beat-link-trigger/issues/18

ben-xo commented 5 years ago

What a small world, I also happen to know Marcus Krooked!

ben-xo commented 5 years ago

Regarding the architecture - (just throwing ideas out) it would probably make sense to introduce a "control plane" device / class/ thread that owned all of the instances and responded to their lifecycles, rather than having the lifecycle of each device work independently. That would mean all of your world-inspecting interactions happen in one place.

brunchboy commented 5 years ago

That would be a very different project. The idea here is to allow you to use whatever pieces you need, and have them manage their own dependencies as much as they can. Something like Clojure’s immutable data structures and atomic update functions would help make this a lot less fraught.

Anyway, to help me better understand what is happening here, how many CDJs do you have on your network? And what exactly is happening? In the initial bug report it sounded like you were taking a CDJ off the network. But in your comment about dropping MAXIMUM_AGE to 2 seconds you talk about the entire network going away and coming back, which is a completely different scenario. That should probably be a separate issue, so we can focus on one thing at a time? But I am going to do some of my own experimentation now and see what I can find.

brunchboy commented 5 years ago

Also, I take it that you are using a real player number in your application. Does Crate Digger ever succeed at downloading the exported database and start offering the metadata that way? Actually, are you even starting the CrateDigger singleton? It occurs to me that you might not be, and this is a newer and faster way of getting metadata, after an initial pause for it to download and parse the entire database file from the NFS server running in the CDJs. It is, in fact, the only way that you can obtain metadata when there are four physical CDJs on the network. Once you have it running, and once it has got the data for the players’ mounted media, it registers itself with the MetadataFinder and handles metadata requests for those media.

brunchboy commented 5 years ago

I can absolutely reproduce the problem of the ConnectionManager getting its database port number stuck at -1 if I power off a player for a while, so I am adding more logging to see how to fix that. But the reason I haven’t been running into this issue myself is that I have been using the CrateDigger approach to getting metadata for almost a year now, and you should too. I need to add some pointers to the MetadataFinder documentation encouraging people to always also call org.deepsymmetry.beatlink.data.CrateDigger.getInstance().start() whenever they fire up the MetadataFinder.

brunchboy commented 5 years ago

Ok, you’re not going to believe what is happening here. It turns out that this isn’t any sort of lifecycle management problem or race condition in Beat Link. The ConnectionManager is properly receiving the device found announcement, and asking the player what port to use to reach dbserver, and the player itself is replying -1 to us, and we are recording that in the table! Apparently if you ask a player too soon after it has powered up, it is not yet running a dbserver and it responds this way!

ben-xo commented 5 years ago

Hey sorry for the late replies on this! Great discovery and glad that seems to be an appropriate fix. Sorry that I waffled through several states of mind earlier on in the thread. I'll test the fix and get back to you.

Regarding my exact set-up and what is happening to trigger bugs: I actually tested three scenarios:

You are right about the tweak to MAXIMUM_AGE being a separate issue. It may not be relevant and i'll file a new ticket if it is.

You are correct that I was not using CrateDigger - I didn't know about it. Thank you for the tip!