RobotLocomotion / director

A robotics interface and visualization framework, with extensive applications for working with http://drake.mit.edu
BSD 3-Clause "New" or "Revised" License

Supporting zeromq for drake-visualizer #586

Open rdeits opened 6 years ago

rdeits commented 6 years ago

I'm still using drake-visualizer (in RemoteTreeViewer mode) as my visualizer daily, and it's great, but I'm finding more and more that I'm running into issues trying to visualize meshes or pointclouds that are too large for LCM. We could try to fix LCM to enable message chunking, but since the tree viewer is already using its own message encoding, it's really not getting any benefit from LCM at all. @patmarion I know we talked about using ZMQ in the past, and I think being able to use tcp or ipc would probably resolve the issues I'm having with LCM+udp. As I recall, the issue was needing to spawn a new event loop to handle the ZMQ requests, since the threading python module doesn't work in Director. Is that still the case?

How would I go about making this happen?

patmarion commented 6 years ago

I think threading can be used in Director successfully. Last year I wrote a class named TaskRunner that helps run threads in Director:

import time

from director.taskrunner import TaskRunner

def test():
    # runs forever on a background thread managed by the TaskRunner
    while True:
        print('on thread')
        time.sleep(1.0)

t = TaskRunner()
t.callOnThread(test)

(note, this class was only in spartan, but I just merged a PR to add it to director/master)

The tricky part with threads in Director is that the Qt C++ event loop has control, not the Python runtime. The python thread starts but has limited opportunities to get scheduled, so your thread will be sluggish. The other problem is that interacting with C++ objects (like the vtk visualization window) is best done exclusively from the main python thread.

The TaskRunner solves this problem by ensuring the Qt event loop pumps the Python runtime. Thread performance won't be quite as good as if you run your program directly with the python interpreter, but it works pretty well for most purposes, and I think it works great if your thread just does blocking IO like zeromq / sockets.

Sample (written for python3):

import sys
import time
import zmq

def server():
    context = zmq.Context.instance()
    sock = context.socket(zmq.REP)
    sock.bind('tcp://*:8089')

    while True:
        message = sock.recv().decode('utf-8')
        print('received message:', message)
        sock.send_string('ack ' + message)

def client():
    context = zmq.Context.instance()
    sock = context.socket(zmq.REQ)
    sock.connect('tcp://localhost:8089')

    for i in range(100):
        message = 'message {}'.format(i)
        sock.send_string(message)
        response = sock.recv().decode('utf-8')
        print('response:', response)
        time.sleep(1.0)

if __name__ == '__main__':

    from director import taskrunner

    taskRunner = taskrunner.TaskRunner()
    taskRunner.callOnThread(server)
    taskRunner.callOnThread(client)

rdeits commented 6 years ago

This is perfect, thanks! I'll try it out today.

rdeits commented 6 years ago

Cool, it works! I'm getting some warnings when I try to actually create geometry from the taskrunner thread, though:

QPixmap: It is not safe to use pixmaps outside the GUI thread
QObject::setParent: Cannot set parent, new parent is in a different thread

Is that to be expected, given what you mentioned about interacting with vtk from the python main thread? The geometry does show up correctly, despite the warnings. My implementation is here: https://github.com/RobotLocomotion/director/compare/master...rdeits:treeviewer-zmq?expand=1

rdeits commented 6 years ago

Wow, this is already amazing. Using ZeroMQ with MsgPack + msgpack-numpy for arrays, I can serialize, transmit, deserialize, and render 1,000,000 points in 50 ms.

patmarion commented 6 years ago

You could ignore the warnings, but I think they are correctly identifying potential issues where your thread is interacting with objects owned by the main thread. Your thread and the main thread aren't running concurrently due to the way Python schedules things, but it could still be a problem.

If you want to avoid issues like this, a good strategy is to use your thread just for zeromq blocking IO, but do all the message processing on the main thread. The TaskRunner has a helper for this:

def threadFunction():
  message = waitForMessage()
  taskRunner.callOnMain(lambda: processMessage(message))

callOnMain returns instantly; all it does is schedule your function to be called on the main thread eventually. The TaskRunner uses a 60 Hz timer to periodically call these scheduled functions. Or, you can manage your own timer to do periodic processing:

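import time
from director.timercallback import TimerCallback

msgs = []

def processPendingMessages():
  while msgs:
    # pop(0) takes the oldest message first, so messages are handled in arrival order
    msg = msgs.pop(0)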

def threadFunction():
  while True:
    time.sleep(1.0)
    msgs.append('message')

# produce messages on thread
taskRunner.callOnThread(threadFunction)

# periodically process messages on main
timer = TimerCallback(callback=processPendingMessages, targetFps=60)
timer.start()

patmarion commented 6 years ago

By the way, I fixed the CDash issue for Director's travis-ci, so travis-ci now passes and uploads binaries to bintray again. If you need binaries that include taskrunner, they are here:

https://bintray.com/patmarion/director/director/0.1.0-266-g071a233#files

rdeits commented 6 years ago

Got it, thanks!

rdeits commented 6 years ago

Ok, I've nearly got it all working properly. One issue I noticed is that sending a large number of draw commands bogs down and eventually crashes the whole Director app. I've traced it down to the fact that TaskRunner.callOnMain() calls self.timer.start() every time. That seems to cause some cumulative degradation of the app performance and eventually crashes all of Director. It looks like this is because TimerCallback.isActive() is always returning False inside the thread callback, so every call to start() results in a new call to self.timer.connect().

Removing the call to self.timer.start() inside callOnMain() totally resolves the issue, but I'm not sure if it's the right thing to do.
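
To illustrate why repeated connections would add up, here is a minimal sketch using stand-in classes. FakeSignal and FakeTimer are hypothetical and are not part of Director or Qt; the sketch only assumes that, as with a default Qt connection, connecting the same callback several times makes it fire once per connection.

```python
class FakeSignal(object):
    """Hypothetical stand-in for a Qt signal: one call per connection."""

    def __init__(self):
        self.slots = []

    def connect(self, slot):
        # Duplicate connections are not filtered out.
        self.slots.append(slot)

    def emit(self):
        for slot in self.slots:
            slot()


class FakeTimer(object):
    """Hypothetical timer whose start() connects its callback."""

    def __init__(self, callback, guard):
        self.timeout = FakeSignal()
        self.callback = callback
        self.guard = guard
        self.connected = False

    def start(self):
        # With the guard enabled, the callback is connected only once.
        if self.guard and self.connected:
            return
        self.timeout.connect(self.callback)
        self.connected = True


def count_calls_per_tick(guard, starts=3):
    calls = []
    timer = FakeTimer(lambda: calls.append(1), guard)
    for _ in range(starts):
        timer.start()
    timer.timeout.emit()  # simulate a single timer tick
    return len(calls)


unguarded = count_calls_per_tick(guard=False)
guarded = count_calls_per_tick(guard=True)
print(unguarded, guarded)
```

Without the guard, every extra start() adds one more callback invocation per tick, which would match the cumulative slowdown I'm seeing.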

patmarion commented 6 years ago

I am not able to reproduce this issue, but I suspect there is a problem with the implementation of callOnMain(). It's implemented in a more complicated way than it should be (using another class I wrote called asynctaskrunner).

I am going to recommend that you implement it without using callOnMain() and instead use the pattern that I showed in the 2nd example of this comment:

https://github.com/RobotLocomotion/director/issues/586#issuecomment-362017527

In that pattern, there are no timers starting and stopping, just one timer that is started once.
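
A minimal standalone sketch of that pattern, assuming Python 3 and using only the standard library so it can be run outside Director: a plain threading.Thread stands in for taskRunner.callOnThread, a polling loop stands in for the single TimerCallback, and the names io_thread, pending, and processed are illustrative only.

```python
import queue
import threading
import time

pending = queue.Queue()  # thread-safe handoff between the two threads
processed = []

def io_thread(count):
    # Stand-in for blocking zeromq IO: only receive and enqueue here.
    for i in range(count):
        time.sleep(0.01)
        pending.put('message {}'.format(i))

def process_pending_messages():
    # Called periodically on the main thread; drains whatever has arrived.
    while True:
        try:
            msg = pending.get_nowait()
        except queue.Empty:
            break
        processed.append(msg)

worker = threading.Thread(target=io_thread, args=(5,))
worker.daemon = True
worker.start()

# Stand-in for the single, always-running timer: poll at roughly 60 Hz.
deadline = time.time() + 2.0
while len(processed) < 5 and time.time() < deadline:
    process_pending_messages()
    time.sleep(1.0 / 60)

worker.join()
print(processed)
```

The worker thread never touches anything except the queue, so all processing (and, in Director, all interaction with Qt or vtk objects) stays on the main thread.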

patmarion commented 6 years ago

Can you try this diff:

index e00f64e..740bfd3 100644
--- a/src/python/director/taskrunner.py
+++ b/src/python/director/taskrunner.py
@@ -22,6 +22,7 @@ class TaskRunner(object):
         self.pendingTasks = []
         self.threads = []
         self.timer = TimerCallback(callback=self._onTimer, targetFps=1/self.interval)
+        self.timer.disableScheduledTimer()

         # call timer.start here to initialize the QTimer now on the main thread
         self.timer.start()

If you are building from source, don't forget to run make again after modifying the Python source.

rdeits commented 6 years ago

Thanks! Both of your suggestions fixed the issue. I think I like the pattern from https://github.com/RobotLocomotion/director/issues/586#issuecomment-362017527 a bit better, so I'll probably go with that.