Open brtietz opened 9 years ago
Multi-threaded programming is an area I would be happy to (try and) help with if you need any assistance.
It's tough for me to say whether I could be of any help immediately, but I have a decent understanding of concurrent programming and would love to learn more. One potential idea (depending on how much data needs to be transferred for the runs to synchronize with each other) is to make it distributable. I'm not sure how helpful additional computing time would be, but collectively I could probably contribute 4-6 systems that are currently idle for most of the day.
Currently it takes ~8 kB to specify one of my controllers (in text files); it's a couple hundred doubles. Specifying scores (the results) is only another double or two. This could grow or shrink, depending on whether I find ways to reduce the number of parameters, how soon we can incorporate ZeroMQ, and how much data it takes to specify a structure once I start learning on those. I don't expect it would ever be more than 1 MB per trial. Each trial takes between 10 and 100 seconds and sends a single core to 100%, so it sounds quite distributable.
If we can get it working multi-core first, it will take what I'm doing now from 'over a weekend' down to 'overnight' which is a huge difference. If going from that to fully distributed isn't too much extra work, it would be great to have a few extra machines and get the runtime down to a long lunch. ;)
I should mention upfront that I wouldn't be able to do any implementation until the 14th of this month, but I'll have a lot of time during that week between midterms being over and reading week. Assuming that's okay, here are some of my ideas.
If structured properly, allowing this to be distributed across many systems shouldn't be difficult. There are some design decisions which will make it more/less difficult, though.
The approach that's easiest to implement is to use one NTRT process per thread. The existing code remains single-threaded and simply emits the data it creates now. That data would go to a Python script (presumably), and the Python script would be responsible for synchronizing data. We could make it even easier to implement if we specify that an NTRT process is spawned to run a trial, and once that trial is complete the NTRT process ends (i.e. the Python script spawns an NTRT process, waits for it to finish, processes the data the trial produced, and then spawns another NTRT instance for the subsequent trial).
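To make that concrete, here's a rough sketch of that spawn-wait-process loop. Everything in it is an assumption for illustration -- the binary name, how parameters are passed, and where results land are not NTRT's actual interface:

```python
# Sketch only: assumes one trial == one NTRT process run to completion,
# with parameters and results exchanged through files on disk.
import subprocess

def run_one_trial(ntrt_binary, params_file, results_file):
    # Spawn an NTRT process for this trial and wait for it to finish.
    subprocess.call([ntrt_binary, params_file])
    # Once the process has exited, read back whatever score it produced.
    with open(results_file) as f:
        return float(f.read())

def run_trials(ntrt_binary, write_next_params, num_trials):
    scores = []
    for i in range(num_trials):
        params_file = write_next_params(i, scores)   # write the next controller to disk
        scores.append(run_one_trial(ntrt_binary, params_file, 'results.txt'))
    return scores
```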
Benefits of this approach:
Downside:
As mentioned, if designed properly allowing this to run across multiple systems is easy. Here are a few key points/questions:
1) The more similar the local and global solutions look, the easier this is to distribute. So, I ask: with your current approach, how do you start trials and how do you collect data? Does your Python code simply generate a controller with the new parameters, launch an NTRT process, and then, once the NTRT trial is finished, process the data that came out and start another trial? If that's the case, I would suggest we not use ZeroMQ's messaging immediately (I should mention that ZeroMQ does offer the ability to send messages via TCP, so using ZeroMQ even with a globally distributed solution is possible). ZeroMQ would obviously work, but if there is such a small amount of communication between the Python code and your trial, and you already have it written to just write data out to disk, we should just leave it that way. I don't see ZeroMQ offering us any significant advantages here (hopefully Ryan can correct me if I'm overlooking something).
2) As far as sharing computing power, the easiest way to do that is to have a script which can build a separate NTRT install, plus a small Python process that waits in the background. If the system is idle for more than x minutes, we'll make that system available to help run trials (it'll register with the master and await instructions). When you go to run a trial you can insist that it only be run locally, or you can ask that it be distributed to other systems. If you ask for it to be distributed, the master (which will have a list of all currently idle systems, since workers will heartbeat every minute or so) can send out the job and get you some additional processing power. If someone returns to their system mid-trial, we'll just kill off all active trials on their box so they don't interfere with their work, and the master will handle re-allocating them. (A rough sketch of what that background worker could look like follows this list.)
3) Adding support for cloud platforms (all of my experience is with AWS) is very easy assuming we go with the simple design I've discussed above. The only part that requires some experience is creating an Amazon Machine Image which contains NTRT, which I can take care of in 30 minutes or so. Then, if you decide you have some trials that you really want to crunch through quickly, you can just log onto AWS, spin up a few instances, let them run, and pay a couple of dollars. Here's Amazon EC2's pricing list: http://aws.amazon.com/ec2/pricing/ . Of course, this isn't something you have to use, but supporting it will be no different than supporting running on idle systems, so I'll send you the instructions on how to do it in case you ever want to.
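As promised in point 2, here's a rough sketch of the background worker idea. The heartbeat "protocol", the idle check, and all the names are placeholders -- none of this is existing NTRT code:

```python
# Sketch only: idle detection and the heartbeat message are stand-ins.
import socket
import time

HEARTBEAT_SECONDS = 60        # check in with the master roughly once a minute
IDLE_THRESHOLD = 15 * 60      # only volunteer after ~15 idle minutes

def idle_seconds():
    # Placeholder: a real worker would ask the OS how long the user
    # has been away (e.g. via psutil or the desktop's idle timer).
    return 0

def send_heartbeat(master_host, master_port, status):
    # Tell the master whether this machine is available for trials.
    conn = socket.create_connection((master_host, master_port), timeout=5)
    try:
        conn.sendall((status + "\n").encode())
    finally:
        conn.close()

def worker_loop(master_host, master_port):
    while True:
        status = "idle" if idle_seconds() > IDLE_THRESHOLD else "busy"
        send_heartbeat(master_host, master_port, status)
        time.sleep(HEARTBEAT_SECONDS)
```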
If we go with the most basic approach:
This is something I could easily finish in 2 days (this includes the process which systems can run so that their power can be used towards trials when idle). If either of the requirements for the basic approach is a no-go, let me know and we can work to find a solution, or I'll provide an updated timeline.
Perry
An aside regarding Amazon Web Services: AWS has educational grants for research/teaching. The next application deadline is March 31st. Details can be found at the link below.
Perry - thanks, that was a lot of helpful info.
To answer your questions:
So to summarize, your simple multi-process approach seems sufficient. The one catch is that the learning library currently doesn't support JSON, so it might be worth the effort to simplify the file structure - my more complex learning runs currently result in up to 110 files (see resources/src/learningSpines/TetrahedralComplex/logs), which is unnecessary. It's probably only a day or two to fix, but I'm not sure if I'll have time to do it until March (several grant and paper deadlines between now and then).
As far as your timeline, I expect my most computationally expensive machine learning will be in March and April, so that sounds fine to me.
When to apply for AWS funding should be a group decision. I expect it wouldn't arrive in time to help with my dissertation, and the grants are up to two years, so we should coordinate on what usage would be over a longer period. I did look at their pricing at one point and noticed our work was too computationally expensive for their free trial offers.
All of this sounds great, thanks for your help!
"At least on my lab desktop, I'd be happy to just limit the number of processes if the computer is in use, if that isn't too much more complicated."
Definitely doable. We can make it use all cores by default, but you can override it in an ini file (similar to MAX_BUILD_CORES in build.conf).
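A minimal sketch of what that override could look like -- the section and option names here are made up, just mirroring the MAX_BUILD_CORES idea rather than any existing NTRT config:

```python
# Sketch only: read an optional core limit from an ini file, default to all cores.
import multiprocessing
try:
    from configparser import ConfigParser           # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser  # Python 2

def max_trial_processes(config_path='learning.conf'):
    parser = ConfigParser()
    parser.read(config_path)                         # silently skips a missing file
    if parser.has_option('concurrency', 'MAX_TRIAL_CORES'):
        return parser.getint('concurrency', 'MAX_TRIAL_CORES')
    return multiprocessing.cpu_count()               # default: use every core
```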
"The one catch is the learning library currently doesn't support JSON, so it might be worth the effort simplify the file structure - my more complex learning runs currently result in up to 110 files (see resources/src/learningSpines/TetrahedralComplex/logs), which is unnecessary."
If we could handle the existing file structure, would that be beneficial? I mentioned JSON as I was under the impression that's what you are already using, but we can certainly support the existing structure without much trouble. Whatever you prefer here is fine -- it's no more difficult to transfer 100 files than it is to transfer 1.
I'm a bit fuzzy on the exact details of how the trial part of this works. Is it simple enough that you can write up a brief walkthrough so I can try the genetic algorithms myself? I only need a few cycles showing what the starting data is, what goes out to the sim from Python, what the sim does, what the sim outputs, how that's then read back into Python and how that affects the next trial.
Short answer: a 'trial' is a function evaluation. Here's a simple example I used to test my new genetic algorithms code: https://github.com/NASA-Tensegrity-Robotics-Toolkit/NTRTsim/blob/master/src/dev/btietz/GATests/AppGATests.cpp
Within the context of NTRT, this means that we run a simulation for n simulation seconds, and evaluate how well the controller did (typically how far the structure moved).
More detail: Everything associated with that occurs in C++ right now. I took the opportunity to write a quick tutorial for the learning library; hopefully it will clear things up. Please ask questions: http://ntrtsim.readthedocs.org/en/latest/learning-library-walkthrough.html
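In toy form (a sketch only -- the real evaluation happens in the C++ apps and the tutorial linked above), a trial's score boils down to something like:

```python
# Toy sketch only; real trials are scored inside the C++ simulator.
import math

def score_trial(start_pos, end_pos):
    # Typical fitness: how far the structure's center of mass moved
    # over the n simulated seconds of the trial.
    return math.hypot(end_pos[0] - start_pos[0], end_pos[1] - start_pos[1])
```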
There are two options on how an online Python adapter could work:
Holding on to our existing parameter file structure is beneficial for interfacing with the neural networks library (with either option above), but if we aren't doing that, then anything that gives C++ a vector of vectors of doubles is just fine. If we feel like refactoring a little, we could update to Boost's multi_array.
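For example, here's a sketch of that hand-off -- the file name and layout are made up, not the learning library's actual parameter format:

```python
# Sketch only: one inner vector per line, doubles separated by spaces.
def write_params(path, params):
    with open(path, 'w') as f:
        for row in params:
            f.write(' '.join('%.17g' % value for value in row) + '\n')

# Two "vectors of doubles" a C++ controller could read back with ifstream >>.
write_params('controller_params.txt', [[0.5, 1.25, -3.0], [2.0, 0.0, 7.5]])
```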
Excellent, thanks Brian.
I don't know enough about what you're doing to weigh in on whether we should go with approach one or two. That said, the Python version is easy to implement (I'll just need to provide you facilities for bringing data in from tests and pushing data out to tests). I'll likely begin with that, then we can explore whether having it done using existing libraries/C++ is worth adding as well.
Hey all, I just want to throw out one quick idea: this is where an existing machine learning library may be of use to us. Given the potential computational load of evolutionary algorithms, I would imagine that there are tools in existence that handle all the back-end infrastructure needed to work on a distributed computation model. While it might not make sense to use a library just for a single-threaded, stand-alone ML application (because the algorithms themselves are actually easy), if the libraries handle all the details of farming out large jobs to multiple processors or AWS instances, that would be worth taking advantage of.
This thread has an interesting discussion on the topic; it leads to this option, which looks promising and under active use: https://github.com/jbmouret/sferes2
vytas
Also, there should be existing tools for farming out and controlling many identical (but parameterized) processes on many machines.
In some ways, I suspect that MapReduce might do the trick, if you can frame the problem correctly as a data reduction problem (there is some search space we are reducing...).
Ideally we use existing tools rather than building our own for this. This is something that has been done many times.
vytas
As far as using existing libraries, no argument there! Dispy (dispy.sourceforge.net) is one I've used before, but there certainly could be better choices out there.
It's tough for me to comment on sferes2/MapReduce specifically related to machine learning, but my main concern with MapReduce is that learning libraries in NTRT will have to conform to it (while a solution like Dispy will be the other way around). If the simulation of structures in NTRT is the more expensive component of learning (as opposed to generating the structures to test), then MapReduce will be less useful. Additionally, there is a decent learning curve to getting started with, say, Hadoop. Testing it is tougher, and a user needs to know the difference between core and task nodes, how to configure HDFS, etc.
On the other hand, if Dispy is used, the transition from a single-system, single-process approach is easy: learning can be distributed to multiple cores, or even to multiple systems. It'll also be much easier to benefit from leftover computing power if we use Dispy: we can include a script with NTRT that users can optionally run, which will use their system for trials when it's idle.
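To give a sense of its shape, here's a sketch based on Dispy's documented usage -- the trial function, binary, and file names are stand-ins, not anything written for NTRT yet:

```python
# Sketch only: Dispy serializes the function and runs it on whatever
# machines have dispynode running (which can just be the local box).
import dispy

def run_trial(params_file):
    # Stand-in for launching one NTRT simulation on the worker node
    # and returning its score.
    import subprocess
    subprocess.call(['./AppExample', params_file])
    return float(open('results.txt').read())

if __name__ == '__main__':
    cluster = dispy.JobCluster(run_trial)
    jobs = [cluster.submit('params_%d.txt' % i) for i in range(100)]
    scores = [job() for job in jobs]   # job() blocks until that trial finishes
```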
In any event, regardless of which approach is used, presumably there will need to be a master with reasonable upload/download bandwidth which is easily accessible (i.e. it either has a public IP, or has the authority to forward ports if it's NATted). If some space is needed to host a master, let me know and I'd be happy to provide a Debian container on the box that hosts BuildBot.
Dispy looks great from a quick glance. Exactly what we need for easily farming out an essentially stand-alone process. I like that it generalizes from multi-core to multi-machine, and supports the Amazon cloud too.
Brian -- we could get you set up with Ames credentials and remote login, and then we could set up a cluster of machines here in the lab -- they are just sitting on the shelves right now. I think they are all quad core, and I bet we could get 6-8 of them dedicated to you during the spring -- though during the summer they might all be used by interns... (or not).
If you get things working with Dispy on a single multi-core machine, that would be a good first step...
vytas
Another option is BOINC: http://boinc.berkeley.edu/
BOINC is more complex than Dispy/RPyC, but it's been designed for volunteer computing. That's useful as it already includes idle detection and enabling/disabling tasks on a system, and it provides a solid web interface, while Dispy is entirely console based.
It'll be a week or two before I can look into it more thoroughly, but I'll post back once I'm done with a comparison of the two. From there we can determine which is a better fit for our needs.
Sorry for the delay in getting back to you about this, Brian. Been quite busy recently. Should ease up a fair bit once my thesis is done.
Sorry for being slow on this Brian -- end of the semester is keeping me super busy.
For clustering, would it pose much of a problem if we can't support any custom libraries for now? You would still be able to contain a job inside some git branch (and have custom code there that you build), but avoiding the headache of a separate setup will make this a fair bit easier.
I'm not sure I fully understand the question, but I think just running from a branch of NTRT (with our existing libraries) will be fine if that's easier for you. Integration of a package like Dispy can wait if it's a hassle.
And multi-process on a single machine is an 8x speedup over a single process, so once you have that, feel free to hold off on clustering for as long as you need in order to do it right.
Thanks, and good luck with the end of the semester!
The multi-process, single-machine version will almost certainly be done this weekend. I'll send a message to the list (or in this thread) as soon as it's good to go!
Brian, the concurrent_processor branch has a working multi-threaded version of your job code.
A few notes:
1) You'll need to install the Python psutil package. I should be doing this using the proper virtualenv setup Ryan created, but given that this solution will be thrown out when we complete clustering, you'll need to just install it globally for now using "sudo pip install psutil".
2) There's a second parameter now to your NTRTJobMaster script which is the number of concurrent processes you wish to run. This shouldn't exceed the total number of cores in your system (might want to do cores - 1 if you plan to work on the system as it's running jobs).
3) I'm not properly cleaning up the zombie processes that are forked. As a result, you're limited to about 30,000 trials in a single call to NTRTJobMaster.py. I'm assuming that's not a problem, but let me know if it is. The reason for the ~30k limit is that by not handling the forks cleanly, we're not cleaning up the PIDs until the entire Python script exits. Fixing this is possible if that limit poses a problem (a rough sketch of one way to do it follows this list); I didn't bother for now as it would require additional time and this is just intended as a temporary stopgap until I can finish full-fledged clustering.
4) You'll want to look at the beginTrial method in BrianMaster. Specifically, at around line 220 you'll need to modify the code there to process the results as you would like.
5) The spawned jobs now have their output written to /dev/null. I did this so you still get some helpful logging showing how the overall jobs are progressing. If you want to modify this, take a look at line 267 in NTRTJobMaster.py; you'll want to remove the "stdout=devNull" parameter from the subprocess.call.
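For anyone curious what fixing note 3 could look like, here's a rough sketch (hypothetical helpers, not what the branch actually does) of forking a trial with its output silenced as in note 5, and reaping finished children so their PIDs are released:

```python
# Rough sketch only: fork a trial whose own output goes to /dev/null,
# and reap finished children so they don't accumulate as zombies.
import os
import subprocess

def fork_trial(command):
    pid = os.fork()
    if pid == 0:                     # child: run the NTRT trial, then exit
        with open(os.devnull, 'w') as devNull:
            subprocess.call(command, stdout=devNull)
        os._exit(0)
    return pid                       # parent: remember the child's PID

def reap_finished(active_pids):
    # Collect exit statuses of any finished children so their PIDs
    # are freed between trials instead of at script exit.
    still_running = []
    for pid in active_pids:
        finished, _status = os.waitpid(pid, os.WNOHANG)
        if finished == 0:            # child hasn't exited yet
            still_running.append(pid)
    return still_running
```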
Hopefully this all works without a hitch for you, but if it doesn't please let me know!
Perry
If anyone else is going to attempt this in the near future, note that a prerequisite is installing pip, the Python package installer. This can be done with "sudo apt-get install python-pip".
Also, if your Python install is really bare bones, you may need python-dev as well. If you get this error while trying to install psutil using pip:
psutil/_psutil_linux.c:12:20: fatal error: Python.h: No such file or directory
compilation terminated.
installing python-dev is the solution.
Any reason we can't close this issue? Might be worth moving some of the conversation here into docs first, but otherwise I think this is done.
One thing that we discussed in the genetic algorithms conference call a few weeks ago but never got posted here was that it would be nice to be able to multi-thread learning runs. For Monte Carlo this is trivial: the user can manually start multiple processes (using screen or similar), dump everything to a file, and sort it out later. For a genetic algorithm this is more difficult, since the scores need to synchronize more often, so a program needs to start each process and collect the information.
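To illustrate that synchronization point (a sketch only, not existing NTRT code): every controller in a generation has to be scored before the next generation can be bred, which with Python's multiprocessing could look roughly like this, with the trial function as a placeholder:

```python
# Sketch only: a stand-in trial and a pool that scores one generation at a time.
from multiprocessing import Pool
import random

def run_trial(params):
    # Placeholder: the real version would launch an NTRT simulation with
    # this controller's parameters and return its score (e.g. distance moved).
    return random.random()

def evaluate_generation(population, workers=4):
    # pool.map blocks until every trial in the generation has finished --
    # that blocking is the synchronization a genetic algorithm needs before
    # it can select and breed the next generation.
    pool = Pool(processes=workers)
    scores = pool.map(run_trial, population)
    pool.close()
    pool.join()
    return scores
```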
Ryan thought this might be easiest using Python and sockets to spin up each thread. I'd love to hear more about that approach and perhaps start implementing it. One advantage of that approach is that it might generalize to cloud computing, if we ever get funding for that.
If anyone else has other ideas I'd love to hear them, I have very little experience with multi-threaded programming.