brtietz commented 9 years ago

One thing that we discussed in the genetic algorithms conference call a few weeks ago but never got posted here was that it would be nice to be able to multi-thread learning runs. For monte-carlo this is trivial, the user can manually start multiple processes (using screen or similar), and just dump everything to a file and sort it out later. For a genetic algorithm this is more difficult, since the scores need to synchronize more often a program needs to start each process and collect information.

Ryan thought this might be easiest using Python and sockets to spin up each thread. I'd love to hear more about that approach and perhaps start implementing it. One advantage of that approach is it might generalize to cloud computing, if we ever get funding for that.

If anyone else has other ideas I'd love to hear them, I have very little experience with multi-threaded programming.

PerryBhandal commented 9 years ago

Multi-threaded programming is an area I would be happy to (try and) help with if you need any assistance.

It's tough for me to say whether I could be of any help immediately, but I have a decent understanding of concurrent programming and would love to learn more. One potential idea (depending on how much data needs to be transferred for the runs to synchronize with each other) is to make it distributable. I'm not sure how helpful additional computing time would be, but collectively I could probably contribute 4-6 systems that are currently idle for most of the day.

brtietz commented 9 years ago

Currently it takes ~8kB to specify one of my controllers (in text files), its a couple hundred doubles. Specifying scores (the results) is only another double or two. This could grow or shrink, depending on whether I find ways to reduce the number of parameters, how soon we can incorperate zeroMQ, and how much data it takes to specify a structure once I start learning on those. I don't expect it would ever be more than 1 mB per trial. Each trial takes between 10 and 100 seconds and sends a single core to 100%, so sounds quite distributable.

If we can get it working multi-core first, it will take what I'm doing now from 'over a weekend' down to 'overnight' which is a huge difference. If going from that to fully distributed isn't too much extra work, it would be great to have a few extra machines and get the runtime down to a long lunch. ;)

PerryBhandal commented 9 years ago

I should mention upfront that I wouldn't be able to do any implementation until the 14th of this month, but I'll have a lot of time during that week between midterms being over and reading week. Assuming that's okay, here are some of my ideas.

Making it distributable locally (single system)

If structured properly, allowing this to be distributed across many systems shouldn't be difficult. There are some design decisions which will make it more/less difficult, though.

Design Decision One: Separate threads, or separate processes?

The approach that's easiest to implement is to use one NTRT process per thread. So the existing code remains single-threaded and simply emits the data it creates now. That data would go into a Python script (presumably) and the Python script would be responsible for synchronizing data. We could make it even easier if to implement if we specify that an NTRT process is spawned to run a trial, and once that trial is complete that NTRT process ends (i.e. the Python script spawns an NTRT process, waits for it to finish, the NTRT process finishes and closes, then the Python script processes the data the trial produced and spawns another NTRT instance for the subsequent trial).

Benefits of this approach:

No multi-threading is done in C++. In fact, there won't be any multi-threading done anywhere -- we'll get the benefits of concurrent trials without the pain of writing multi-threaded code.
If a single trial has a failure (is that common? how often do you get exceptions that are unhandled or trials that otherwise fail?) we're not risking the loss of all other work that was completed. There are, of course, ways to handle failure even if we have multiple threads running in a single process, but if we can avoid it we may as well. With a single process approach if a job dispatched by the master fails, it can just let that process die off and decide what to do from there (re-allocate the job? is there some problem inherent to the job, in which case should we be dispatching it again?)
Taking this from being distributable on one system to distributable across many systems would be easy. The only thing we would need to do is add some method for transferring the communication between each learning 'instance' and the master server. IMO HTTP would be a good choice here, but I'll defer to Ryan (if we use HTTP a good choice might be Python's Flask framework).

Downside:

Communication between different trials mid-trial will be more difficult (in fact, if we go with a separate process approach I would suggest we not do any inter-trial communication unless absolutely necessary). So that raises a key question: if each trial takes about 10-100 seconds, is there any communication that occurs during that 10 to 100 second period? Or is it fine if it's given parameters for a trial, it runs the trial, produces data, submits that data to the synchronization master and then waits for the next trial? Additionally, they're not going to have shared memory. If you want separate trials to have access to the same data structures, we're going to have to do this in a single process.
We can make this even easier to code by not only using separate processes, but also spawning one process per trial. So you have a trial that needs to be run, the master spawns that process, provides it the data it needs, the trial runs, finishes, writes data out, the process exits, the master then processes the data that was produced. Again, the big benefit here is the ease of coding it, the disadvantage is that if startup/shutdown takes a long time you're going to have that time added on to each trial. The alternative to this is, of course, to have a process that runs a trial, finishes the trial, the process remains open and grabs the instructions for another trial and continually repeats this process.

A locally distributable solution should look similar to a globally distributable solution

As mentioned, if designed properly allowing this to run across multiple systems is easy. Here are a few key points/questions:

1) The more similar the local and global solutions look, the easier this is to distribute. So, I ask: with your current approach, how do you start trials and how do you collect data? Does your Python code simply generate a controller with the new parameters, then launch an NTRT process, then once the NTRT trial is finished it processes the data that came out and starts another trial? If that's the case, I would suggest we not use ZeroMQ's messaging immediately (I should mention that ZeroMQ does offer the ability to send messages via TCP, so using ZeroMQ even with a globally distributed solution is possible). ZeroMQ would obviously work, but if there is such a small amount of communication between the Python code and your trial, and you already have it written to just write data out to disk we should just leave it that way. I don't see ZeroMQ offering us any significant advantages here (hopefully Ryan can correct me if I'm overlooking something).

2) As far as sharing computing power, the easiest way to do that is to have a script which can build a separate NTRT install and we have a little Python process that waits in the background. If the system is idle for > x minutes, we'll make that system available to help run trials (it'll register with the master and await instructions). When you go to run a trial you can insist that it only be run locally, or you can ask that it be distributed to other systems. If you ask it to be distributed to other systems the master (which will have a list of all systems that are currently idle as systems will heartbeat every ~minute) can send that job and get you some additional processing power. If someone returns to their system mid-trial, it'll just kill off all active trials on their box so it doesn't interfere with their work and the master will handle re-allocating them.

3) Adding easy support for cloud platforms (all of my experience is with AWS) is very easy assuming we go with the simple design I've discussed above. The only part that requires some experience is creating an Amazon Machine Image which contains NTRT, which I can take care of in 30 minutes or so. Then if you decide that you have some trials that you really want to crunch through quickly, you can just log onto AWS and spin up a few instances and let them run and pay a couple of dollars. Here's Amazon EC2's pricing list: http://aws.amazon.com/ec2/pricing/ . Of course, this isn't something you have to use, but supporting it will be no different than supporting running on idle systems, so I'll send you the instructions on how to do it in case you ever want to.

Timeline

If we go with the most basic approach:

No ZeroMQ messaging, we simply read from JSON files to start a trial, and write out the results to another file that Python reads.
Each trial results in a new process. Process starts up, reads structure from JSON, does trial, writes data out, process dies.

This is something I could easily finish in 2 days (this includes the process which systems can run so that their power can be used towards trials when idle). If either of the requirements for the basic approach are a no-go, let me know and we can work to find a solution/I'll provide an updated timeline.

Perry

PerryBhandal commented 9 years ago

An aside regarding Amazon Web Services: AWS has educational grants for research/teaching. The next application deadline is March 31st. Details can be found at the link below.

http://aws.amazon.com/grants/

brtietz commented 9 years ago

Perry - thanks, that was a lot of helpful info.

To answer your questions:

Process failure is uncommon. I do use try-catch to stop trials that I know have issues via thrown exceptions, but once it works (all input errors are handled) error handling is robust enough to run for tens of thousands of trials (at least). Typically if a job has an issue, resetting the simulation and providing it with a different set of parameters is sufficient to solve it.
No communication needs to occur during that 10 - 100 second period. All data is sent at the beginning of the trial and collected at the end of the trial.
My current approach for genetic algorithms is exclusively C++, and is therefore single process. When I do multi-process Monte Carlo, I start each process from the command line in screen, and they all dump to a common .csv file. After 10 - 20 thousand trials, I stop the processes, use a Python script to generate the .nnw files (well, Matlab until last week), and then see what happened, the learning library can automatically parametrize a controller from these .nnw files (they're really just another set of .csv files). So there is no inter-process communication, that's the large remaining problem. Steve and Alex may use more complex Python scripts, I'll let them post here if theirs is more sophisticated.
Startup and shutdown are no different than standard NTRT startup/shutdown, as in less than a second.
At least on my lab desktop, I'd be happy to just limit the number of processes if the computer is in use, if that isn't too much more complicated.

So to summarize, your simple multi-process approach seems sufficient. The one catch is the learning library currently doesn't support JSON, so it might be worth the effort simplify the file structure - my more complex learning runs currently result in up to 110 files (see resources/src/learningSpines/TetrahedralComplex/logs), which is unnecessary. Its probably only a day or two to fix, but I'm not sure if I'll have time to do it until March (several grant and paper deadlines between now and then).

As far as your timeline, I expect my most computationally expensive machine learning will be in March and April, so that sounds fine to me.

When to apply for AWS funding should be a group decision, I expect it wouldn't arrive in time to help with my dissertation, and the grants are up to two years, so we should coordinate on what usage would be over a longer period. I did look at their pricing at one point and notice our work was too computationally expensive for their free trial offers.

All of this sounds great, thanks for your help!

PerryBhandal commented 9 years ago

"At least on my lab desktop, I'd be happy to just limit the number of processes if the computer is in use, if that isn't too much more complicated."

Definitely doable. We can make it use all cores by default, but you can override it in an ini file (similar to MAX_BUILD_CORES in build.conf).

"The one catch is the learning library currently doesn't support JSON, so it might be worth the effort simplify the file structure - my more complex learning runs currently result in up to 110 files (see resources/src/learningSpines/TetrahedralComplex/logs), which is unnecessary."

If we could handle the existing file structure, would that be beneficial? I mentioned JSON as I was under the impression that's what you are already using, but we can certainly support the existing structure without much trouble. Whatever you prefer here is fine -- it's no more difficult to transfer 100 files than it is to transfer 1.

I'm a bit fuzzy on the exact details of how the trial part of this works. Is it simple enough that you can write up a brief walkthrough so I can try the genetic algorithms myself? I only need a few cycles showing what the starting data is, what goes out to the sim from Python, what the sim does, what the sim outputs, how that's then read back into Python and how that affects the next trial.

brtietz commented 9 years ago

Short answer: a 'trial' is a function evaluation. Here's a simple example I used to test my new genetic algorithms code: https://github.com/NASA-Tensegrity-Robotics-Toolkit/NTRTsim/blob/master/src/dev/btietz/GATests/AppGATests.cpp

Within the context of NTRT, this means that we run a simulation for n simulation seconds, and evaluate how well the controller did (typically how far the structure moved).

More detail: Everything associated with that occurs in C++ right now. I took the opportunity to write a quick tutorial for the learning library, hopefully it will clear things up, please ask questions: http://ntrtsim.readthedocs.org/en/latest/learning-library-walkthrough.html

There are two options on how an online Python adapter could work:

The learning algorithm is completely implemented in Python. This would simplify things for those who just want to run Monte Carlo or a simple hill climbing algorithm.
The 'master' computer uses our existing library to generate parameters, sending parameters to the 'slaves' and receiving scores back. In this case we would need to write two adapters, but it may save time since we don't have to re-implement the algorithms.

Holding on to our existing parameter file structure is beneficial for interfacing with the neural networks library (with either option above), but if we aren't doing that then anything that gives C++ a vector of a vector of doubles is just fine. If we feel like refactoring a little, we could update to Boost's multiArrays.

PerryBhandal commented 9 years ago

Excellent, thanks Brian.

I don't know enough about what you're doing to weigh in on whether we should go with approach one or two. That said, the Python version is easy to implement (I'll just need to provide you facilities for bringing data in from tests pushing data out to tests). I'll likely begin with that, then we can explore whether having it done using existing libraries/C++ is worth adding as well.

vsunspiral commented 9 years ago

Hey all, I just want to throw out there one quick idea: -- this is where an existing machine leaning library may be of us. Given the potential computational load of evolutionary algorithms, I would imagine that there are tools in existence that handle all the back end infrastructure to work on a distributed computation model. While it might not make sense to use a library just for a single threaded stand-alone ML application because the algorithms are actually easy -- they the libraries handle all the details of farming out large jobs to multiple processors or AWS instances, that would be worth taking advantage of.

this thread has an interesting discussion on the topic.

http://softwarerecs.stackexchange.com/questions/2415/is-there-a-good-parallel-genetic-algorithm-library-for-c-c

which leads to this option, which looks promising and under active use: https://github.com/jbmouret/sferes2

vytas

On Feb 2, 2015, at 9:16 AM, Perry Bhandal notifications@github.com wrote:

"At least on my lab desktop, I'd be happy to just limit the number of processes if the computer is in use, if that isn't too much more complicated."

Definitely doable. We can make it use all cores by default, but you can override it in an ini file (similar to MAX_BUILD_CORES in build.conf).

"The one catch is the learning library currently doesn't support JSON, so it might be worth the effort simplify the file structure - my more complex learning runs currently result in up to 110 files (see resources/src/learningSpines/TetrahedralComplex/logs), which is unnecessary."

If we could handle the existing file structure, would that be beneficial? I mentioned JSON as I was under the impression that's what you are already using, but we can certainly support the existing structure without much trouble. Whatever you prefer here is fine -- it's no more difficult to transfer 100 files than it is to transfer 1.

I'm a bit fuzzy on the exact details of how the trial part of this works. Is it simple enough that you can write up a brief walkthrough so I can try the genetic algorithms myself? I only need a few cycles showing what the starting data is, what goes out to the sim from Python, what the sim does, what the sim outputs, how that's then read back into Python and how that affects the next trial.

— Reply to this email directly or view it on GitHub.

Vytas SunSpiral
Dynamic Tensegrity Robotics Lab cell- 510-847-4600 Office: 650-604-4363 N269 Rm. 100

Stinger Ghaffarian Technologies Intelligent Robotics Group NASA Ames Research Center

I will not tiptoe cautiously through life only to arrive safely at death.

vsunspiral commented 9 years ago

also, there should be existing tools for farming out and controlling many identical (but parameterized) processes on many machines.

In some ways, I suspect that Map-Reduce might do the trick, if you can frame the problem correctly as a data reduction problem (there is some search space we are reducing….).

Ideally we use existing tools rather than building our own for this. This is something that has been done many times.

vytas

On Feb 2, 2015, at 8:33 PM, Perry Bhandal notifications@github.com wrote:

Excellent, thanks Brian.

I don't know enough about what you're doing to weigh in on whether we should go with approach one or two. That said, the Python version is easy to implement (I'll just need to provide you facilities for bringing data in from tests pushing data out to tests). I'll likely begin with that, then we can explore whether having it done using existing libraries/C++ is worth adding as well.

— Reply to this email directly or view it on GitHub.

Vytas SunSpiral
Dynamic Tensegrity Robotics Lab cell- 510-847-4600 Office: 650-604-4363 N269 Rm. 100

Stinger Ghaffarian Technologies Intelligent Robotics Group NASA Ames Research Center

I will not tiptoe cautiously through life only to arrive safely at death.

PerryBhandal commented 9 years ago

As far as using existing libraries, no argument there! Dispy (dispy.sourceforge.net) is one I've used before, but there certainly could be better choices out there

It's tough for me to comment on sferes2/MapReduce specifically related to machine learning, but my main concern with MapReduce is that learning libraries in NTRT will have to conform to it (while a solution like Dispy will be the other way around). If the simulation of structures in NTRT is the more expensive component of learning (as opposed to generating the structures to test), then MapReduce will be less useful. Additionally, there is a decent learning curve to getting started with, say, Hadoop. Testing it is tougher, and a user needs to know the difference between core and task nodes, how to configure HDFS, etc.

On the other hand, if Dipsy is used the transition from a single system single process approach to learning can easily be distributed to multiple cores, or even to multiple systems. It'll also be much easier to benefit from leftover computing power if we use Dipsy: we can include a script with NTRT that users can optionally run which will use their system for trials when idle.

In any event, regardless of which approach is used presumably there will need to be a master with reasonable up/down which is easily accessible (i.e. either it has a public IP, or has the authority to forward ports if it's NATted.) If some space is needed to host a master, let me know and I'd be happy to provide a Debian container on the box that hosts BuildBot.

vsunspiral commented 9 years ago

Dispy looks great from a quick glance. Exactly what we need for easy farming out of an essentially stand-alone process. I like that it generalizes from multi-core, to multi-machines, and supports amazon cloud too.

Brian -- we could get you setup with Ames credentials and remote login, and then we could set up a cluster of machines here in the lab -- they are just sitting on the shelves right now. I think they are all quad core, and I bet we could get 6-8 or them dedicated do you during the spring -- though during the summer they might all be used by interns…… (or not).

If you get things working with Dispy on a single multi-core machine, that would be a good first step….

vytas

On Feb 4, 2015, at 1:55 AM, Perry Bhandal notifications@github.com wrote:

As far as using existing libraries, no argument there! Dispy (dispy.sourceforge.net) is one I've used before, but there certainly could be better choices out there

It's tough for me to comment on sferes2/MapReduce specifically related to machine learning, but my main concern with MapReduce is that learning libraries in NTRT will have to conform to it (while a solution like Dispy will be the other way around). If the simulation of structures in NTRT is the more expensive component of learning (as opposed to generating the structures to test), then MapReduce will be less useful. Additionally, there is a decent learning curve to getting started with, say, Hadoop. Testing it is tougher, and a user needs to know the difference between core and task nodes, how to configure HDFS, etc.

On the other hand, if Dipsy is used the transition from a single system single process approach to learning can easily be distributed to multiple cores, or even to multiple systems. It'll also be much easier to benefit from leftover computing power if we use Dipsy: we can include a script with NTRT that users can optionally run which will use their system for trials when idle.

In any event, regardless of which approach is used presumably there will need to be a master with reasonable up/down which is easily accessible (i.e. either it has a public IP, or has the authority to forward ports if it's NATted.) If some space is needed to host a master, let me know and I'd be happy to provide a Debian container on the box that hosts BuildBot.

— Reply to this email directly or view it on GitHub.

Vytas SunSpiral
Dynamic Tensegrity Robotics Lab cell- 510-847-4600 Office: 650-604-4363 N269 Rm. 100

Stinger Ghaffarian Technologies Intelligent Robotics Group NASA Ames Research Center

I will not tiptoe cautiously through life only to arrive safely at death.

PerryBhandal commented 9 years ago

Another option is BOINC: http://boinc.berkeley.edu/

BOINC is more complex than Dispy/RPyC, but it's been designed for volunteer computing. That's useful as it already includes idle detection/enabling or disabling tasks on a system, and it provides a solid web interface while Dispy is entirely console based.

It'll be a week or two before I can look into it more thoroughly, but I'll post back once I'm done with a comparison of the two. From there we can determine which is a better fit for our needs.

PerryBhandal commented 9 years ago

Sorry for the delay in getting back to you about this, Brian. Been quite busy recently. Should ease up a fair bit once my thesis is done.

PerryBhandal commented 9 years ago

Sorry for being slow on this Brian -- end of the semester is keeping me super busy.

For clustering, would it pose much of a problem if we can't support any custom libraries for now? You would still be able to contain a job inside some git branch (and have custom code there that you build) but avoiding the headache of a separate setup will make this a fair bit easier.

brtietz commented 9 years ago

I'm not sure I fully understand the question, but I think just running from a branch of NTRT (with our existing libraries) will be fine if that's easier for you. Integration of a package like Dispy can wait if its a hassle.

And multi-process on a single machine is an 8x speedup over single process, so once you have that feel free to hold off on clustering for as long as you need in order to do it right.

Thanks, and good luck with the end of the semester!

On Fri, Apr 3, 2015 at 10:20 AM, Perry Bhandal notifications@github.com wrote:

Sorry for being slow on this Brian -- end of the semester is keeping me super busy.

For clustering, would it pose much of a problem if we can't support any custom libraries for now? You would still be able to contain a job inside some git branch (and have custom code there that you build) but avoiding the headache of a separate setup will make this a fair bit easier.

— Reply to this email directly or view it on GitHub https://github.com/NASA-Tensegrity-Robotics-Toolkit/NTRTsim/issues/133#issuecomment-89366229 .

PerryBhandal commented 9 years ago

The multi-process single machine will almost certainly be done this weekend. I'll send a message to the list (or in this thread) as soon as it's good to go!

PerryBhandal commented 9 years ago

Brian, The concurrent_processor branch has a working multi-threaded version of your job code.

Few notes:

1) You'll need to install the python psutil package. I should be doing this using the proper virtualenv stuff Ryan set up, but given that this solution will be thrown out when we complete clustering, you'll need to just install it globally for now using "sudo pip install psutil"

2) There's a second parameter now to your NTRTJobMaster script which is the number of concurrent processes you wish to run. This shouldn't exceed the total number of cores in your system (might want to do cores - 1 if you plan to work on the system as it's running jobs).

3) I'm not properly cleaning up the zombie processes that are forked. As a result, you're limited to about 30,000 trials in a single call to NTRTJobMaster.py. I'm assuming that's not a problem, but let me know if it is. The reason for the ~30k limit is that by not handling the forks cleanly, we're not cleaning up the PIDs until the entire Python script exits. Fixing this is possible if that limit poses a problem. I didn't bother for now as it would require additional time and this is just intended as a temporary stopgap until I can finish full-fledged clustering.

4) You'll want to look at the beginTrial method in BrianMaster. Specifically at line ~220 you'll need to modify the code there to process the results as you would like.

5) The spawned jobs now have their output written to /dev/null. I did this so you could have some helpful logging to see how the overall jobs are progressing. If you want to modify this, take a look at line 267 in NTRTJobMaster.py. You'll want to remove the "stdout=devNull" parameter to subprocess.call.

Hopefully this all works without a hitch for you, but if it doesn't please let me know!

Perry

brtietz commented 9 years ago

If anyone else is going to attempt this in the near future, note that a prerequisite is installing the Python package installer. This can be done with sudo apt-get install python-pip

brtietz commented 9 years ago

Also, if your Python install is really bare bones, you may need python-dev as well. If you get this error while trying to install psutil using pip:

psutil/_psutil_linux.c:12:20: fatal error: Python.h: No such file or directory compilation terminated.

installing python-dev is the solution.

brtietz commented 8 years ago

Any reason we can't close this issue? Might be worth moving some of the conversation here into docs first, but otherwise I think this is done.

NASA-Tensegrity-Robotics-Toolkit / NTRTsim

Multithreaded Learning #133

Making it distributable locally (single system)

Design Decision One: Separate threads, or separate processes?

A locally distributable solution should look similar to a globally distributable solution

Timeline