choderalab / openmmmcmc

A general Markov chain Monte Carlo framework for OpenMM.

Distributed Computing Frameworks #13

Open pgrinaway opened 7 years ago

pgrinaway commented 7 years ago

We've been talking about using modern distributed computing frameworks to simplify taking advantage of clusters, clouds, etc.

Some thoughts:

There's one main issue that I see with directly using Celery: the framework's ease of use rests on an assumption that the workers are basically identical, which is not the case if some have hardware-dependent (Context) caches. There may be a way around this, but it's not clear how complicated that is.

There's a lot to say, so I've kept this brief. Let me know what is confusing or missing.

jhprinz commented 7 years ago

Cool. I am very interested in the development of this. I have also been trying to adapt openpathsampling to use Celery and have a little example working. It is currently very simple and uses a Redis server as both broker and backend.

It also uses the JSON pickling capabilities of OPS objects, which made it very simple to dispatch trajectory jobs and get the full trajectory back.

We might also want to use this for adaptive sampling, but currently we are pursuing another approach using radical.pilot.

One last thing that @pgrinaway might comment on is whether the backend could be persistent rather than only keeping results until they are returned to the recipient. I had something like siegetank in mind, where the returned trajectory is kept in the DB for later use and for other projects.

It turns out that the OPS pickling and its use of UUIDs are fully compatible with NoSQL DBs like MongoDB. I have not tried this, but if we could change the backend behaviour we would get the siegetank concept for free...
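
To make the idea concrete, here is a minimal sketch of that kind of persistent store: after a task's result comes back, the JSON-serialized trajectory is written to MongoDB keyed by its UUID, so it outlives the Celery result backend. The database/collection names and the `store_result` helper are placeholders, not an existing API.

```python
# Hypothetical sketch: persist task results beyond the Celery backend by
# writing the JSON-serialized trajectory to MongoDB, keyed by its UUID.
# Database/collection names and the store_result helper are illustrative only.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
trajectories = client['ops_storage']['trajectories']

def store_result(uuid, trajectory_json):
    # Upsert so that re-running a task does not duplicate the stored trajectory.
    trajectories.replace_one(
        {'_id': uuid},
        {'_id': uuid, 'trajectory': trajectory_json},
        upsert=True,
    )
```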

BTW... Happy new year

jchodera commented 7 years ago

Dask is another framework we may want to take a look at. It supports a number of different idioms for parallelization as well as distributed data structures.
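
A minimal sketch of how a batch of propagation tasks could be farmed out with dask.distributed; the `propagate` task body and the scheduler address are placeholders for whatever the MCMC layer would define.

```python
# Rough sketch with dask.distributed; propagate() is a placeholder task.
from dask.distributed import Client

def propagate(task_id, n_steps):
    # Placeholder: deserialize a System, create a Context, run n_steps of
    # dynamics, and return the final positions (details omitted here).
    return task_id

client = Client()   # local cluster; or Client('tcp://scheduler:8786') for a remote scheduler
futures = [client.submit(propagate, i, 1000) for i in range(10)]   # one task per replica
results = client.gather(futures)                                   # block until all tasks finish
```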

There's one main issue that I see with directly using Celery: the framework's ease of use rests on an assumption that the workers are basically identical, which is not the case if some have hardware-dependent (Context) caches. There may be a way around this, but it's not clear how complicated that is.

For initial use, this may not be a limitation: An initial implementation doesn't even need to use Context caching; the benefit of a scalable distributed system could still greatly outweigh the overhead of a few seconds per task, especially if the tasks take O(10) seconds or more. Later implementations could add a Context cache for the GPU that the worker thread is bound to.

Note that OpenMM is actually pretty memory-efficient. A GTX-TITAN-X (Maxwell) with 12GB memory can cache 67 unmodified SrcExplicit testsystems.
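
That kind of measurement can be reproduced with something like the sketch below (not necessarily the exact script used; behavior on memory exhaustion is platform-dependent): keep creating CUDA Contexts for the SrcExplicit test system until creation fails.

```python
# Rough sketch: count how many SrcExplicit Contexts fit on one GPU.
from simtk import openmm, unit
from openmmtools import testsystems

platform = openmm.Platform.getPlatformByName('CUDA')
contexts = []
try:
    while True:
        testsystem = testsystems.SrcExplicit()
        integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
        context = openmm.Context(testsystem.system, integrator, platform)
        context.setPositions(testsystem.positions)
        contexts.append((context, integrator))   # keep references so nothing is freed
        print('cached %d contexts' % len(contexts))
except Exception as exception:
    print('stopped after %d contexts: %s' % (len(contexts), exception))
```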

pgrinaway commented 7 years ago

Later implementations could add a Context cache for the GPU that the worker thread is bound to.

We should determine how likely we are to want this. If we are likely to want it, we want to take into account how difficult this might be in various frameworks.

A GTX-TITAN-X (Maxwell) with 12GB memory can cache 67 unmodified SrcExplicit testsystems.

Remarkable! How does this scale with multiple forces? I think @Lnaden pointed out to me that things can get hairy not only with the number of particles, but also the number of forces and perhaps something else that I now forget.

Lnaden commented 7 years ago

Context creation time is a function of the number of particles, the number of forces, and the complexity of the Custom*Force energy expressions. That said, CustomNonbondedForce is much more expensive than CustomBondForce because of how many particles are involved in the force itself. I don't have any hard numbers, only anecdotal experience.
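
To get a feel for it, here is a rough sketch of how one could time Context creation with and without an added CustomNonbondedForce; the test system and energy expression are arbitrary choices for illustration.

```python
# Rough sketch: compare Context creation time before and after adding a
# CustomNonbondedForce with an arbitrary energy expression.
import time
from simtk import openmm, unit
from openmmtools import testsystems

def time_context_creation(system):
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    initial_time = time.time()
    context = openmm.Context(system, integrator)
    elapsed_time = time.time() - initial_time
    del context, integrator
    return elapsed_time

testsystem = testsystems.LennardJonesFluid()
print('plain system      : %.2f s' % time_context_creation(testsystem.system))

# Add a custom force that duplicates a Lennard-Jones interaction.
custom = openmm.CustomNonbondedForce('4*eps*((sigma/r)^12 - (sigma/r)^6); sigma=0.3; eps=1.0')
custom.setNonbondedMethod(openmm.CustomNonbondedForce.CutoffPeriodic)
custom.setCutoffDistance(1.0 * unit.nanometers)
for particle_index in range(testsystem.system.getNumParticles()):
    custom.addParticle([])
testsystem.system.addForce(custom)
print('with custom force : %.2f s' % time_context_creation(testsystem.system))
```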

jchodera commented 7 years ago

We should determine how likely we are to want this. If we are likely to want it, we want to take into account how difficult this might be in various frameworks.

Definitely. There are a few levels of sophistication:

  1. No caching
  2. Each process manages its own caching
  3. Some sort of smart pairing of tasks with workers who already have things cached

We may not need all three levels of capabilities, but it would be good to figure out what each of the distributed computing frameworks support.
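
For level 2, a per-process cache can be as simple as a dict keyed by some system identifier, assuming each worker process is already pinned to one GPU (e.g. via CUDA_VISIBLE_DEVICES). A minimal sketch, with the key scheme left open:

```python
# Minimal sketch of a per-worker Context cache (level 2). Assumes the worker
# process is already bound to a single GPU, e.g. via CUDA_VISIBLE_DEVICES.
from simtk import openmm, unit

_context_cache = {}   # maps system_key -> (Context, Integrator)

def get_context(system_key, system):
    """Return a cached Context for this system, creating it on first use."""
    if system_key not in _context_cache:
        integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
        platform = openmm.Platform.getPlatformByName('CUDA')
        context = openmm.Context(system, integrator, platform)
        _context_cache[system_key] = (context, integrator)
    return _context_cache[system_key][0]
```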

jchodera commented 7 years ago

Remarkable! How does this scale with multiple forces? I think @Lnaden pointed out to me that things can get hairy not only with the number of particles, but also the number of forces and perhaps something else that I now forget.

I can run a quick benchmark. My original test just fed a bunch of openmmtools.testsystems through, but should I try creating alchemically-modified versions? Are there other kinds of testsystems that I should evaluate to see how many fit in different GPUs?
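
If alchemically modified variants would be useful to include, something along these lines could generate them with the openmmtools.alchemy factory (the test system and alchemical region chosen here are just examples):

```python
# Sketch: build an alchemically modified variant of a test system to feed
# into the same Context-counting benchmark. The region is illustrative.
from openmmtools import testsystems, alchemy

testsystem = testsystems.AlanineDipeptideExplicit()
region = alchemy.AlchemicalRegion(alchemical_atoms=list(range(22)))   # the dipeptide atoms
factory = alchemy.AbsoluteAlchemicalFactory()
alchemical_system = factory.create_alchemical_system(testsystem.system, region)
```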

jhprinz commented 7 years ago

Definitely. There are a few levels of sophistication:

  1. No caching
  2. Each process manages its own caching
  3. Some sort of smart pairing of tasks with workers who already have things cached

If you want to use Celery then

  1. is very simple. You add a UUID to the object that contains the information on how to generate the Context on the worker and just check whether that UUID already exists. I am using something like this for the OPS example; there the caching is done inside the JSON unpickling, which reuses objects still in the cache. This does not prevent objects from being transferred over the network to the worker, but existing ones will not be recreated. This is especially good for engines, i.e. contexts.

  2. This one is tough and I only have a partial solution. The problem is this: you submit a message with your simulation details to the broker, and it gets stored in a queue on that broker. A worker can subscribe to certain messages, which it will receive and consume. That means the decision logic of "which worker receives which task" lives in the message broker, and brokers have predefined ways to do that. From what I know, you can have multiple queues, and you can filter messages received from a specific queue on keywords called topics. (See a RabbitMQ tutorial.)

    The bindings of a worker to a queue with specific filtering (topics) can be changed at runtime, so these bindings need to match the status of the cache. Celery allows you to change this at runtime, which would let you make a worker accept only tasks for specific contexts -- the ones currently in its cache.

    What I could not figure out is how to make the system prefer specific queues over others. Given the choice, you want to assign a task that uses context xyz to a worker that already has context xyz, even though other workers might be free, and there is no mechanism for this. Priorities could work here.

    One way to achieve this is to assign work to workers directly. Each worker can have a name, and if you create a bunch of these you can set up routes to them individually (see the sketch after this list). Then you need to distribute all work manually and keep track of which worker is finished. This makes sense for schemes like RepEx, where you have a lot of similar-length tasks that need to synchronize.
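
A minimal sketch of that direct-routing idea with Celery; the broker/backend URLs, queue names, and the task body are placeholders, not an existing implementation.

```python
# Sketch: route tasks to named per-worker queues with Celery.
from celery import Celery

app = Celery('mcmc',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')

@app.task
def propagate(context_uuid, serialized_state, n_steps):
    # Look up (or create) the Context for context_uuid in the worker-local
    # cache and run dynamics; details omitted here.
    return context_uuid

# Start each worker on its own queue, e.g.:
#   celery -A tasks worker -Q worker-0 -n worker-0@%h
# then send a task directly to the worker known to hold context 'xyz':
result = propagate.apply_async(args=('xyz', None, 1000), queue='worker-0')
```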

jhprinz commented 7 years ago

Quick comment on Dask vs. Celery: my impression is that Celery is better for running trajectories, where nodes run completely independently of each other, and Dask is better for analysis tasks, where nodes will communicate and exchange data.

pgrinaway commented 7 years ago

Thanks @jhprinz! I was thinking the same thing about 3. My initial impression is that we could use the features of RabbitMQ to do what you describe, but I am also not quite sure of the best way to deal with this:

Given the choice, you want to assign a task that uses context xyz to a worker that already has context xyz, even though other workers might be free, and there is no mechanism for this.

Certainly, a few seconds of overhead to create a new Context is warranted if it allows fuller use of resources. You're right that assigning to workers directly would work, though I wonder if there's a straightforward way to automate this (perhaps, as you say, using priorities). I'll think about this a bit more too.