Closed breezykermo closed 4 years ago
@RaphiePS here's an issue for parallelisation as discussed in #10 . Would love your help on this one!
Here are your suggestions reprinted; I'll address them in subsequent comments!
- Create a superclass for both `Selector` and `Analyser` that encapsulates all this indexing/parallel-processing/logging/resuming functionality. Both `Selector` and `Analyser` need to follow roughly the same steps (figure out what needs to be done, do that in parallel, and record it being done), and that code could be shared in a `Processor` superclass.
- Python's GIL means multithreaded code isn't actually running in parallel 😠 We'd probably want to use `multiprocessing`. It's much harder to communicate between these workers (no shared memory), but as you said, these tasks are embarrassingly parallel; they shouldn't really need to communicate anyway.
- I'd probably move from CSV map files to SQLite. With multiple processes reading and writing to the map file, things could get pretty dicey without synchronization. SQLite handles all that for you, letting you read and write without locking the file. I've quite enjoyed using `dataset` for this sort of thing. It's at a happy medium between "write all your SQL queries by hand" and "deal with a massively bloated ORM."
- Resumable + parallel job scheduling is a pretty common problem, and there might be an existing library that solves it well and adds other goodies. I've seen a lot of Redis-based frameworks, but that would be overkill. I'll see if I can find a simpler solution that would be easy to integrate.
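As a hedged sketch of the SQLite idea, here's what a map table could look like using only the stdlib `sqlite3` module (the `dataset` library wraps similar calls behind a friendlier API; the table and column names are illustrative, not anything mtriage currently uses):

```python
import sqlite3

# Illustrative map table: one row per element, tracking completion.
conn = sqlite3.connect(":memory:")  # a real run would use a file path
conn.execute(
    "CREATE TABLE IF NOT EXISTS element_map (id TEXT PRIMARY KEY, done INTEGER DEFAULT 0)"
)

def mark_done(element_id):
    # INSERT OR REPLACE keeps the call idempotent across resumed runs
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO element_map (id, done) VALUES (?, 1)",
            (element_id,),
        )

def is_done(element_id):
    row = conn.execute(
        "SELECT done FROM element_map WHERE id = ?", (element_id,)
    ).fetchone()
    return bool(row and row[0])

mark_done("el_001")
print(is_done("el_001"))  # True
print(is_done("el_002"))  # False
```

SQLite serializes writes internally, so multiple worker processes can record completion without the hand-rolled locking a shared CSV would need.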
I agree that `multiprocessing` is the most appropriate here. The notion of an 'element' in mtriage is constructed specifically as an independent media or set of media-- so if the analysis or retrieval of an element starts begging for shared memory, it's a sign that the media probably isn't appropriate to be treated as an element in the first place. :bowtie:
I like the idea of finding a simpler solution than redis. When I first built mtriage it actually used a Redis store to keep track of elements, but at some point I removed it and just used the filesystem to avoid technical debt down the line. Agree that we might run into trouble with CSV files with parallelisation, though.
I haven't used dataset before personally, but I know @pudo and would love to dig into it. Hi Fred!
The parallelising superclass is an interesting suggestion. I'll admit that I'm not 100% on how it could work. I would have thought that trying to comport both selector and analyser to a superclass might be doing too much all at once. Personally I would start out treating the parallelisation separately, and then see what can be shared after the implementations are working.
But that's just me! If you're confident about it, go for it and then maybe I'll be able to see its light.
Ray looks really compelling! Performance-wise, it seems to help out most when there's shared data or communication, neither of which we have to deal with, at least for the time being. What I'm really interested in is their "developer ergonomics."
One of my big design goals for the parallelism feature is that the user shouldn't know or care about parallelism. When I say "user" I mean a person who wants to write a custom Analyser or Selector without digging into the MT framework itself. They should write their code as if it's running only one copy, and MT should orchestrate everything behind the scenes. But things get kind of weird with `multiprocessing`. For example, say we run an Analyser like this in parallel:
```python
class SomeAnalyser(Analyser):
    def setup_run(self):
        self.instance_variable = SomeValue

    # this will be run in parallel in a multiprocessing.Pool
    def run_element(self, element):
        return self.do_something_with(self.instance_variable)

    # this method would be inside the Analyser superclass,
    # not re-implemented by each individual Analyser
    def run(self):
        self.setup_run()
        pool = multiprocessing.Pool(4)
        elements = [1, 2, 3]
        for result in pool.imap_unordered(self.run_element, elements):
            print(result)
```
When you run an instance method (in this case `run_element`) with `multiprocessing`, it will serialize the entire instance using `pickle` and ship it off to a worker process. The worker will deserialize the instance, run the method, and send back a pickled result. This is all fine and dandy until you have something like:
```python
def setup_run(self):
    self.instance_variable = SomeValueThatCantBePickled
```
There are lots of objects that can't be pickled - files, locks, database handles - and it'll blow up if you try to run it through multiprocessing. Yuck.
We could tell the user "just make sure you can pickle everything!" but that's super lame - it means they have to partially take on the burden of parallelization, and there might be necessary libraries that just can't be pickled.
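As a quick illustration of the failure mode (a standalone snippet, not mtriage code), pickling a lock raises immediately:

```python
import pickle
import threading

lock = threading.Lock()
try:
    pickle.dumps(lock)
except TypeError as err:
    # multiprocessing hits this same error when it serializes an
    # instance that holds a lock, file handle, or similar object
    print("pickle failed:", err)
```

The user would see this traceback deep inside `multiprocessing` internals, far from the `setup_run` line that actually caused it.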
Another solution might be: instead of executing `setup_run` once in the main thread, execute it inside the worker process, once for each element. The code would look like this:
```python
# this function will be executed inside the worker process
def run_element_wrapper(element):
    brand_new_instance = SomeAnalyser()
    brand_new_instance.setup_run()
    return brand_new_instance.run_element(element)

def run(self):
    # note: setup_run is no longer called here on the main thread
    pool = multiprocessing.Pool(4)
    elements = [1, 2, 3]
    for result in pool.imap_unordered(run_element_wrapper, elements):
        print(result)
```
This works on a technical level and solves the "pickle problem," because unpicklable values no longer need to be sent to the worker process. Instead of serializing the existing instance and sending it to the worker, we just create a new instance in each worker, eliminating the pickling.
But I think it's pretty confusing for users. This would mean there are some methods in an `Analyser`/`Selector` that should be called on the main thread (`__init__` and `index`), and some that should be called inside a worker process (`setup_run` and `run_element`). It'd be impossible to know which is which without referring to the docs. Mistakes would be subtle and hard to catch. It's too implicit.
A much more explicit version might have the user put all the worker-run code in a separate class:
```python
class SomeAnalyser(Analyser):
    # this, and everything in the top-level SomeAnalyser class,
    # will be run on the main thread
    def __init__(self):
        self.other_main_thread_method()
        self.set_runner_class(SomeRunner)

    # also run on main thread
    def other_main_thread_method(self):
        pass

    # this class and everything inside it will be run in a worker
    class SomeRunner(Runner):
        def setup_run(self):
            self.instance_variable = SomeValueThatCantBePickled

        def run_element(self, element):
            return self.do_something_with(self.instance_variable)
```
This makes it much more clear what will be run on the main thread, and what will be run on a worker. It should cut down on confusion, but it's also... really ugly. Who wants classes inside classes? Double yuck.
So, I'm still trying to figure out this balance: making it very clear what is running where, without introducing any extra effort on the part of the user, and without ending up a total mess. Does this make any sense, or am I way off in the parallel weeds here? Let me know if you have any thoughts!
Sorry for the delay here. Thank you for this reasoning, it's all excellent thinking. Mtriage's proposed value-add is exactly as you say, keeping selector/analyser logic separate from orchestration detail, so this is right on target.
One partial solution that might be reasonable for a first implementation using `multiprocessing` is to have parallelisation as a flag: enable it by default, and revert to serial execution if there is something in the implementation that can't be pickled.
One of the primary aims in #44, which should be merged today, is to limit the data that is needed for user-specified methods entirely to what is passed as arguments. This may partially address the issue you raise in the 2nd solution, in the sense that the only methods on the class that a user need worry about are `index`, `retrieve_element`, and `analyse_element` (note that these exposed method names have changed in #44 to better express their function). I agree that this could get hairy further down the line, if this assumption is fragmented as the framework becomes more complex.
I don't like the 3rd solution for the same reasons you suggest-- and more particularly because it means that the user has to deal with the extra burden of parallel abstraction, throwing mtriage's value-add into question.
Given the current state of things, I would say solution 2 seems the best way forward. This also concretises the design decision that users should have to deal with nothing but the arguments that are passed to exposed methods. Does this make sense to you, or am I missing something?
The auto-fallback-on-pickle-failure is brilliant, but thanks to #44, I don't think it'll be necessary.
As I see it, the core problem with `multiprocessing` was mutable state stored on `Analyser`/`Selector` instances. With the much more functional approach in #44, it's a lot easier to reason about where data is stored and how it'll be passed into user code; it removes a lot of the uncertainty I was worried about.
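To sketch why the functional shape helps (names here are illustrative, not the actual #44 API): when each element arrives as an argument and the result is returned, only plain picklable values ever cross the process boundary, and there is no instance state to serialize at all.

```python
import multiprocessing

# Illustrative stand-in for a user-written, side-effect-free element
# method: everything it needs arrives as an argument, nothing lives
# on an instance, so only `element` and the return value are pickled.
def analyse_element(element):
    return element * 2

if __name__ == "__main__":
    elements = [1, 2, 3]
    with multiprocessing.Pool(2) as pool:
        results = pool.map(analyse_element, elements)
    print(results)  # [2, 4, 6]
```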
I've got a bit of a busy week ahead, but with this nicely cleared path, I should have an implementation this weekend. Thanks again!
The APIs of both selectors and analysers have been designed with parallelisation in mind.

- Selectors expose a `retrieve_row` function that operates element-wise. (This should probably be renamed to something like `retrieve_element` for consistency's sake.)
- Analysers expose a `run_element` function that also operates element-wise.

Mtriage should optionally allow parallel processing of both of these methods, as both the retrieval and analysis of any given element is sandboxed.
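A hedged sketch of how an optional parallel flag might drive those element-wise methods, with the serial fallback discussed above (the function and parameter names are assumptions for illustration, not mtriage's actual API):

```python
import multiprocessing

# Illustrative element-wise worker; in mtriage this would be the
# user's retrieve_element / analyse_element, invoked via a
# module-level wrapper so nothing unpicklable is serialized.
def process_element(element):
    return element.upper()

def run_all(elements, parallel=True, workers=4):
    # Serial path keeps behaviour identical when parallelism is
    # disabled (or reverted after a pickling failure).
    if not parallel:
        return [process_element(el) for el in elements]
    with multiprocessing.Pool(workers) as pool:
        return pool.map(process_element, elements)

if __name__ == "__main__":
    print(run_all(["a", "b"], parallel=False))  # ['A', 'B']
```

Because both paths call the same element-wise function, a user's code never needs to know which one ran it.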