forensic-architecture / mtriage

framework to orchestrate the download and analysis of media

parallelise selecting and analysis #30

Closed breezykermo closed 4 years ago

breezykermo commented 5 years ago

The APIs of both selectors and analysers have been designed with parallelisation in mind.

Selectors expose a retrieve_row function that operates element-wise. (This should probably be renamed to something like retrieve_element for consistency's sake.)

Analysers expose a run_element function that also operates element-wise.

Mtriage should optionally allow parallel processing of both of these methods, as the retrieval and analysis of any given element are sandboxed.
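To make that concrete, here is a minimal sketch of what a parallel map over one of these per-element methods could look like (the stub class and helper are invented for illustration, not real mtriage code):

import multiprocessing

class Analyser:
    """Stub standing in for the mtriage base class."""
    def run_element(self, element):
        # each call touches exactly one element, so calls are
        # independent and safe to distribute across processes
        return element

def run_all(analyser, elements, workers=4):
    # hypothetical helper: farm the per-element method out to a pool
    with multiprocessing.Pool(workers) as pool:
        return list(pool.imap_unordered(analyser.run_element, elements))

if __name__ == "__main__":
    print(run_all(Analyser(), [1, 2, 3]))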

breezykermo commented 5 years ago

@RaphiePS here's an issue for parallelisation as discussed in #10 . Would love your help on this one!

Here are your suggestions reprinted, I'll address in subsequent comments!

breezykermo commented 5 years ago

I agree that multiprocessing is the most appropriate approach here. The notion of an 'element' in mtriage is constructed specifically as an independent medium or set of media -- so if the analysis or retrieval of an element starts begging for shared memory, it's a sign that the media probably isn't appropriate to be treated as an element in the first place. :bowtie:

breezykermo commented 5 years ago

I like the idea of finding a simpler solution than Redis. When I first built mtriage it actually used a Redis store to keep track of elements, but at some point I removed it and just used the filesystem to avoid technical debt down the line. I agree that we might run into trouble with CSV files under parallelisation, though.

I haven't used dataset before personally, but I know @pudo and would love to dig into it. Hi Fred!
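For reference, dataset wraps a SQL database in a dict-like API. A rough sketch of what element bookkeeping could look like with it (the database URL, table, and field names here are invented):

import dataset

# connect to (or create) a local SQLite file
db = dataset.connect("sqlite:///mtriage.db")
elements = db["elements"]  # the table is created lazily

elements.insert({"id": "el1", "status": "retrieved"})
retrieved = list(elements.find(status="retrieved"))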

breezykermo commented 5 years ago

The parallelising superclass is an interesting suggestion. I'll admit that I'm not 100% sure how it could work. I would have thought that trying to conform both selector and analyser to a single superclass might be doing too much all at once. Personally I would start out treating the parallelisation of each separately, and then see what can be shared after the implementations are working.

But that's just me! If you're confident about it, go for it and then maybe I'll be able to see its light.
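For the record, one minimal shape such a superclass might take (a pure sketch with invented names, assuming both classes expose a single per-element method):

import multiprocessing

class ParallelRunner:
    """Hypothetical shared base; subclasses name their per-element method."""
    element_method = None  # e.g. "retrieve_element" or "run_element"

    def run_all(self, elements, workers=4):
        # look up the subclass's per-element method and map over it
        func = getattr(self, self.element_method)
        with multiprocessing.Pool(workers) as pool:
            return list(pool.imap_unordered(func, elements))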

breezykermo commented 5 years ago

Potentially interesting: https://towardsdatascience.com/10x-faster-parallel-python-without-python-multiprocessing-e5017c93cce1

RaphiePS commented 5 years ago

Ray looks really compelling! Performance-wise, it seems to help most when there's shared data or communication, neither of which we have to deal with, at least for the time being. What I'm really interested in is their "developer ergonomics."
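For context, Ray's task API looks roughly like this (a minimal sketch along the lines of their docs, not mtriage code):

import ray

ray.init()

@ray.remote
def run_element(element):
    # runs in a separate worker process managed by Ray
    return element * 2

futures = [run_element.remote(e) for e in [1, 2, 3]]
print(ray.get(futures))  # -> [2, 4, 6]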

One of my big design goals for the parallelism feature is that the user shouldn't know or care about parallelism. When I say "user" I mean a person who wants to write a custom Analyser or Selector without digging into the MT framework itself. They should write their code as if it's running only one copy, and MT should orchestrate everything behind the scenes. But things get kind of weird with multiprocessing. For example, say we run an Analyser like this in parallel:

import multiprocessing

class SomeAnalyser(Analyser):
    def setup_run(self):
        self.instance_variable = SomeValue

    # this will be run in parallel in a multiprocessing.Pool
    def run_element(self, element):
        return self.do_something_with(self.instance_variable)

    # this method would be inside the Analyser superclass,
    # not re-implemented by each individual Analyser
    def run(self):
        self.setup_run()
        pool = multiprocessing.Pool(4)
        elements = [1, 2, 3]
        for result in pool.imap_unordered(self.run_element, elements):
            print(result)

When you run an instance method (in this case run_element) with multiprocessing, it will serialize the entire instance using pickle and ship it off to a worker process. The worker will deserialize the instance, run the method, and send back a pickled result. This is all fine and dandy until you have something like:

def setup_run(self):
    self.instance_variable = SomeValueThatCantBePickled

There are lots of objects that can't be pickled - files, locks, database handles - and things will blow up if you try to send one through multiprocessing. Yuck.
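To make the failure mode concrete, here's a tiny reproduction (invented class, standard library only):

import multiprocessing
import threading

class BrokenAnalyser:
    def setup_run(self):
        self.instance_variable = threading.Lock()  # locks can't be pickled

    def run_element(self, element):
        return element

if __name__ == "__main__":
    analyser = BrokenAnalyser()
    analyser.setup_run()
    with multiprocessing.Pool(2) as pool:
        # raises TypeError: cannot pickle '_thread.lock' object,
        # because pickling the bound method pickles the whole instance
        print(list(pool.imap_unordered(analyser.run_element, [1, 2])))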

We could tell the user "just make sure you can pickle everything!" but that's super lame - it means they have to partially take on the burden of parallelization, and there might be necessary libraries that just can't be pickled.

Another solution might be: instead of executing setup_run once in the main thread, execute it inside the worker process, once for each element. The code would look like this:

# this function will be executed inside the worker process
def run_element_wrapper(element):
    brand_new_instance = SomeAnalyser()
    brand_new_instance.setup_run()
    return brand_new_instance.run_element(element)

# note that run no longer calls setup_run on the main thread;
# the wrapper above handles it per element
def run(self):
    pool = multiprocessing.Pool(4)
    elements = [1, 2, 3]
    for result in pool.imap_unordered(run_element_wrapper, elements):
        print(result)

This works on a technical level and solves the "pickle problem," because unpicklable values no longer need to be sent to the worker process. Instead of serializing the existing instance and sending it to the worker, we just create a fresh instance inside the worker for each element, eliminating the pickling of the instance entirely.

But I think it's pretty confusing for users. This would mean there are some methods in an Analyser / Selector that are called on the main thread (__init__ and index), and some that are called inside a worker process (setup_run and run_element). It'd be impossible to know which is which without referring to the docs. Mistakes would be subtle and hard to catch. It's too implicit.

A much more explicit version might have the user put all the worker-run code in a separate class:

class SomeAnalyser(Analyser):
    # this, and everything in the top-level SomeAnalyser class,
    # will be run on the main thread
    def __init__(self):
        self.other_main_thread_method()
        # reference the nested class via the outer class, since a
        # plain SomeRunner isn't in scope when __init__ runs
        self.set_runner_class(SomeAnalyser.SomeRunner)

    # also run on main thread
    def other_main_thread_method(self):
        pass

    # this class and everything inside it will be run in a worker
    class SomeRunner(Runner):
        def setup_run(self):
            self.instance_variable = SomeValueThatCantBePickled

        def run_element(self, element):
            return self.do_something_with(self.instance_variable)

This makes it much clearer what will run on the main thread and what will run on a worker. It should cut down on confusion, but it's also... really ugly. Who wants classes inside classes? Double yuck.

So, I'm still trying to figure out this balance: making it very clear what is running where, without introducing any extra effort on the part of the user, and without ending up a total mess. Does this make any sense, or am I way off in the parallel weeds here? Let me know if you have any thoughts!

breezykermo commented 5 years ago

Sorry for the delay here. Thank you for this reasoning; it's all excellent thinking. Mtriage's proposed value-add is exactly as you say, keeping selector/analyser logic separate from orchestration detail, so this is right on target.

One partial solution that might be reasonable for a first implementation using multiprocessing is to make parallelisation a flag, enable it by default, and fall back to serial execution if something in the implementation can't be pickled.
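A rough sketch of that fallback idea (the helper name and its placement are invented here):

import multiprocessing
import pickle

def run_elements(analyser, elements, workers=4):
    # hypothetical helper: probe whether the instance can reach a
    # worker process at all, and quietly fall back to serial if not
    try:
        pickle.dumps(analyser)
    except Exception:
        return [analyser.run_element(el) for el in elements]
    with multiprocessing.Pool(workers) as pool:
        return list(pool.imap_unordered(analyser.run_element, elements))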

One of the primary aims in #44, which should be merged today, is to limit the data needed by user-specified methods entirely to what is passed as arguments. This may partially address the issue you raise with the 2nd solution, in the sense that the only methods on the class a user needs to worry about are index, retrieve_element, and analyse_element (note that these exposed method names have changed in #44 to better express their function). I agree that this could get hairy further down the line if that assumption fragments as the framework becomes more complex.
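Concretely, a user-written method after #44 might look something like this (hypothetical analyser and stub base class; the point is that everything it needs arrives as arguments):

import hashlib

class Analyser:
    """Stub standing in for the framework base class."""

class HashAnalyser(Analyser):
    def analyse_element(self, element, config):
        # no state is read from or written to self, so the method
        # can be shipped to a worker without pickling surprises
        with open(element["path"], "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()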

I don't like the 3rd solution for the same reasons you suggest -- and more particularly because it means the user has to take on the extra burden of the parallel abstraction, throwing mtriage's value-add into question.

Given the current state of things, I would say solution 2 seems the best way forward. This also concretises the design decision that users should have to deal with nothing but the arguments passed to the exposed methods. Does this make sense to you, or am I missing something?

RaphiePS commented 5 years ago

The auto-fallback-on-pickle-failure is brilliant, but thanks to #44, I don't think it'll be necessary.

As I see it, the core problem with multiprocessing was mutable state stored on Analyser / Selector instances. With the much more functional approach in #44, it's a lot easier to reason about where data is stored and how it'll be passed into user code; it removes a lot of the uncertainty I was worried about.

I've got a bit of a busy week ahead, but with this nicely cleared path, I should have an implementation this weekend. Thanks again!