DerekGloudemans / distributed-detection

Performs stream processing (nominally object detection) across multiple worker processes with decentralized load balancing, decentralized database with eventual consistency, and basic fault monitoring.

Distributed Detection

This repository is a final project for my Resilient Distributed Systems class. It implements a multi-worker image processing system. Images are received at regular intervals from one or more sources, and worker processes detect objects in these images. Computational load is distributed among workers in a decentralized manner, and data is stored in a decentralized manner as well. The database is structured for availability, functionality when partitioned, and eventual consistency (assuming no malicious workers). A simple monitoring process provides basic fault detection and management. The requirements the system is designed to meet are detailed in the Design Specification below.

Directions for Running

It's a bit of a pain to run, sorry. I didn't have quite enough time to get the code into a nice, cleanly packaged application, so here's how it goes:

To run this code, you'll need three separate consoles. In the first, cd into the distributed-detection directory and run:

```
python query_publish.py x y
```

- x - float - average time between sending queries (seconds)
- y - int - number of workers (controls which workers are sent queries)

In the second window, cd into the distributed-detection directory and run:

```
python im_publish.py x directory
```

- x - float - time between sending images (seconds)
- directory - string - path to a folder of images, which will be iterated over and sent to workers
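The image publisher simply iterates over the files in the given directory and sends one at a fixed interval. A minimal, dependency-free sketch of that loop (the function name is illustrative; the actual script sends each image over a ZMQ PUB socket, which is omitted here):

```python
import os
import time

def iter_images(directory, interval):
    """Yield image file paths from `directory` in sorted order,
    pausing `interval` seconds between yields. The real publisher
    would send each path's image bytes to the workers here."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            yield path
            time.sleep(interval)
```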

In the third window, cd into the distributed-detection directory and run:

```
python main.py
```

This will run the distributed stream processing code with 4 worker processes in the normal mode. If you want to introduce faults, you'll have to go to various locations in the code and set boolean values to True. This is somewhat clunky but is all I had time for.

Lastly, the default configuration uses dummy work. If you have pytorch and the other necessary packages installed, you can switch from dummy work to real object detection by changing the boolean on line 722 of worker_thread_fns.py to False. Note that in this case you'll want to run the worker processes on different machines, because pytorch networks are resource intensive and it's difficult for one machine to run multiple pytorch model instances in parallel. For my test case, I ran 2 worker processes using 2 GPUs on the same machine; to do the same, change line 722 in worker_thread_fns.py to False and run main_2_workers.py.
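The dummy-work mode stands in for real detection by burning a fixed amount of time per image. A sketch of the pattern (the flag and function names here are illustrative, not the actual identifiers in worker_thread_fns.py):

```python
import time

USE_DUMMY_WORK = True  # analogous to the boolean toggled on line 722

def process_image(image, work_time=0.1):
    """Return detections for `image`. In dummy mode, simulate
    detection latency and return an empty result; in real mode a
    pytorch detector would be invoked instead (omitted here)."""
    if USE_DUMMY_WORK:
        time.sleep(work_time)  # simulate detection latency
        return []              # no detections in dummy mode
    raise NotImplementedError("real mode requires pytorch and a loaded model")
```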

Design Specification

1. System Actors

The main actors in the system are the worker processes, which perform object detection on incoming images and store the results, and the monitoring process. While the image sender and database query clients are involved in the operation of the system, they are considered external to it (i.e., the system is not designed to be robust to their failure, and it is assumed that they do not deviate from expected operating behavior).

2. Summary of System Workflow

Each worker process is designed to operate more or less independently of the others. While workers communicate to coordinate load balancing, majority responses to query requests, results audits, and database consistency, all of these actions are performed with a fixed timeout, so a partitioned system remains available and is never deadlocked waiting for responses. In the case where no workers respond within the timeout, the worker proceeds as a single agent would. The main components of each worker process are explained below:
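The bounded-wait behavior described above can be sketched without any messaging middleware: a worker gathers whatever peer responses arrive within the timeout and then proceeds, possibly alone. A dependency-free illustration using a queue as the inbox (the real system exchanges these messages over ZMQ; all names here are hypothetical):

```python
import queue
import time

def collect_responses(inbox, expected, timeout=0.5):
    """Gather up to `expected` peer responses from `inbox`, waiting at
    most `timeout` seconds in total. Whatever has arrived when the
    deadline passes is returned, so a partitioned worker is never
    blocked indefinitely waiting on unreachable peers."""
    responses = []
    deadline = time.monotonic() + timeout
    while len(responses) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            responses.append(inbox.get(timeout=remaining))
        except queue.Empty:
            break
    return responses  # may be empty: the worker then acts as a single agent
```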

Worker Process Shared Objects

The rough process flow diagram for one worker process is shown below.

Data is stored in JSON or similar files. To keep things simple, the tester script evaluates the system on a single machine across multiple processes. Neural network detection is thus carried out on the GPU by one process and on the CPU by the others, creating an interesting disparity in processing speed.
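Since each worker keeps its own JSON file, a store operation amounts to reading the file, appending a record, and writing it back. A minimal sketch (the record schema and function name are illustrative, not the repository's actual format):

```python
import json
import os

def store_result(db_path, image_id, detections, timestamp):
    """Append one detection record to a worker's local JSON database.
    Rewriting the whole file on each store is simple and fine for a
    small project-scale database."""
    records = []
    if os.path.exists(db_path):
        with open(db_path) as f:
            records = json.load(f)
    records.append({"image_id": image_id,
                    "detections": detections,
                    "timestamp": timestamp})
    with open(db_path, "w") as f:
        json.dump(records, f)
```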

3. External Libraries and Services Required

The multiprocessing and threading modules are used, along with ZMQ middleware for message passing. Additionally, pytorch is used for the object detection neural network.

4. System Failure Modes and Anomaly Detectors

The following failures are possible and will be tested by introducing them artificially; the anomaly detection method for each is listed in parentheses:

The following failures are not accounted for or are considered outside the scope of the system:

5. Failure Mitigation Strategies