DerekGloudemans / distributed-detection

Performs stream processing (nominally object detection) across multiple worker processes with decentralized load balancing, decentralized database with eventual consistency, and basic fault monitoring.

Load balancing skips images #10

Open DerekGloudemans opened 5 years ago

DerekGloudemans commented 5 years ago

Because the load balancer searches all wait times received within a given valid time window, both processes can find a min_time lower than their own for the same frame, in which case neither processes it. The method for storing times therefore needs to be adjusted. As a secondary issue, a process is busy while image processing is under way, so load balance messages can time out during that wait.

DerekGloudemans commented 5 years ago

With a sufficiently high heartbeat rate, the problem more or less goes away. Conceivably, the load balancer thread could be called immediately after image processing completes, before a heartbeat message has been received and added to lb_result_queue. If this problem is noticed, could add a small sleep increment in the load balancing function.
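A minimal sketch of the sleep idea from the comment above, not the repository's actual code. It assumes `lb_result_queue` behaves like a standard `queue.Queue`; the function name `drain_after_grace` and the `grace` parameter are hypothetical.

```python
import queue
import time


def drain_after_grace(lb_result_queue, grace=0.01):
    """Wait briefly, then collect every message currently in the queue.

    The short sleep gives an in-flight heartbeat a chance to arrive
    before the load balancer makes its decision.
    """
    time.sleep(grace)
    messages = []
    while True:
        try:
            # get_nowait() raises queue.Empty once the queue is drained.
            messages.append(lb_result_queue.get_nowait())
        except queue.Empty:
            return messages
```

The grace period trades a little latency per frame for a smaller chance of deciding on incomplete information, so it would need tuning alongside the heartbeat rate.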

DerekGloudemans commented 5 years ago

The issue may be caused by the discrepancy described above. Raising the heartbeat rate doesn't help, because the old, lower wait times are still present in the list of wait times. Each time therefore needs to be checked against other times with the same worker num, keeping only the newest. Will use a dict keyed by worker num.
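One way the dict-based fix could look, as a sketch under stated assumptions rather than the code actually committed. The message layout `(worker_num, timestamp, wait_time)` and the names `latest_wait_times` and `should_process` are hypothetical.

```python
def latest_wait_times(messages, now, window):
    """Return {worker_num: wait_time}, keeping only the newest in-window message per worker."""
    newest = {}  # worker_num -> (timestamp, wait_time)
    for worker_num, timestamp, wait_time in messages:
        if now - timestamp > window:
            continue  # outside the valid time window
        if worker_num not in newest or timestamp > newest[worker_num][0]:
            newest[worker_num] = (timestamp, wait_time)
    return {w: wt for w, (_, wt) in newest.items()}


def should_process(own_wait, others):
    """Take the frame when no other worker reports a strictly lower wait time."""
    return all(own_wait <= wt for wt in others.values())


# Worker 1 sent an old, low wait time (0.2) and a newer, higher one (0.9).
messages = [(1, 10.0, 0.2), (1, 11.0, 0.9)]
others = latest_wait_times(messages, now=11.5, window=5.0)
print(others)                       # {1: 0.9}
print(should_process(0.5, others))  # True
```

With a flat list, the stale 0.2 would still count as the minimum and a worker with a wait time of 0.5 would wrongly pass on the frame. Keeping one entry per worker removes that stale value.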

DerekGloudemans commented 5 years ago

Can still occasionally skip a frame if the most recent wait_time values for all workers are slightly out of sync, such that each worker believes the lowest wait time is lower than its own. Tuning the heartbeat rate and load_balancing timeout to the specific task and processing time can more or less mitigate this.
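The remaining failure mode can be illustrated with two workers whose views of each other are one heartbeat behind. This is a toy reproduction with made-up numbers; `claims_frame` is a hypothetical stand-in for the real decision logic.

```python
def claims_frame(own_wait, peer_view):
    """A worker takes the frame only if no peer appears to have a lower wait time."""
    return all(own_wait <= wt for wt in peer_view.values())


# True current wait times: worker 1 = 0.30, worker 2 = 0.40.
# Each worker still holds the other's previous heartbeat value.
view_of_worker_1 = {2: 0.20}  # stale: worker 2 has since risen to 0.40
view_of_worker_2 = {1: 0.25}  # stale: worker 1 has since risen to 0.30

takers = [
    worker
    for worker, own_wait, view in [(1, 0.30, view_of_worker_1), (2, 0.40, view_of_worker_2)]
    if claims_frame(own_wait, view)
]
print(takers)  # prints [] because neither worker claims the frame
```

Each worker compares its fresh value against a stale, lower value from its peer, so both defer and the frame is skipped. A faster heartbeat narrows the window in which the views disagree but does not close it.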