Open joshuacroff opened 4 months ago
Thanks for opening this and for providing so much detail! Sorry you're having issues with it.
Should the method logic be updated to process synchronously if processes=1 and use multiprocessing otherwise?
Yes, you're absolutely right, we had the logic mixed up here and so I'll put in a quick fix to make sure the matcher uses multiprocessing only if processes>1
, good find!
Could the need for pickling or serialization in multiprocessing be causing this issue?
Yes, I think so, I traced this back to the rtree spatial index that our road map uses to quickly lookup road geometry. From the docs it seems as if there is a known issue where the rtree cannot be pickled and unpickled correctly.
So, unfortunately it won't be possible to use multiprocessing with this matcher until the pickling issue gets fixed or we find a suitable workaround. Sorry that we don't have a quick fix in this situation but I'll open an issue to explore alternative options for multiprocessing that don't result in errors with the rtree.
@nreinicke thanks for your responsiveness and for making the logic quick-fix! Also no worries on the multiprocessing issue, I totally understand. I burned quite a few hours myself trying to find a work-around and just couldn't quite make it work.
I did have some logic that worked, but required that I create a couple of new functions wrapping existing mappymatch functions/methods that allowed for asynchronous processing. However, the logic kind of hammers the Overpass API and after about a 1000 records or so, I saw a significant slowdown. You can see my attempt here.
I created two functions:
batch_process_traces_parallel()
process_trace
process_trace()
The problem with this approach seemed to be API throttling. I refactored to pull the entire network down within my area of interest to see if removing the constant calls to the API would help. This is when I ran into issues with pickling a matcher object. If you're curious, my current, non-working attempt is here.
I also tried synchronous logic. My thinking was that if I remove the api calls, perhaps I would get adequate enough speed-up that I wouldn't need to implement asynchronous logic. Somehow, that was slower than the synchronous logic where I made repeated calls to the API, perhaps because the network graph is so much larger? I'm now looking into ways to 'clip' the locally cached network to a trace buffer to see if that speeds things up. This has been a huge learning experience for me I'll say. :)
Problem Description
I've encountered an issue when using the
LCSSMatcher.match_trace_batch()
method for batch matching with multiprocessing. Below is the method definition for reference:It's located in the code here: lcss.py#L150
When I run the following code with the default single process:
I get this error message:
Observations
processes=1
).processes
to a higher number (e.g., 8), it processes synchronously using list comprehension without errors, but it's slow:concurrent.futures
also results in the same error:Hypothesis
It seems that multiprocessing might be causing the issue due to object serialization ('pickling').
Questions
processes=1
and use multiprocessing otherwise?Any insights or suggestions would be greatly appreciated.