e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Revamp the `filter_accuracy` step #637

Open shankari opened 3 years ago

shankari commented 3 years ago

It was the most resource-intensive step way back in 2015. But it is also a bit weirdly written, and predates the advent of database indices. Let's revisit it and see if we can make it performant.

shankari commented 3 years ago

Question #1: why the heck am I checking for duplicates? `check_prior_duplicate` looks to see if there is a prior entry with the same latitude and longitude.

Ah, it is trying to replicate this code from the data collection:

    assert(last10Points.length > 0);
    if (simpleLoc.distanceTo(last10Points[last10Points.length - 1]) != 0) {
        validPoint = true;
    } else {
        Log.i(this, TAG, "Duplicate point," + loc + " skipping ");
    }

But that just compares the current point to the last point, while the code on the server compares against all previous entries:

    duplicates = df.loc[0:idx-1].query("latitude == @entry.latitude and longitude == @entry.longitude")

We should be able to simplify the server check to only look at the last point, but an even better version would be to drop duplicates directly: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
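A minimal sketch of the two semantics, using a toy dataframe (the real column names, per the query above, are `latitude` and `longitude`):

    import pandas as pd

    # Toy dataframe standing in for the unfiltered locations
    df = pd.DataFrame({"latitude":  [37.39, 37.39, 37.40, 37.39],
                       "longitude": [-122.08, -122.08, -122.09, -122.08]})

    # Server semantics: drop any point whose (lat, lon) appeared anywhere earlier
    deduped_all = df.drop_duplicates(subset=["latitude", "longitude"])

    # Phone semantics: drop only points identical to the immediately preceding one
    same_as_prev = ((df.latitude == df.latitude.shift()) &
                    (df.longitude == df.longitude.shift()))
    deduped_consecutive = df[~same_as_prev]

    print(deduped_all)          # keeps rows 0 and 2
    print(deduped_consecutive)  # keeps rows 0, 2 and 3

Note that the two are not equivalent: row 3 duplicates row 0 but not its immediate predecessor, so it survives the phone-style check but not `drop_duplicates`.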

shankari commented 3 years ago

Next, instead of `check_existing_filtered_location`, which makes multiple database calls, we can read the filtered locations in a batch and use pandas to find existing entries. We can then drop the existing entries to find the ones to insert.
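A sketch of the batch approach; `timeseries.get_data_df` and the use of `ts` as the matching key are assumptions about the storage interface here, not a verified API:

    # Hypothetical batch reads; ts is assumed to uniquely identify a point
    unfiltered_df = timeseries.get_data_df("background/location", time_query)
    filtered_df = timeseries.get_data_df("background/filtered_location", time_query)

    # One pandas operation replaces the per-entry
    # check_existing_filtered_location database calls
    already_filtered = unfiltered_df.ts.isin(filtered_df.ts)
    to_insert_df = unfiltered_df[~already_filtered]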

And finally, instead of making multiple calls to the database to clone the unfiltered entries, we can read all the unfiltered entries at the beginning and simply iterate over them in memory to create the filtered clones.
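Continuing the sketch, the in-memory cloning could look like this, where `to_insert_entries` stands for the raw entries corresponding to `to_insert_df` above, and the rewritten fields (`_id`, `metadata.key`) are assumptions about the entry format:

    import copy

    def clone_to_filtered(entry):
        # entry: raw "background/location" dict read from the timeseries
        filtered = copy.deepcopy(entry)
        del filtered["_id"]  # let the database assign a fresh id on insert
        filtered["metadata"]["key"] = "background/filtered_location"
        return filtered

    # Build all the clones in memory, then insert them in one batch
    to_insert = [clone_to_filtered(e) for e in to_insert_entries]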

shankari commented 3 years ago

For completeness, it seems like we should really save the entries generated this way into the analysis results rather than into `background/filtered_location`, since the pipeline runs are assumed to be idempotent.

However, the filtered location on the phone is also generated as the result of analysis - in fact, we are running the same analysis on the phone and the server. So why would one be in the timeseries (as background) and the other be in the analysis_timeseries (as analysis)?

Should we save the filtered_location from the phone in the analysis database as well? But that analysis drives the trip end detection on the phone, which does affect the raw data. Deleting those entries would obscure how/when we detected trip ends on the phone.

So for now, we will save the background/filtered_location entries in the timeseries under the theory that the filtered_location is a simple filtered version of the location, computed either on the phone or the server.

If the filtering happens on the phone, the entries come from the phone; if it doesn't, the server fills them in. The rest of the pipeline simply assumes that the entries exist.