GPSBabel / gpsbabel

GPSBabel: convert, manipulate, and transfer data from GPS programs or GPS receivers. Open Source and supported on MacOS, Windows, Linux, and more. Pointy clicky GUI or a command line version...
https://www.gpsbabel.org
GNU General Public License v2.0
475 stars 126 forks source link

improve correctness and speed of duplicate filter. #1144

Closed tsteven4 closed 1 year ago

tsteven4 commented 1 year ago

The duplicate filter could erroneously delete points that were not duplicates if the crc's happened to match.

The complexity remains O(n^2), however the coefficient is reduced resulting in shorter runtimes. The complexity is not changed by the often useless sort.

waypt_del(Waypoint*) is inefficient as it requires a search of the list to find the matching waypoint. Support waypt_del with iterators.

This test uses the indicated number of waypoints, which were all duplicated, and then the order was randomized. The blue trace is without this PR, the red trace includes just using iterators to delete the duplicates, the yellow trace also uses QMultiHash for simplicity and correctness. dup

Sorry to rain on the btree forest, but it's 2023 and we have libraries available.

tsteven4 commented 1 year ago

This would allow us to retire util_crc.cc as well.

tsteven4 commented 1 year ago

We can do better by swapping the waypoint list and just adding back the points we want as in https://github.com/GPSBabel/gpsbabel/pull/1144/commits/d0bd0fd0777a942ae47827967b4aa88c178e0517

All my testcase times shifted for an unexplained reason, so I reran all the cases. The new green line is with the latest with the waypoint list swap and rebuild.

dupl

robertlipe commented 1 year ago

Actually, other than in unicsv (where we cargo cult pretty much all members of Waypoint - and which didn't exist in the Parker era - the duplicate filter is OLD) is gcdata.exported ever actually set?

It looks like that's now a vestigial read-only variable. We used to parse it in gpx.c, but we haven't done that in years and it's not an official tag in the groundspeak schema.

I don't consider its existence in unicsv as proof that it's used/useful.

Give it a nod of agreement and I'll nuke it.

GPSBabelDeveloper commented 1 year ago

I dut for it. https://github.com/GPSBabel/gpsbabel/commit/93c71e4ff6d9a7e4bddcddad1bdbfc6ef7d7caae

2003 was early days for Pocket Queries (groundspeak/geocaching GPX files) and it was common to have to order up many of them to cover an area. It looks like this was part of his bespoke system for managing multiples so that records from multiple dates that tied into the (taa daa) duplicate filter so he could end up with one PQ that contained the most recently updated records. This was presumably easier than doing actual database stuff and cracking open the logs tags and merging those, but keeping the most recent top-level geocache information for most recent coords, cache description, and the much-needed enabled/archived flags so you wouldn't go into the woods and hunt for a geocache that wasn't there.

These tags didn't exist outside of his personal geocaching tools. I'll prep a CL that removes this in a few moments...

Message ID: @.***>

codacy-production[bot] commented 1 year ago

Coverage summary from Codacy

Merging #1144 (4ff119ca3e6a4801044712f4aa79d4ebff0869f2) into master (f73ac5cfed47217045073dc6bc2f00d4c5547bbc) - See PR on Codacy

Coverage variation Diff coverage
-0.02% 86.67%
Coverage variation details | | Coverable lines | Covered lines | Coverage | | ------------- | ------------- | ------------- | ------------- | | Common ancestor commit (f73ac5cfed47217045073dc6bc2f00d4c5547bbc) | 23091 | 15995 | 69.27% | | | Head commit (4ff119ca3e6a4801044712f4aa79d4ebff0869f2) | 23068 (-23) | 15974 (-21) | 69.25% (**-0.02%**) | **Coverage variation** is the difference between the coverage for the head and common ancestor commits of the pull request branch: ` - `
Diff coverage details | | Coverable lines | Covered lines | Diff coverage | | ------------- | ------------- | ------------- | ------------- | | Pull request (#1144) | 30 | 26 | **86.67%** | **Diff coverage** is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: `/ * 100%`

See your quality gate settings    Change summary preferences

robertlipe commented 1 year ago

Once you've landed this, if you agree it's the thing to do, we'll knife 'exported' with: https://github.com/GPSBabel/gpsbabel/commit/6996e4d7dd4b1812f922edd2c5bd9a3d1bc6fada

If you want to just pull that into your commit, consider it pre-approved.

tsteven4 commented 1 year ago

We run through the hash results and mark the waypoints to be deleted (L124). Then we swap the global_waypt_list with a local empty list, oldlist. Immediately after the swap global_waypt_list is empty, and oldlist has all the waypoints. We go through oldlist and either add the waypoints back to the global_waypt_list, or delete them. The waypoints themselves are never copied, these lists are lists of waypoint pointers.

I have known since converting our double ended queue to WaypointList that waypt_del(Waypoint* wpt) was inefficient as it involved n/2 operations on average searching the list looking for the wpt. We can avoid that by using iterators. What I didn't realize is that QList::erase(QList::iterator pos) also apparently involves n/2 operations on average! This leads to complexities of O(n^2) as we are deleting n/2 points. We can quickly add to the end of a QList, but deleting an element from the middle, even with erase, is slow. This is not true of std::list which "supports constant time insertion and removal of elements from anywhere in the container". When I created WaypointList I anticipated we might want to back it with std::list instead of QList. If we did so then I think we could just use erase (as in intermediate commit 04f534faffc00492851c0d68757718e57cc826860) instead of swapping and rebuilding the global_waypt_list with the wpts we want to keep.

I am using Libre Office Calc to generate the graphs. It supports adding various trend lines which you see, along with their equations and the coefficient of determination. One must be careful to add sufficient points to discriminate between different trend lines when assessing the complexity.

tsteven4 commented 1 year ago

BTW, in my test case, with all wpts using empty gc data, the sort is irrelevant to performance. But I agree, the sort based on exported always struck me as a hidden pet case. I don't think it is mentioned in the documentation at all.

robertlipe commented 1 year ago

Ah.lovely. I hadn't really noticed that waypoint:swap swaps the waypoint LISTS and not actually waypoints. Dig it!

There was talk within the Qt community about turning down some of the Qt containers (some are better, some are worse) and/or backing them with std::foo now that you can count on good implementations everywhere - which was definitely not the case in Qt3 days. I've fallen out of the insiders loop and don't know where that wind is now blowing. DateTime and String are now pretty entrenched, but I think we can swap most of the containers out without a big deal.

I wouldn't casually mix QFoo and std::Foo, but with a good reason, my loyalty could swing.

In my followup deleting exported (I couldn't find a sane case that ever set it...) that sort gets deleted, too. That'll buy some clock cycles back in your case.

When everything was a linked list, waypt_del() was cheap as it was basically two pointer moves. Our foundational containers just changed over the years.

Thanx!

On Mon, Jul 24, 2023 at 3:50 PM tsteven4 @.***> wrote:

BTW, in my test case, with all wpts using empty gc data, the sort is irrelevant to performance. But I agree, the sort based on exported always struck me as a hidden pet case. I don't think it is mentioned in the documentation at all.

— Reply to this email directly, view it on GitHub https://github.com/GPSBabel/gpsbabel/pull/1144#issuecomment-1648594361, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD3Y6XLVGGJ4TB7HLIRTXR3NXXANCNFSM6AAAAAA2T7TKHU . You are receiving this because your review was requested.Message ID: @.***>

tsteven4 commented 1 year ago

To quote Qt: "For most applications, QList is the best type to use. It provides very fast appends. If you really need a linked-list, use std::list."

GPSBabelDeveloper commented 1 year ago

There's some truth to that. We started with a linked list.

On Mon, Jul 24, 2023 at 4:43 PM tsteven4 @.***> wrote:

To quote Qt: "For most applications, QList https://doc.qt.io/qt-6/qlist.html is the best type to use. It provides very fast appends. If you really need a linked-list, use std::list."

— Reply to this email directly, view it on GitHub https://github.com/GPSBabel/gpsbabel/pull/1144#issuecomment-1648669403, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3VAD6YMCUFJ2SJQU2H47LXR3UAVANCNFSM6AAAAAA2T7TKHU . You are receiving this because you commented.Message ID: @.***>

tsteven4 commented 1 year ago

You pushed https://github.com/GPSBabel/gpsbabel/pull/1144/commits/6996e4d7dd4b1812f922edd2c5bd9a3d1bc6fada into this PR. I intend to revert it and merge this. Then you can add it to another PR.

robertlipe commented 1 year ago

That sounds the opposite of helpful. Sorry and thanx.

On Mon, Jul 24, 2023 at 7:28 PM tsteven4 @.***> wrote:

Merged #1144 https://github.com/GPSBabel/gpsbabel/pull/1144 into master.

— Reply to this email directly, view it on GitHub https://github.com/GPSBabel/gpsbabel/pull/1144#event-9905899593, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD34F6FXQK6OGQ64KLH3XR4HLFANCNFSM6AAAAAA2T7TKHU . You are receiving this because your review was requested.Message ID: @.***>

tsteven4 commented 1 year ago

If helpful I can create a branch with your changes on top of the new master, which I think is what you wanted in #1145.