CrisisCleanup / crisiscleanup-1

[OLD] Legacy Crisis Cleanup on GAE/Python
https://sandy-disaster-recovery.appspot.com

Duplicate Detection #83

Closed aarontitus closed 11 years ago

aarontitus commented 11 years ago

Original author: v...@aarontitus.net (November 15, 2012 19:54:13)

We need to detect duplicates and give people an opportunity to merge. This is a very very high priority. I was two-thirds done writing use cases explaining how to handle various cases and exceptions. And then my computer rebooted itself and I lost everything.

But I need something in here as a placeholder so here it is.

Original issue: http://code.google.com/p/sandy-disaster-recovery/issues/detail?id=83

aarontitus commented 11 years ago

From v...@aarontitus.net on January 14, 2013 03:02:37 Merging happens at two stages:

  1. Entry: A volunteer conducts an assessment, and the client already has a work order in the system. The system should detect the existing work order, and give an opportunity to use the existing work order instead of creating a new work order.
  2. Audit: A volunteer finds two work orders that are obviously duplicates, and wants to merge them without destroying any relevant information.

Merge feature: for duplicate work orders already in the system, combine them and keep both work order numbers. If the two are not the same, consider changing the lat/long by 100 feet so that the icons aren’t on top of one another. Just an idea, not required.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 18, 2013 00:19:00 The duplicate detection engine should also work during the Import process (#133).

aarontitus commented 11 years ago

From v...@aarontitus.net on March 18, 2013 00:30:21 For ENTRY duplicate detection, I think there may be a lightweight pre-duplicate-detection validation we could implement by applying the same AJAX we use in the map search feature (#63). For example, if I type "Scot" in the name, address, city or phone number, then the following results may automatically appear under the entry. By selecting one of them, I would be taken directly to the "Edit" page:

<Scott Smith, 123 Main Street, Union City NJ 01234>
<Jim Jones, 476 Scotch Plains Ave, Rockaway, NY 12345>
<Fria Lima, 9377 Home St, Scottsdale, CT 23905>

This would require that the map AJAX be loaded each time someone enters an assessment form. We may want to find a way to cache that once per session for the assessment form. But it seems like a quick way to minimize entry duplicates.

NOTE: This is NOT system-driven duplicate detection, and this process will not detect duplicates imported
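
As a rough illustration of the suggestion lookup described above, here is a minimal Python sketch. It assumes the cached search terms are available as a list of dicts; the field names in `SEARCH_FIELDS`, the function `suggest_existing` and the fourth sample record are hypothetical and not taken from the codebase.

```python
from typing import Dict, List

# Hypothetical field names; the real site model may use different ones.
SEARCH_FIELDS = ("name", "address", "city", "phone")


def suggest_existing(term: str, sites: List[Dict[str, str]], limit: int = 5) -> List[Dict[str, str]]:
    """Return up to `limit` sites where any search field contains `term`, ignoring case."""
    needle = term.strip().lower()
    if not needle:
        return []
    matches = []
    for site in sites:
        if any(needle in site.get(field, "").lower() for field in SEARCH_FIELDS):
            matches.append(site)
            if len(matches) == limit:
                break
    return matches


# The first three records mirror the example in the comment; the last is made up.
SITES = [
    {"name": "Scott Smith", "address": "123 Main Street", "city": "Union City"},
    {"name": "Jim Jones", "address": "476 Scotch Plains Ave", "city": "Rockaway"},
    {"name": "Fria Lima", "address": "9377 Home St", "city": "Scottsdale"},
    {"name": "Ann Lee", "address": "5 Oak Rd", "city": "Hoboken"},
]

for site in suggest_existing("Scot", SITES):
    print(site["name"], "|", site["address"], "|", site["city"])
```

This is plain substring matching, so it only catches what the user has literally typed; misspellings would still need the system-driven detection.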

aarontitus commented 11 years ago

From cpw...@gmail.com on March 18, 2013 16:52:55 Agreed - this is possible if the site data 'terms' used in the search autocomplete are made available on the form.

The search terms can be cached to local storage in a cross-browser-compatible way and then queried as you suggest. I will try this.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 18, 2013 23:12:19 A possible behaviour for #4 is now live on the testbed (v155).

(There is no caching to local storage - the map page doesn't do any itself - added as #219).

aarontitus commented 11 years ago

From cpw...@gmail.com on March 18, 2013 23:13:44 I say "possible" because I can think of a few different ways of doing this, but it's not obvious what would be best wrt UI - so see what you think.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 19, 2013 15:40:58 Having played with #4 more, the behaviour of filling fields is annoying at best! It needs to get out of the way if the user isn't interested.

I will disable it and try to improve it.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 19, 2013 17:51:25 Backend-powered duplicate detection is now live on the testbed (v156), using double metaphones [1], for the form only.

Try entering misspelled(ish) names and addresses, and also note the new "Ignore similar matches" checkbox.

Not currently integrated with importing CSVs.

[1] http://en.wikipedia.org/wiki/Metaphone

aarontitus commented 11 years ago

From cpw...@gmail.com on March 19, 2013 17:53:56 Entry suggestions (#4) are temporarily disabled - split out to #220 to keep this one about the backend.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 19, 2013 18:24:44 Similarity matching algorithm as of v156 (from doc of find_similar()):

Two sites are similar if at least one of:
(i) Their normalised phone numbers are the same.
(ii) Their name and address metaphones and digits in address all match.

Note: current similarity algorithm does not check city field due to use of synonyms in the field (e.g. "New York" vs "New York City" vs "NYC")
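
The thread does not say how phone numbers are normalised for rule (i). One plausible reading, sketched below in Python, is to keep only the digits and drop a leading US country code. The function names and the country-code handling are assumptions, not the actual find_similar() code.

```python
import re


def normalise_phone(raw: str) -> str:
    """Keep digits only; drop a leading '1' from 11-digit numbers (assumed US convention)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def phones_match(a: str, b: str) -> bool:
    """Rule (i): normalised numbers are equal. Empty numbers never match (an assumption)."""
    na, nb = normalise_phone(a), normalise_phone(b)
    return bool(na) and na == nb


print(normalise_phone("(555) 010-0123"))
print(phones_match("555-010-0123", "+1 555.010.0123"))
```

Treating two blank phone fields as a non-match avoids flagging every record that has no phone number as a duplicate of every other.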

aarontitus commented 11 years ago

From cpw...@gmail.com on March 19, 2013 18:39:00 Duplicate detection during importing CSVs is live on the testbed (v156).

Login as admin and see e.g. https://sandy-helping-hands.appspot.com/admin-import-csv/active?id=280002

aarontitus commented 11 years ago

From v...@aarontitus.net on March 19, 2013 23:00:05 Can two sites be similar if their lat/lon combinations are within ~25 feet of one another? How hard would that be to do?

aarontitus commented 11 years ago

From cpw...@gmail.com on March 20, 2013 10:15:20 Proximity searches as a kind of geospatial query are possible, e.g. as per https://developers.google.com/maps/articles/geospatial

The hack way of doing it is to quantize the co-ordinates to a grid and search on grid co-ordinates, which is strictly less accurate.
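
To make the grid idea concrete, here is a minimal Python sketch of quantizing co-ordinates to a fixed grid. The cell size and the names `grid_cell` and `neighbour_cells` are illustrative assumptions. It also shows why the approach is less accurate: two points a couple of metres apart can fall in different cells, so a lookup would need to check neighbouring cells as well.

```python
import math
from typing import List, Tuple

# Assumed cell size: 0.0001 degrees is roughly 11 m of latitude.
CELL_DEGREES = 0.0001


def grid_cell(lat: float, lon: float, cell: float = CELL_DEGREES) -> Tuple[int, int]:
    """Map a co-ordinate to integer grid indices that can be stored and queried by equality."""
    return (math.floor(lat / cell), math.floor(lon / cell))


def neighbour_cells(cell_index: Tuple[int, int]) -> List[Tuple[int, int]]:
    """The cell itself plus its 8 neighbours, to catch points just across a cell boundary."""
    row, col = cell_index
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


a = grid_cell(40.20999, -74.03862)
b = grid_cell(40.21001, -74.03862)  # about 2 m north of `a`
print(a == b)                    # the two points land in different cells
print(b in neighbour_cells(a))   # but b is in a neighbouring cell
```

A fixed step in degrees keeps the indices consistent everywhere, at the cost of cells that are narrower east to west than north to south at this latitude.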

Either way, how should it fit in with the algorithm? Should it take the place of the address-related similarity matching in the algorithm in #12?

aarontitus commented 11 years ago

From v...@aarontitus.net on March 20, 2013 15:59:45 It seems to me that geospatial proximity searches will be the most accurate. Sometimes you might have a case worker as the primary contact, or a neighbor with a working phone.

I think that the geospatial query should be first in the algorithm, and that address-related similarity should be secondary. I still would like the address and phone number check, just in case Google geocodes incorrectly.

Does that answer your question?

aarontitus commented 11 years ago

From cpw...@gmail.com on March 20, 2013 23:03:46 Yes, but I think using geocoded co-ords will be all-or-nothing - almost every address geocodes to somewhere (and if it doesn't, the work order gets rejected in the import csv case). So there's no need to consider the metaphonic similarity of address if it doesn't geocode.

Regarding similar names, maybe drop/ignore this? What are the cases?

aarontitus commented 11 years ago

From v...@aarontitus.net on March 20, 2013 23:54:56 Yes, while virtually all addresses will geocode, my experience has been that they will not always geocode correctly. This is often due to user error. There have been several instances when the geocoordinates were set to 0,0 (no idea how this happened- assume User error). There are other times when the user will enter information incorrectly, e.g. "Rolly Rd." instead of "Raleigh Rd." In these instances, Google will make an educated guess. Most of the time Google guesses right. Sometimes it says, "I don't know," but sometimes it guesses wrong, and the user blindly accepts Google's guess. Keeping the name and phone number check as secondary reminders would be very helpful, especially when information is entered by a live person.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 15:59:47 New algorithm, using geospatial proximity search, is:

Two sites are similar if at least one of:
(i) Their geocoded co-ords are within 8 metres of each other.
(ii) Their name metaphone and normalised phone numbers match.

(geo index under batch construction now)
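
For reference, rule (i) comes down to a great-circle distance test between two points. The Python sketch below uses the haversine formula on a pair of co-ordinates. It is only an illustration of the distance check, not the geo index used in production, and the function names are made up.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(p, q, radius_m: float = 8.0) -> bool:
    """Rule (i): True when the two (lat, lon) tuples are at most radius_m apart."""
    return haversine_m(p[0], p[1], q[0], q[1]) <= radius_m


# 0.00005 degrees of latitude is about 5.6 m; 0.0001 degrees is about 11 m.
print(within_radius((40.20910, -74.03860), (40.20915, -74.03860)))
print(within_radius((40.20910, -74.03860), (40.20920, -74.03860)))
```

The radius is a parameter so the same check works for any threshold.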

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 16:08:39 Live now on production - version 159.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 16:24:07 Cool. Comments:

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 16:27:52 Also, I think 8 meters is too much. In tightly-packed homes (e.g. on breezy point), some lots aren't much more than about 5 meters across. I'd like to bump the sensitivity down to about 4 meters, if that's OK.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 16:29:14 Unexpected bug: In the infobox, the following items appear: City metaphone: NPTN-. Name metaphone: PPTSNKTS. Address metaphone: RFRFKRT. Phone normalised: 7328046041.

Is there any way we can hide that information from the InfoBox?

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 16:53:24 At present, Infoboxes are only available for display on the map view, not the form view.

As a stop gap, how about opening a new tab on click of the error message?

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 16:59:41 Sure. That's an OK stop gap for the Infobox issue. I worry about someone who has interviewed a person over the phone for 8 minutes, clicks enter, only to find that it's a duplicate, then when they edit, all of their notes from the past 8 minutes are lost.

At least opening up the view/ edit would allow them to save their work.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 17:12:27 Understood re losing data.

New warning message, 4 metre radius, infobox alteration now live on testbed.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 17:33:57 On sandbox, please check A3423 and A1861. They are duplicates, but were not flagged as duplicates. (107 Wakewood Rd) Any ideas?

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 17:47:41 The saved co-ordinates are different, so there's 1.5 km between them:

https://sandy-helping-hands.appspot.com/edit?case=A1861 https://sandy-helping-hands.appspot.com/edit?case=A3423

=>

http://www.wolframalpha.com/input/?i=distance+between+%2840.2091219%2C+-74.0386271%29+and+%2840.196577%2C+-74.044195%29&a=*C.distance-_*GeoQueryType-
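
The figure can be cross-checked without WolframAlpha. The Python sketch below plugs the two saved co-ordinate pairs from the link into an equirectangular approximation, which is adequate at this scale; the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def approx_distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Flat-earth (equirectangular) distance in metres; fine for points a few km apart."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_M
    return math.hypot(dx, dy)


# Co-ordinates taken from the WolframAlpha query above.
d = approx_distance_m(40.2091219, -74.0386271, 40.196577, -74.044195)
print(f"{d / 1000:.2f} km")
```

The result comes out at roughly 1.5 km, far outside a radius of a few metres, which is why the two work orders were not flagged.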

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 17:50:36 A1861 could have used old geocoding data, at the point it was geocoded?

It geocodes to the same as A3423 when rerun now (by clicking into an address field).

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 18:28:21 That's strange; especially since the two appeared on the map at exactly the same location. Perhaps the system does not actually use the lat/lon fields to map? Perhaps the Google Maps API just reads the address de novo each time. I dunno... it's a strange behavior.

Here's something else interesting- I allowed Google maps to re-geocode A1861. As expected, it geocoded the same as A3423. I saved it, and it did not flag as a duplicate (probably because I am editing, rather than creating a new record).

It would be good to check for duplicates on save, as well. Allow the save to occur, and provide links to each of the duplicates. Then you can (optionally) go edit them and change the status to "Closed, duplicate"

It's a good way to clean up the database as we move forward.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 19:04:15 I see - it's usefully different behaviour. Ok, will make a change to edit-saves.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 19:16:31 Currently, after saving an edit, the map page is shown with the infobox of the work order just saved.

Three options then: (i) Show suspected duplicates on this infobox, (ii) Show suspected duplicates on all infoboxes (simpler; also more powerful?), (iii) introduce new UI

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 19:33:18 (i) seems a little crowded, but theoretically possible. (ii) I can't exactly visualize how this would work, but I like simpler and more powerful. (iii) Seems unnecessary.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 25, 2013 20:01:44 Changed "To ignore this warning, select the checkbox at the bottom of this form." -> "Ignore & Save" button.

Deployed to testbed.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 25, 2013 20:13:26 Much better. I like the Ignore & Save button.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 26, 2013 01:27:09 Important caveat to Comment 19 algorithm:

If the address line (not State) contains any one of the following case-insensitive complete words (e.g. having a line break or space before and after):

["#", "Suite", "Ste", "Apartment", "Apt", "Unit", "Department", "Dept", "Room", "Rm", "Floor", "Fl", "Bldg", "Building", "Basement", "Bsmt", "Front", "Frnt", "Lobby", "Lbby", "Lot", "Lower", "Lowr", "Office", "Ofc", "Penthouse", "Pent", "PH", "Rear", "Side", "Slip", "Space", "Trailer", "Trlr", "Upper", "Uppr"]

then assume that it's an apartment, and therefore two sites are similar if their name metaphone and normalised phone numbers match.

Else, two sites are similar if at least one of:
(i) Their geocoded co-ords are within 8 metres of each other.
(ii) Their name metaphone and normalised phone numbers match.

Let me know if this makes sense.
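
A minimal Python sketch of the keyword test described above, assuming the address line is split on whitespace and compared word by word. Stripping trailing "." and "," (so that "Apt." still counts) is an added assumption, and `implies_apartment` is a hypothetical name.

```python
# Keyword list copied from the comment above, lower-cased for case-insensitive matching.
APARTMENT_WORDS = frozenset(word.lower() for word in [
    "#", "Suite", "Ste", "Apartment", "Apt", "Unit", "Department", "Dept",
    "Room", "Rm", "Floor", "Fl", "Bldg", "Building", "Basement", "Bsmt",
    "Front", "Frnt", "Lobby", "Lbby", "Lot", "Lower", "Lowr", "Office", "Ofc",
    "Penthouse", "Pent", "PH", "Rear", "Side", "Slip", "Space",
    "Trailer", "Trlr", "Upper", "Uppr",
])


def implies_apartment(address_line: str) -> bool:
    """True if any complete word in the address line is one of the apartment keywords."""
    # Assumption: trailing '.' and ',' are ignored so "Apt." and "Unit," still match.
    tokens = (token.strip(".,").lower() for token in address_line.split())
    return any(token in APARTMENT_WORDS for token in tokens)


print(implies_apartment("15 Park Ave Apt 3"))   # keyword "Apt" present
print(implies_apartment("12 Frontier Rd"))      # "Front" is only a substring here
```

Because matching is on whole words, "Frontier" does not trigger the rule, but a street such as "Lower Bay Rd" would. That false positive only switches off the co-ordinate check for the address, so the name and phone rule still applies.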

aarontitus commented 11 years ago

From cpw...@gmail.com on March 26, 2013 11:27:10 If an address value implies an apartment, don't check the geocoded co-ords - ignore the address and look at name and phone number only?

aarontitus commented 11 years ago

From v...@aarontitus.net on March 26, 2013 14:05:39 Correct. On further reflection, 1. Ignore geocoordinates. 2. Compare address text for exact match (if easy). 3. Look at name and phone.

Is #2 feasible/ easy? If not, don't worry about it; stick with 1 and 3.

aarontitus commented 11 years ago

From cpw...@gmail.com on March 26, 2013 14:42:19 Address comparison is already in the codebase (from the previous design) but disabled.

Comparing is done like the name comparison plus a check of the digits - e.g. "15 park ave" is the same as "15 parcav" but not the same as "16 park ave".
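
To illustrate "sounds plus digits", here is a small Python sketch. The codebase uses double metaphones; `crude_sound_key` below is NOT that algorithm, only a simplistic stand-in so that the example runs without third-party libraries. A real phonetic function can be passed in through the `sound_key` parameter. All names here are hypothetical.

```python
import re
from typing import Callable, List


def crude_sound_key(text: str) -> str:
    """Very rough phonetic key. A stand-in for illustration, not Double Metaphone."""
    letters = re.sub(r"[^a-z]", "", text.lower())  # keep letters only
    letters = letters.replace("c", "k")             # treat c and k alike
    key = re.sub(r"[aeiou]", "", letters)           # drop vowels
    key = re.sub(r"(.)\1+", r"\1", key)             # collapse repeated letters
    return key.upper()


def address_digits(text: str) -> List[str]:
    """All runs of digits in the address, in order (for example house and unit numbers)."""
    return re.findall(r"\d+", text)


def addresses_similar(a: str, b: str, sound_key: Callable[[str], str] = crude_sound_key) -> bool:
    """Addresses match when their digits are identical and their sound keys are equal."""
    return address_digits(a) == address_digits(b) and sound_key(a) == sound_key(b)


print(addresses_similar("15 park ave", "15 parcav"))
print(addresses_similar("15 park ave", "16 park ave"))
```

Comparing the digits separately matters because phonetic keys ignore numbers, so "15" and "16" on the same street would otherwise look identical.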

Would exact match be better? I'm not sure of the cases.

aarontitus commented 11 years ago

From v...@aarontitus.net on March 26, 2013 16:01:03 I think comparing the address sounds ("park ave" and "parcav") plus the digits makes the most sense. If that's straightforward, we can do that and close this issue.

aarontitus commented 11 years ago

From cpw...@gmail.com on April 02, 2013 14:12:35 This change is being deployed now. The new algorithm is:

Two sites are similar if at least one of:
(i) The addresses imply an apartment and the addresses and names have
    matching metaphones.
(ii) The addresses do not imply an apartment and the geocoded co-ords
     are within 4 metres of each other.
(iii) Their name metaphone and normalised phone numbers match.
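
The three rules can be written as a single predicate. The Python sketch below assumes the metaphones, address digits, normalised phone and apartment flag have already been computed for each site and that the distance between the two sites is supplied by the caller. The `SiteKeys` fields and `is_similar` are hypothetical names. Reading rule (i) as "both addresses imply an apartment", rule (ii) as "neither does", and including the digits check in the address match are interpretations, not confirmed by the thread.

```python
from dataclasses import dataclass


@dataclass
class SiteKeys:
    """Precomputed comparison keys for one site (hypothetical structure)."""
    name_metaphone: str
    address_metaphone: str
    address_digits: str
    phone_normalised: str
    implies_apartment: bool


def is_similar(a: SiteKeys, b: SiteKeys, distance_m: float, radius_m: float = 4.0) -> bool:
    """Apply rules (i) to (iii); empty keys never count as a match (an assumption)."""
    names_match = bool(a.name_metaphone) and a.name_metaphone == b.name_metaphone
    addresses_match = (
        bool(a.address_metaphone)
        and a.address_metaphone == b.address_metaphone
        and a.address_digits == b.address_digits
    )
    phones_match = bool(a.phone_normalised) and a.phone_normalised == b.phone_normalised
    both_apartments = a.implies_apartment and b.implies_apartment
    neither_apartment = not a.implies_apartment and not b.implies_apartment
    return (
        (both_apartments and addresses_match and names_match)   # rule (i)
        or (neither_apartment and distance_m <= radius_m)       # rule (ii)
        or (names_match and phones_match)                       # rule (iii)
    )


# Two different households in the same building share co-ordinates but are not flagged.
apt_a = SiteKeys("SM0", "PRKF", "15", "", True)
apt_b = SiteKeys("JNS", "PRKF", "15", "", True)
print(is_similar(apt_a, apt_b, distance_m=0.0))
```

Under this reading, a pair where only one address implies an apartment can match through rule (iii) alone.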

aarontitus commented 11 years ago

From cpw...@gmail.com on April 02, 2013 14:41:42 Deployed to live/production.

aarontitus commented 11 years ago

From v...@aarontitus.net on April 02, 2013 14:56:42 Awesome! Thanks!

aarontitus commented 11 years ago

From v...@aarontitus.net on April 13, 2013 00:31:26 If this is done, go ahead and mark it completed. I'm pretty sure it's done.

aarontitus commented 11 years ago

This issue is complete. Closing out.