J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Window indexing algorithm #52

Open kevohagan opened 6 years ago

kevohagan commented 6 years ago

I'm kind of stuck trying to write a custom blocking function:

I would like to index on a date interval, for example all pairs where df_a['start'] + 2 days >= df_b['start'].

I just can't figure out how to implement a function that returns a MultiIndex like this. Any clues?

Thank you so much for such a great toolkit :) !

kevohagan commented 6 years ago

@J535D165 would you maybe have an idea on how to do this? :/ Thanks!

J535D165 commented 6 years ago

hello @kevohagan

You are looking for an Adaptive Sorted Neighbourhood Indexing method. This is not implemented, but in your case, you can easily get very similar results with the Sorted Neighbourhood Indexing method.

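import pandas as pd
import recordlinkage

# Convert the start date to a number of days since the Unix epoch.
# Assumes 'start' is a datetime64 column; on a Series the day count is .dt.days.
epoch = pd.Timestamp('1970-01-01')
df_a['start_unix'] = (df_a['start'] - epoch).dt.days
df_b['start_unix'] = (df_b['start'] - epoch).dt.days - 1

# SNI indexer (newer releases expose this class as recordlinkage.index.SortedNeighbourhood)
indexer = recordlinkage.SortedNeighbourhoodIndex(left_on='start_unix', right_on='start_unix', window=3)
candidate_pairs = indexer.index(df_a, df_b)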

Or do your own merge (check the source code of BlockIndex and SNI for details).
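
For the "do your own merge" route, here is a minimal sketch in plain pandas and NumPy. It is not part of recordlinkage: the helper name `date_window_pairs` and its parameters are made up for illustration. It also assumes the condition means df_a['start'] <= df_b['start'] <= df_a['start'] + 2 days, and that neither 'start' column contains missing dates.

```python
import numpy as np
import pandas as pd


def date_window_pairs(df_a, df_b, on="start", days=2):
    """Hypothetical helper (not a recordlinkage API).

    Returns a MultiIndex of (df_a label, df_b label) pairs where
    df_a[on] <= df_b[on] <= df_a[on] + `days` days.
    Assumes both columns are datetime64 without missing values.
    """
    # Sort df_b by date so that every window is a contiguous slice.
    b_sorted = df_b[on].sort_values()
    b_values = b_sorted.to_numpy()
    a_values = df_a[on].to_numpy()
    window = np.timedelta64(days, "D")

    # For each row of df_a, find the slice of df_b inside its window.
    lo = np.searchsorted(b_values, a_values, side="left")
    hi = np.searchsorted(b_values, a_values + window, side="right")
    counts = hi - lo

    # Expand the slices into flat positional arrays.
    a_pos = np.repeat(np.arange(len(df_a)), counts)
    if counts.sum() > 0:
        b_pos = np.concatenate([np.arange(l, h) for l, h in zip(lo, hi)])
    else:
        b_pos = np.array([], dtype=int)

    return pd.MultiIndex.from_arrays(
        [df_a.index[a_pos], b_sorted.index[b_pos]]
    )


# Small usage example with made-up data.
df_a = pd.DataFrame(
    {"start": pd.to_datetime(["2020-01-01", "2020-01-10"])},
    index=["a0", "a1"],
)
df_b = pd.DataFrame(
    {"start": pd.to_datetime(["2020-01-02", "2020-01-04", "2020-01-10"])},
    index=["b0", "b1", "b2"],
)
pairs = date_window_pairs(df_a, df_b, on="start", days=2)
```

The resulting MultiIndex has the same shape as the candidate pairs an indexer returns, so it should be usable wherever those are expected, for example as the first argument to `recordlinkage.Compare().compute(pairs, df_a, df_b)`. Unlike the sorted neighbourhood approach, the window here is measured in days rather than in rank positions of the sorted keys.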

Hope it helps. I will take a look at how we can support an algorithm like this.

kevohagan commented 6 years ago

Okay thank you for the hint! I'll have a try and let you know :)