J535D165 / recordlinkage

A powerful and modular toolkit for record linkage and duplicate detection in Python
http://recordlinkage.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Tool to merge/combine two linked dataframes. #27

Open J535D165 opened 7 years ago

J535D165 commented 7 years ago

After successfully linking two dataframes, there is often a need to make one unified dataframe. There should be a tool to combine the information from both dataframes into one. One tool might not be enough to cover all use cases, so a set of tools might be a good idea. Ideas are welcome.

Similar tools are needed for deduplication. A new issue will be made for that case.

joelbecker commented 7 years ago

@jillianderson8 and I have been working on designing a new Fuse class, which would act as a tool for merging datasets. The basic idea is as follows:

  1. Pass in Compare.vectors, dfA, and dfB.
  2. Extend each row in Compare.vectors using corresponding data from dfA and dfB.
  3. Integrate duplicate columns using a variety of conflict resolution strategies, as described in Peter Christen's data matching book.
  4. Return the integrated data.

We envision the Fuse class being implemented in a way that is very similar to the Compare class. We have created some draft code showing how we think this class could be implemented (see this Gist). Here's what we think the Fuse API would look like, assuming we're extending the recordlinkage 0.9.0 API:

# Assumed imports
import pandas as pd
import recordlinkage as rl

# Original data (placeholders; the annotations indicate the expected types)
my_df_a: pd.DataFrame
my_df_b: pd.DataFrame

# Do core data integration analysis
my_comp: rl.Compare
my_classifier: rl.Classifier

# Refine your pairs
my_vectors = my_comp.vectors.iloc[my_classifier.predict(my_comp.vectors)]

# Perform Data fusion
my_fuse = Fuse(my_vectors, my_comp.df_a, my_comp.df_b)
my_fuse.no_gossiping('col1', 'col2')
my_fuse.no_gossiping('col3', 'col4')
my_fuse.trust_your_friends('col5', 'col6', trusted='a')
my_fuse.take_the_information('col7', 'col8')

# Get output
my_integrated_data = my_fuse.fuse()
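
For readers unfamiliar with the strategy names used above, here is a minimal, hypothetical sketch of what the three conflict resolution strategies could do for a single pair of conflicting values. The semantics follow the data fusion literature referenced later in this thread; the actual Fuse implementation may resolve whole columns at once and may differ in detail.

import numpy as np
import pandas as pd

def no_gossiping(val_a, val_b):
    # Conflict avoiding: keep a value only if both sources agree.
    return val_a if val_a == val_b else np.nan

def trust_your_friends(val_a, val_b, trusted='a'):
    # Conflict deciding: always take the value from the trusted source.
    return val_a if trusted == 'a' else val_b

def take_the_information(val_a, val_b):
    # Prefer any non-missing value over a missing one.
    return val_a if pd.notna(val_a) else val_b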

We plan to implement this feature soon on the networks-lab/recordlinkage branch, to keep up with our lab's internal data integration needs. However, we would appreciate your input on our proposed design, since we would like to eventually submit another pull request to integrate this feature into the main recordlinkage repository. We'd be particularly interested in how to design the Fuse class so that it can eventually be included in the potentially upcoming Pipeline feature. One big question is whether the Pipeline class could facilitate moving from classification to fusion.

I will be gone until next Monday, so won't respond immediately, but I'm looking forward to hearing your thoughts!

J535D165 commented 7 years ago

This sounds very cool. I would love to integrate this into the toolkit.

The following is a write-up of my thoughts. I am open to all ideas.

The API

As mentioned in PR #37, a change to the Compare API is coming. The Index API has already changed. My goal is to convert each class into a structure similar to the Estimator class in TensorFlow and scikit-learn (see this paper: https://arxiv.org/pdf/1309.0238.pdf). For the Fuse class, I think it should look like this:

# Initialise Data fusion (No computations or fusing)
my_fuse = Fuse()
my_fuse.no_gossiping('col1', 'col2')
my_fuse.no_gossiping('col3', 'col4')
my_fuse.trust_your_friends('col5', 'col6', trusted='a')
my_fuse.take_the_information('col7', 'col8')

# Do the computation of all the steps above
my_integrated_data = my_fuse.fuse(my_vectors, my_comp.df_a, my_comp.df_b)

In fact, this is not a huge difference; the backend remains almost the same. Please note that all computations are performed when calling the .fuse(...) method. All steps before that are used to configure the Fuse class. Hope this sounds clear :) This weekend I will push the new Compare class, in which this is implemented, to the new-compare-api branch.
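
A rough sketch of how that "configure first, compute in .fuse()" pattern could look internally; the class and attribute names here are illustrative placeholders rather than the final API, and only one strategy is shown.

import pandas as pd

class Fuse(object):

    def __init__(self):
        # Resolution steps are only queued here; nothing is computed yet.
        self._queue = []

    def trust_your_friends(self, col_a, col_b, trusted='a'):
        self._queue.append((col_a, col_b, trusted))

    def fuse(self, vectors, df_a, df_b):
        # All computation happens here: align the linked records from both
        # dataframes on the candidate-link MultiIndex, then resolve each
        # queued column pair into a single output column.
        idx_a = vectors.index.get_level_values(0)
        idx_b = vectors.index.get_level_values(1)
        fused = pd.DataFrame(index=vectors.index)
        for col_a, col_b, trusted in self._queue:
            values_a = df_a.loc[idx_a, col_a].values
            values_b = df_b.loc[idx_b, col_b].values
            fused[col_a] = values_a if trusted == 'a' else values_b
        return fused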

The advantages are:

Functionality

I am not very familiar with the fusion techniques, but they seem very useful.

One-to-many and many-to-many matches are a major challenge during the fusion process. An option to enforce one-to-one matching (a greedy approach) would be great. This is only relevant when linking datasets.
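
To make the one-to-one idea concrete, a small sketch (not part of the library) of the greedy approach: sort the classified pairs by score and accept a link only if neither record has been used yet. It assumes the scores are available as a pandas Series with a (index_a, index_b) MultiIndex.

def greedy_one_to_one(scores):
    # scores: Series indexed by (index_a, index_b) with a similarity score
    # per candidate link. Returns the accepted links as a list of tuples.
    used_a, used_b, accepted = set(), set(), []
    for (a, b), score in scores.sort_values(ascending=False).items():
        if a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            accepted.append((a, b))
    return accepted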

For deduplication, you often see that multiple records belong to the same entity while not forming a complete graph. One of my goals is to find a good algorithm to deal with this. For example, records A, B, C and D form a complete graph except that one edge is not classified as a match, so we have 5 of the 6 possible edges (a ratio of about 0.83). I would like to have a method where this threshold can be set, at 0.8 for example. However, this is not very urgent at the moment.
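
As a quick sketch of that idea: NetworkX's graph density is exactly the "edges present / edges possible" fraction described above, so a candidate cluster could be accepted or rejected with a simple threshold (assuming the matched pairs are available as an edge list).

import networkx as nx

# Matched pairs from the classifier; records A, B, C and D with one missing
# edge, i.e. 5 of the 6 possible edges.
matches = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]

G = nx.Graph(matches)
for cluster in nx.connected_components(G):
    subgraph = G.subgraph(cluster)
    # density == edges present / edges possible for this cluster
    if nx.density(subgraph) >= 0.8:
        print(sorted(cluster), round(nx.density(subgraph), 2))  # ['A', 'B', 'C', 'D'] 0.83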

joelbecker commented 7 years ago

Thanks for your input, this is very helpful. Your proposed "initialize first, compute second" model is what I expected based on the Pipeline class you described.

I very much agree that enforcing one-to-one or one-to-many mappings is important. I implemented something similar in my own data fusion workflow, which basically consisted of three steps/functions:

@jillianderson8 and I had decided not to include the ranking/refining stages in the Fuse class, because we figured that restricting its scope to "fuse and resolve conflicts" would give it higher cohesion. On the other hand, we think that enforcing a mapping is a very important feature, and we would defer to you if you think it fits in the Fuse class.

Going forward, I'll plan to implement Fuse as we've discussed. For now, I will assume that mapping enforcement belongs in Fuse and will design the internal interface to support it, pending your decision.

joelbecker commented 7 years ago

@J535D165 As I've done more thinking about this, I've realized that the proposed workflow would not handle the case where one wants to integrate a cluster of candidate links into the same record in the fused data set. I have found a number of good resources (e.g. http://ilpubs.stanford.edu:8090/856/1/2007-24.pdf, http://www.bioinf.at/teaching/ss2011/pr-pr5h/data_fusion_a1-bleiholder.pdf, https://edoc.hu-berlin.de/bitstream/handle/18452/3112/197.pdf?sequence=1&isAllowed=y) and will do more reading to better understand the entity resolution problem before I begin. My main concern is that I don't want our implementation to close the door on advanced data fusion techniques. I'll keep you posted as I learn more.

Edit:

@J535D165 I re-read your earlier message and realized that I had missed your comments on data fusion for deduplication, which is equivalent to my concern about clustered candidate links. I agree that this question can be deferred until we find suitable algorithms for identifying clusters.

When starting on the Fuse class, I will do my best to design it in an extensible way, so that we can add deduplication/cluster fusion in the future. I'm leaning towards separating these approaches into LinkFuse and ClusterFuse subclasses, where LinkFuse chooses/fuses individual candidate links (e.g. for linking two data frames) and ClusterFuse implements some cluster detection algorithm (e.g. for deduplication). FuseCore would contain conflict resolution functions that can resolve conflicts between an arbitrary number of values and would suit either approach.
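
In class terms, that split might look roughly like this (the three class names are the ones proposed above; the method body is an illustrative placeholder only):

class FuseCore(object):
    # Shared conflict resolution: resolve a conflict between an arbitrary
    # number of values for the same attribute.
    def resolve(self, values):
        # Illustrative only: prefer the first non-missing value.
        non_missing = [v for v in values if v is not None]
        return non_missing[0] if non_missing else None


class LinkFuse(FuseCore):
    # Chooses/fuses individual candidate links, e.g. when linking two
    # dataframes.
    pass


class ClusterFuse(FuseCore):
    # Detects clusters of linked records (e.g. for deduplication) and fuses
    # each cluster into a single record.
    pass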

J535D165 commented 7 years ago

Thanks for your thoughts on the Fuse part. Making inherited Fuse classes for linking and clustering sounds good. In my opinion, that is the best place to do the mapping.

For now, the most valuable additions are the fusion algorithms themselves. The current design idea for the Fuse class is a good starting point.

jpweytjens commented 5 years ago

Has any progress been made on the Fuse class? If not, I would like to help where possible to add this functionality.

jpweytjens commented 5 years ago

I'm curious to what extent the NetworkX package could be used to implement some of the linking strategies.

For example, if one wants to link two databases in a one-to-one fashion, it might be possible to use the bipartite graph matching algorithms provided by NetworkX. I'm going to experiment with this and I will create a PR if this works.
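
A rough sketch of that idea, assuming the classified links are available as a scored pandas Series with a (index_a, index_b) MultiIndex and a recent NetworkX version in which max_weight_matching returns a set of matched node pairs:

import networkx as nx
import pandas as pd

def one_to_one(scores):
    # Build a bipartite graph of candidate links weighted by their score and
    # keep the one-to-one subset with maximum total score.
    G = nx.Graph()
    for (a, b), score in scores.items():
        # Prefix the node names so identical labels in df_a and df_b do not
        # collide in the graph.
        G.add_edge(("a", a), ("b", b), weight=score)
    matching = nx.max_weight_matching(G)
    pairs = []
    for u, v in matching:
        if u[0] == "b":  # make sure the df_a record comes first
            u, v = v, u
        pairs.append((u[1], v[1]))
    return pd.MultiIndex.from_tuples(pairs, names=scores.index.names)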

Which other linking or deduplication strategies would be interesting enough to implement (with the help of the NetworkX package)?