@jillianderson8 and myself have been working on designing a new Fuse class which would act as a tool for merging datasets. The basic idea is as follows:

- The Fuse class is initialized with Compare.vectors, dfA, and dfB.
- Conflict resolution methods fuse the columns of Compare.vectors using corresponding data from dfA and dfB.

We envision the Fuse class being implemented in a way that is very similar to the Compare class. We have created some draft code showing how we think this class would be implemented (see this Gist). Here's what we think the Fuse API would look like, assuming we're extending the recordlinkage 0.9.0 API:
import pandas as pd
import recordlinkage as rl

# Original data
my_df_a: pd.DataFrame
my_df_b: pd.DataFrame

# Do core data integration analysis (indexing, comparing, classifying)
my_comp: rl.Compare
my_classifier: rl.Classifier

# Refine your pairs: keep only the pairs classified as matches
my_vectors = my_comp.vectors.iloc[my_classifier.predict(my_comp.vectors)]

# Perform data fusion
my_fuse = Fuse(my_vectors, my_comp.df_a, my_comp.df_b)
my_fuse.no_gossiping('col1', 'col2')
my_fuse.no_gossiping('col3', 'col4')
my_fuse.trust_your_friends('col5', 'col6', trusted='a')
my_fuse.take_the_information('col7', 'col8')

# Get output
my_integrated_data = my_fuse.fuse()
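As a rough illustration of the strategies used above (the names come from conflict resolution strategies in the data fusion literature), here is the kind of behaviour each could have for a single pair of columns. This is only a sketch, assuming the two Series are the corresponding columns from dfA and dfB aligned on the candidate links, and is not a final specification:

```python
import pandas as pd

def take_the_information(a: pd.Series, b: pd.Series) -> pd.Series:
    """Prefer non-null values; where both sides have a value, keep a's."""
    return a.combine_first(b)

def no_gossiping(a: pd.Series, b: pd.Series) -> pd.Series:
    """Only keep values the two sources agree on; contradictions become null."""
    agreed = a.where(a == b)
    # a value missing on one side is not a contradiction, so keep the other side
    agreed = agreed.combine_first(a.where(b.isna()))
    return agreed.combine_first(b.where(a.isna()))

def trust_your_friends(a: pd.Series, b: pd.Series, trusted: str = 'a') -> pd.Series:
    """Take the value from the trusted source, falling back to the other side
    where the trusted source is missing."""
    preferred, other = (a, b) if trusted == 'a' else (b, a)
    return preferred.combine_first(other)
```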
We plan to implement this feature soon on the networks-lab/recordlinkage
branch, to keep up with our lab's internal data integration needs. However, we would appreciate your input on our proposed design, since we would like to eventually submit another pull request to integrate this feature into the main recordlinkage
repository. We'd be particularly interested in how to design the Fuse
class in a way that can eventually be included in the potentially upcoming Pipeline
feature. One big question is whether the Pipeline
class could facilitate moving from classification to fusion.
I will be gone until next Monday, so won't respond immediately, but I'm looking forward to hearing your thoughts!
This sounds very cool. I would love to integrate this into the toolkit.
The following is a write-up of my thoughts. I am open to all ideas.
As mentioned in PR #37, a change to the Compare API is coming. The Index API has changed already. My goal is to convert each class into a structure similar to the Estimator class in TensorFlow and scikit-learn (see this paper: https://arxiv.org/pdf/1309.0238.pdf). For the Fuse class, I think it should look like this:
# Initialise Data fusion (No computations or fusing)
my_fuse = Fuse()
my_fuse.no_gossiping('col1', 'col2')
my_fuse.no_gossiping('col3', 'col4')
my_fuse.trust_your_friends('col5', 'col6', trusted='a')
my_fuse.take_the_information('col7', 'col8')
# Do the computation of all the steps above
my_integrated_data = my_fuse.fuse(my_vectors, my_comp.df_a, my_comp.df_b)
In fact, this is not a huge difference; the backend remains almost the same. Please note that all computations are performed when calling the .fuse(...) method. All the steps before that only configure the Fuse class. Hope this sounds clear :) This weekend I will push the new Compare class to the new-compare-api branch, where this is implemented.
One advantage of this structure is that the same configured class can be reused on chunks of data, for example data read with pandas.read_csv(..., chunksize=10000). I am not very familiar with data fusion techniques, but they seem very useful.
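To illustrate the chunking advantage: the same configured Fuse object could be applied chunk by chunk, roughly like this. This is only a sketch assuming the configure-then-compute API above; the file name, column names, and pre-loaded df_a/df_b are placeholders.

```python
import pandas as pd

# Configure once: no computation happens here.
my_fuse = Fuse()
my_fuse.no_gossiping('col1', 'col2')
my_fuse.trust_your_friends('col5', 'col6', trusted='a')

# Reuse the same configured object on each chunk of comparison vectors.
fused_chunks = []
for chunk in pd.read_csv('comparison_vectors.csv', index_col=[0, 1], chunksize=10000):
    fused_chunks.append(my_fuse.fuse(chunk, df_a, df_b))

my_integrated_data = pd.concat(fused_chunks)
```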
One-to-many and many-to-many matches are a great struggle during the fusion process. Maybe an option to enforce one-to-one matching (a greedy approach) would be great. This is only useful when linking two datasets.
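A minimal sketch of such a greedy approach, outside of recordlinkage, assuming a DataFrame of matches indexed by an (index_a, index_b) MultiIndex with a 'score' column to rank on:

```python
def greedy_one_to_one(matches, score_col='score'):
    """Keep the best-scoring link per record, so each record from df_a and
    df_b appears in at most one link (greedy, not globally optimal)."""
    ranked = matches.sort_values(score_col, ascending=False)
    used_a, used_b, keep = set(), set(), []
    for idx_a, idx_b in ranked.index:
        if idx_a in used_a or idx_b in used_b:
            continue                      # one of the records is already linked
        used_a.add(idx_a)
        used_b.add(idx_b)
        keep.append((idx_a, idx_b))
    return matches.loc[keep]
```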
For deduplication, you often see that multiple records belong to the same entity while they do not form a complete graph. One of my goals is to find a good algorithm to deal with this. For example, records A, B, C, and D form a complete graph except that one edge is not classified as a match, so we have 5 edges instead of 6 (a ratio of about 0.83). I would like to have a method where this ratio threshold can be set at 0.8, for example. However, this is not very urgent at the moment.
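One way this edge-ratio idea could be operationalised, sketched here with NetworkX (which also comes up later in this thread); the function name, the input format, and the threshold handling are illustrative only:

```python
import networkx as nx

def dense_clusters(matched_pairs, min_edge_ratio=0.8):
    """Group records whose match graph is dense enough to treat as one entity.
    `matched_pairs` is an iterable of (record_id, record_id) links."""
    graph = nx.Graph()
    graph.add_edges_from(matched_pairs)
    clusters = []
    for component in nx.connected_components(graph):
        nodes = list(component)
        possible = len(nodes) * (len(nodes) - 1) / 2
        actual = graph.subgraph(nodes).number_of_edges()
        if actual / possible >= min_edge_ratio:   # e.g. 5/6 ≈ 0.83 passes at 0.8
            clusters.append(set(nodes))
    return clusters
```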
Thanks for your input, this is very helpful. Your proposed "initialize first, compute second" model is what I expected based on the Pipeline
class you described.
I very much agree that enforcing one-to-one or one-to-many mappings is important. I implemented something similar in my own data fusion workflow, which basically consisted of three steps/functions: ranking the candidate links (using output from a Classifier, etc.), refining the set of links, and fusing the remaining links.
@jillianderson8 and I had decided not to include the ranking/refining stages in the Fuse class, because we figured that restricting its scope to "fuse and resolve conflicts" would give it higher cohesion. On the other hand, we think that enforcing mappings is a very important feature, and we would defer to you if you think it fits in the Fuse class.
Going forward, I plan to implement Fuse as we've discussed. For now, I will assume that mapping enforcement will be included in Fuse and design the internal interface to support it, while waiting to hear whether you think it fits there.
@J535D165 As I've done more thinking about this, I've realized that this proposed workflow would not handle the case where one wants to integrate a cluster of candidate links into the same record in the fused data set. I have found a number of good resources (e.g. http://ilpubs.stanford.edu:8090/856/1/2007-24.pdf, http://www.bioinf.at/teaching/ss2011/pr-pr5h/data_fusion_a1-bleiholder.pdf, https://edoc.hu-berlin.de/bitstream/handle/18452/3112/197.pdf?sequence=1&isAllowed=y), so I will do more reading to better understand the entity resolution problem before I begin. My main concern is that I don't want our implementation to close the door on advanced data fusion techniques. I'll keep you posted as I learn more.
Edit:
@J535D165 I re-read your earlier message, and realize that I had missed your comments on data fusion for deduplication, which is equivalent to my concerns about clustered candidate links. I agree that this question can be deferred until we find suitable algorithms for identifying clusters.
As I start on the Fuse class, I will do my best to design it in an extensible way, so that we can add deduplication/cluster fusion in the future. I'm leaning towards separating these approaches into LinkFuse and ClusterFuse subclasses, where LinkFuse chooses/fuses individual candidate links (e.g. for linking two data frames) and ClusterFuse implements some cluster detection algorithm (e.g. for deduplication). A shared FuseCore would contain conflict resolution functions that can resolve conflicts between an arbitrary number of values, and would be suitable for either approach.
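A rough structural sketch of that split, to make the idea concrete; the method bodies and any names beyond the strategies already mentioned in this thread are placeholders, not a committed design:

```python
class FuseCore:
    """Conflict resolution that works on an arbitrary number of values."""

    def __init__(self):
        self.resolution_queue = []   # (strategy_name, columns, options)

    def trust_your_friends(self, col_a, col_b, trusted='a'):
        self.resolution_queue.append(('trust_your_friends', (col_a, col_b), {'trusted': trusted}))

    def take_the_information(self, col_a, col_b):
        self.resolution_queue.append(('take_the_information', (col_a, col_b), {}))

    @staticmethod
    def first_non_null(values):
        """Example resolver: works for a pair of values or a whole cluster."""
        return next((v for v in values if v is not None), None)


class LinkFuse(FuseCore):
    """Chooses/fuses individual candidate links between two data frames."""

    def fuse(self, vectors, df_a, df_b):
        raise NotImplementedError   # one fused record per accepted link


class ClusterFuse(FuseCore):
    """Detects and fuses clusters of records, e.g. for deduplication."""

    def fuse(self, vectors, df):
        raise NotImplementedError   # one fused record per detected cluster
```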
Thanks for your thoughts about the Fuse part. Making inherited Fuse classes for linking and clustering sounds good. In my opinion, this is the best place to do the mapping.
For now, the most valuable additions are the fusion algorithms themselves. The current idea about the design of the Fuse class is a good starting point.
Has any progress been made on the Fuse class? If not, I would like to help where possible to add this functionality.
I'm curious to what extent the NetworkX package can be used to implement some of the linking strategies.
For example, if one wants to link two databases in a one-to-one fashion, it might be possible to use the bipartite graph matching algorithms provided by NetworkX
. I'm going to experiment with this and I will create a PR if this works.
Which other linking or deduplication strategies are interesting enough to implement (with help of the NetworkX
package)?
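Roughly, the kind of thing I want to try is sketched below. It assumes a DataFrame of candidate links indexed by an (index_a, index_b) MultiIndex with a 'score' column; the function name and input format are illustrative only.

```python
import networkx as nx

def one_to_one(matches, score_col='score'):
    """Select a one-to-one set of links that maximises the total score."""
    graph = nx.Graph()
    for (idx_a, idx_b), score in matches[score_col].items():
        # prefix the node names so identical labels in df_a and df_b cannot collide
        graph.add_edge(('a', idx_a), ('b', idx_b), weight=score)
    matching = nx.max_weight_matching(graph)
    selected = []
    for left, right in matching:
        # max_weight_matching returns unordered pairs; restore (a, b) order
        (_, idx_a), (_, idx_b) = sorted((left, right))   # 'a' sorts before 'b'
        selected.append((idx_a, idx_b))
    return matches.loc[selected]
```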
After successfully linking two dataframes, there is often a need to make one unified dataframe. There should be a tool to combine the information of both dataframes into one. I think one tool might not be enough to cover all use cases, so a set of tools might be a good idea. Ideas are welcome.
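As one possible shape for such a tool, a sketch: build one fused row per accepted link, resolving each pair of columns with a user-supplied function. The helper name and its arguments are illustrative, not a proposed API.

```python
import pandas as pd

def unify(links, df_a, df_b, resolvers):
    """links: MultiIndex of (index_a, index_b) pairs accepted as matches.
    resolvers: {output_column: (col_a, col_b, function(value_a, value_b))}."""
    rows_a = df_a.loc[links.get_level_values(0)].reset_index(drop=True)
    rows_b = df_b.loc[links.get_level_values(1)].reset_index(drop=True)
    fused = pd.DataFrame(index=rows_a.index)
    for out_col, (col_a, col_b, resolve) in resolvers.items():
        fused[out_col] = [resolve(va, vb) for va, vb in zip(rows_a[col_a], rows_b[col_b])]
    return fused

# e.g. unify(my_links, df_a, df_b,
#            {'name': ('name_a', 'name_b', lambda a, b: a if pd.notna(a) else b)})
```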
Similar tools are needed for deduplication. A new issue will be made for that case.