anhaidgroup / py_entitymatching

BSD 3-Clause "New" or "Revised" License

Parallel Blocker's Output Candidate Dataframe's Format #168

Open EricLYunqi opened 1 month ago

EricLYunqi commented 1 month ago

When using the parallel blocker (i.e., setting n_jobs=-1) in the "block_tables" function, the row index of the candidate DataFrame is not properly adjusted: row indexes are repeated rather than running consecutively from 1 to the end.
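A minimal pandas illustration of what is probably happening (a guess at the cause, not the actual py_entitymatching internals): if each worker builds its candidate chunk with a fresh default index and the chunks are concatenated without reindexing, the per-chunk labels survive and repeat in the combined frame.

```python
import pandas as pd

# Hypothetical illustration of the likely cause (not py_entitymatching code):
# two worker chunks, each with its own default 0..k-1 index.
chunk1 = pd.DataFrame({'ltable_id': [0, 1], 'rtable_id': [10, 11]})
chunk2 = pd.DataFrame({'ltable_id': [2, 3], 'rtable_id': [12, 13]})

# Plain concat keeps each chunk's labels, so they repeat.
combined = pd.concat([chunk1, chunk2])
repeated = list(combined.index)                 # [0, 1, 0, 1]

# ignore_index=True rebuilds a consecutive index over the result.
fixed = pd.concat([chunk1, chunk2], ignore_index=True)
consecutive = list(fixed.index)                 # [0, 1, 2, 3]
```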

I used the following code to reproduce the issue, on Ubuntu 22.04. The datasets are from sparkly: https://pages.cs.wisc.edu/~dpaulsen/sparkly_datasets/structured/dblp-acm.

import py_entitymatching as em
import pandas as pd
import networkx as nx

# load tables
tableA = pd.read_parquet('./table_a.parquet', engine='fastparquet')
tableB = pd.read_parquet('./table_b.parquet', engine='fastparquet')
gold = pd.read_parquet('./gold.parquet', engine='fastparquet')

# set keys
tableA.rename(columns={'_id': 'id'}, inplace=True)
tableB.rename(columns={'_id': 'id'}, inplace=True)
em.set_key(tableA, 'id')
em.set_key(tableB, 'id')

# load gold matches as a graph
graph = nx.Graph()
for row in gold.itertuples(index=False):
    idA = str(row.id1) + 'A'
    idB = str(row.id2) + 'B'
    graph.add_edge(idA, idB)

# block: output all attributes except the key column
outattrA = list(tableA)[1:]
outattrB = list(tableB)[1:]
ob = em.OverlapBlocker()
# parallel
pC = ob.block_tables(tableA, tableB, 'title', 'title',
                     word_level=True, overlap_size=4,
                     l_output_attrs=outattrA,
                     r_output_attrs=outattrB,
                     allow_missing=False,
                     show_progress=False,
                     n_jobs=-1)

pC.to_csv('./parallel_cand.csv', index=True)

# For me, the row index of pC repeats from 1 to 70
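One way to check that only the labels are wrong (and not the candidate pairs themselves) is to compare the parallel output against a serial run (n_jobs=1, the default) as sets of id pairs. A small sketch with toy frames; the column names 'ltable_id'/'rtable_id' are placeholders for whatever key columns your candidate set actually carries:

```python
import pandas as pd

def pair_set(df, lcol, rcol):
    """Collect a candidate frame's (left id, right id) pairs, ignoring row labels."""
    return set(zip(df[lcol], df[rcol]))

# Toy frames standing in for a serial (sC) and a parallel (pC) blocker output:
# same candidate pairs, but pC has repeated row labels.
sC = pd.DataFrame({'ltable_id': [0, 1, 2], 'rtable_id': [10, 11, 12]})
pC = pd.concat([sC.iloc[:2], sC.iloc[2:].reset_index(drop=True)])  # index [0, 1, 0]

same_pairs = pair_set(sC, 'ltable_id', 'rtable_id') == pair_set(pC, 'ltable_id', 'rtable_id')
```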

Because of this, calculating recall by directly traversing the row index of the candidate set may give incorrect results.
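As a workaround until the index is fixed upstream, recall can be computed without relying on row labels at all, e.g. by iterating rows with itertuples (or by calling pC.reset_index(drop=True) first). A sketch with toy data; the gold matches are held in a plain set here, and 'ltable_id'/'rtable_id' are placeholders for the actual key columns:

```python
import pandas as pd

# Index-independent recall computation: the candidate set deliberately has
# repeated row labels, like the parallel blocker's output, but itertuples
# below never touches the labels.
gold_pairs = {('0', '10'), ('1', '11')}

cand = pd.concat([
    pd.DataFrame({'ltable_id': [0], 'rtable_id': [10]}),
    pd.DataFrame({'ltable_id': [1], 'rtable_id': [99]}),
])  # index is [0, 0]

hits = sum(
    (str(r.ltable_id), str(r.rtable_id)) in gold_pairs
    for r in cand.itertuples(index=False)
)
recall = hits / len(gold_pairs)  # finds 1 of 2 gold pairs -> 0.5
```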