dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License

What are the recommended system core and memory benchmarks for the dedupe library on big data? #1024

Open sarbaniAi opened 2 years ago

sarbaniAi commented 2 years ago

I am using the Python dedupe library on a large dataset. We initially tested with a 3K dataset and it ran on an AWS SageMaker instance (ml.r5.4xlarge, 16 vCPU, 128 GiB RAM) within a few minutes. The moment we increased the dataset to 40K, it crashed on this machine.

Q1: Is this usual? Do we need to optimize anything in the library?

Q2: We are now trying a PostgreSQL-based approach. How much memory do we need to run the example, i.e. how many cores, how much memory, and how long should training take? We are getting a memory error with 8 cores and 32 GB RAM.

We would appreciate help with this, as we are using the library for our production system and need to finalize the approach. This is a great library; we just need to resolve the memory error.

Thanks, Sarbani

fgregg commented 2 years ago

it's pretty hard to say, but your hardware seems like it should be more than enough for 40K records

fgregg commented 2 years ago

do you know where in the process you get the memory error?

abhijeetghotra commented 2 years ago

Following this thread. We are facing a similar scalability challenge with Dedupe.

  1. Machine size: 144 GB RAM, 72 cores
  2. 50K–60K records get processed within 15 minutes (in memory)
  3. At ~100K records, we start seeing memory issues
  4. (additional observations) https://stackoverflow.com/questions/72356712/python-dedupe-library-for-bigdata?noredirect=1#comment127827864_72356712
sarbaniAi commented 2 years ago

We are now using the PostgreSQL approach; see https://github.com/dedupeio/dedupe-examples/tree/master/pgsql_big_dedupe_example
Version used: 2.0.13. With 18K total records on 16 cores and 64 GB RAM, it takes about 20 minutes to run, including manual labelling, without any memory crash.

One issue: version 2.0.14 throws an error due to a compatibility problem (discussed in other threads here).

Also, 2.0.14 was giving slower performance.

If you are running with more than 10K records, PostgreSQL will give better performance. We are now targeting 3 million records.
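
For readers following the same path: the core idea of that example is to let dedupe generate block keys and push pair generation into PostgreSQL. A minimal sketch of the blocking step, assuming dedupe 2.x and psycopg2 (the connection string, table names, and column names below are placeholders, not the example's actual schema):

```python
import dedupe
import psycopg2
import psycopg2.extras

# Placeholder connection string; swap in your own DSN.
conn = psycopg2.connect("dbname=dedupe_db")

# Load the model trained earlier (placeholder settings-file path).
with open('learned_settings', 'rb') as sf:
    deduper = dedupe.StaticDedupe(sf)

# Server-side cursor so the rows are streamed rather than loaded all at once.
read_cur = conn.cursor('read_people', cursor_factory=psycopg2.extras.RealDictCursor)
read_cur.execute("SELECT person_id, first_name, last_name, date_of_birth FROM people")

write_cur = conn.cursor()

# Apply the learned blocking rules and store (block_key, record_id) rows.
# Candidate pairs are then produced by a self-join on blocking_map inside Postgres,
# which is what keeps memory usage flat compared to the fully in-memory approach.
records = ((row['person_id'], row) for row in read_cur)
for block_key, record_id in deduper.fingerprinter(records):
    write_cur.execute(
        "INSERT INTO blocking_map (block_key, person_id) VALUES (%s, %s)",
        (block_key, record_id),
    )

conn.commit()
```

Scoring and clustering then read the joined pairs back out of the database in batches, as in the linked example.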

fgregg commented 2 years ago

About 100K records is probably when it makes sense to move to a database.

> Also, 2.0.14 was giving slower performance.

Could you say more about what you were seeing with this?

abhijeetghotra commented 2 years ago

Sure @fgregg. Background:

  1. We trained dedupe and generated a settings file.
  2. We use a StaticDedupe object to load the settings file, and provided a dataset of 100K records with 7 fields for matching (see the sketch after this list).
  3. Given the large size, we generated the pairs and scores using the prescribed functions; for 100K records we had roughly 3.7 million pairs.
  4. The program runs perfectly and pretty fast (~15 minutes) through the pairing and scoring process.
  5. When we run clustering on the 3.7 million pairs, we observe the following:
     a. The clustering step seems to take forever; we shut it down after 12 hours. (No CPU utilization spikes are observed, unlike when the program was generating the 3.7 million pairs.)
     b. We get warnings from clustering.py (see below).
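
For context, a minimal sketch of that flow, assuming dedupe 2.x (the settings-file path and the tiny dataset below are placeholders; the field names must match whatever the model was trained on):

```python
import dedupe

# Load the model produced during training (step 1 above); placeholder path.
with open('learned_settings', 'rb') as sf:
    deduper = dedupe.StaticDedupe(sf)

# Tiny illustrative dataset; in our runs this dict held ~100K records keyed by id.
data_d = {
    1: {'first_name': 'jane', 'last_name': 'doe', 'phone': '5551234'},
    2: {'first_name': 'jane', 'last_name': 'doe', 'phone': '5551234'},
    3: {'first_name': 'john', 'last_name': 'roe', 'phone': '5550000'},
}

pairs = deduper.pairs(data_d)    # candidate pairs from the learned blocking rules (step 2)
scores = deduper.score(pairs)    # memory-mapped array of scored pairs (steps 3-4)

# Step 5: this is the call that appears to hang once there are ~3.7 million scored pairs.
clusters = deduper.cluster(scores, threshold=0.5)
for cluster_ids, cluster_scores in clusters:
    print(cluster_ids, cluster_scores)
```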

Have tried the following approaches so far:

  1. Created a graph from the pairs returned by the pairs function and then used connected components to send only pairs that are connected (see the sketch after this list). Even so, based on the data, we still have a huge number of connected pairs (~87,000) among the 100K records.

  2. Turned on the option to run the pairs in memory.
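
A minimal sketch of that pre-filtering idea using networkx (the id pairs here are illustrative; in our case they came from the pairs function):

```python
import networkx as nx

# Illustrative candidate pairs as (record_id, record_id) tuples.
id_pairs = [(1, 2), (2, 3), (10, 11)]

g = nx.Graph()
g.add_edges_from(id_pairs)

# Split the edge list by connected component so each component can be scored and
# clustered on its own, instead of handing one giant edge list to clustering.
for component in nx.connected_components(g):
    component_pairs = [(a, b) for a, b in id_pairs if a in component and b in component]
    print(sorted(component), component_pairs)
```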

Next Test:

  1. Update the code to change max_components from 30K to 100K and see how performance changes.

Based on your experience, please let me know if there are other options I can explore. Compute and memory are not an issue, and the ultimate goal is to scale the solution to 5 million plus records.

Warning (the same message is repeated six times):

```
Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0
A component contained 89927 elements.
```
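
One related option would be to drop low-scoring edges from the scored array before it reaches cluster(), so that weak links do not glue an 89,927-element component together. A minimal sketch, assuming the dedupe 2.x scored array is a numpy structured array with 'pairs' and 'score' fields (that layout is an assumption worth verifying):

```python
import numpy as np

# Illustrative stand-in for the array returned by deduper.score(pairs).
scores = np.array(
    [((1, 2), 0.97), ((2, 3), 0.10), ((10, 11), 0.85)],
    dtype=[('pairs', 'i8', 2), ('score', 'f4')],
)

MIN_SCORE = 0.5  # placeholder cutoff
filtered = scores[scores['score'] > MIN_SCORE]

# Only the high-confidence edges would then be passed on to deduper.cluster(filtered, ...).
print(filtered)
```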


sarbaniAi commented 2 years ago

Today we observed some accuracy concerns (we are using the PostgreSQL approach). A lot of duplicates are tagged as unique. We are using first_name, last_name, gender, marital_status, ssn_last_4, race, ethnicity, phone, date_of_birth, and address for the dedupe logic. Many records are identical on all fields except address, or identical on everything except phone or ethnicity.

We expect to see these tagged as duplicates with a confidence score just under 1. There are records that do get tagged as duplicates with 98 or 99% confidence despite a mismatch on such fields, but a lot of similar records are still not tagged as duplicates.

I labelled such cases during manual training. I trained on ~100 records out of 18K.

Where do we change the code to fix this logic? I thought training would take care of it.
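
For reference, the place where those fields enter the model is the variable definition passed to dedupe; a minimal sketch in the dedupe 2.x style (the variable types chosen here are illustrative guesses, not necessarily what was used above):

```python
import dedupe

# Field definition in the dedupe 2.x dict style; the field names come from the
# comment above, the types are illustrative guesses.
fields = [
    {'field': 'first_name', 'type': 'String'},
    {'field': 'last_name', 'type': 'String'},
    {'field': 'gender', 'type': 'Exact', 'has missing': True},
    {'field': 'marital_status', 'type': 'Exact', 'has missing': True},
    {'field': 'ssn_last_4', 'type': 'Exact', 'has missing': True},
    {'field': 'race', 'type': 'Exact', 'has missing': True},
    {'field': 'ethnicity', 'type': 'Exact', 'has missing': True},
    {'field': 'phone', 'type': 'String', 'has missing': True},
    {'field': 'date_of_birth', 'type': 'String', 'has missing': True},
    {'field': 'address', 'type': 'String', 'has missing': True},
]

deduper = dedupe.Dedupe(fields)
# The relative weight of each field is learned from the labelled pairs during
# active learning, not set by hand here, so with only ~100 labels the model may
# simply not have seen enough "same person, different address/phone" examples.
```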

fgregg commented 2 years ago

you'll need to get more training data.

sarbaniAi commented 2 years ago

We are seeing this issue consistently. For example, I want records with the same fname, lname, and date-of-birth, or the same fname, lname, and ssn, to be labelled as duplicates. Here I don't want to put any weight on address or phone number.

I am using fname, lname, ssn, date-of-birth, address, and phone as the fields for dedupe.

I trained on approximately 100 records out of 3K.

But the results show a lot of records with the same fname, lname, and date-of-birth, or the same fname, lname, and ssn, as unique. Some of them have all fields the same; a few differ only on phone or address.

Very few are showing as duplicates. How do we fix this issue?

How can we put more weight on fname/lname/ssn or fname/lname/date-of-birth, and less on address/phone?
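
As far as I know there is no documented per-field weight you can set directly; the weights are learned from the labelled pairs. One way to push the model in that direction is to hand it extra labelled examples that encode the rule, via mark_pairs. A minimal sketch, assuming dedupe 2.x (the records below are invented, and the commented calls belong in the normal training pipeline):

```python
import dedupe

# Field definition matching the fields named above; types are illustrative guesses.
fields = [
    {'field': 'fname', 'type': 'String'},
    {'field': 'lname', 'type': 'String'},
    {'field': 'ssn', 'type': 'Exact', 'has missing': True},
    {'field': 'date_of_birth', 'type': 'String', 'has missing': True},
    {'field': 'address', 'type': 'String', 'has missing': True},
    {'field': 'phone', 'type': 'String', 'has missing': True},
]
deduper = dedupe.Dedupe(fields)

# Hand-built labelled pairs encoding "same fname/lname/date_of_birth is a match
# even when address or phone differ". These records are made up for illustration.
labelled = {
    'match': [
        (
            {'fname': 'jane', 'lname': 'doe', 'ssn': '1234',
             'date_of_birth': '1980-01-01', 'address': '1 main st', 'phone': '5551234'},
            {'fname': 'jane', 'lname': 'doe', 'ssn': '1234',
             'date_of_birth': '1980-01-01', 'address': '9 oak ave', 'phone': '5559999'},
        ),
    ],
    'distinct': [],
}

# In the real pipeline this sits inside the usual training flow, roughly:
#   deduper.prepare_training(data_d)
#   deduper.mark_pairs(labelled)      # add the rule-encoding examples
#   dedupe.console_label(deduper)     # plus interactive labelling
#   deduper.train()
```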

fgregg commented 2 years ago

@abhijeetghotra, would it be possible to share the pairwise edges (the memory-mapped numpy array)? That should be completely free of any user-sensitive data.

MokshaVora commented 2 years ago

I'm using an AWS SageMaker instance (ml.r5.4xlarge, 16 vCPU, 128 GiB RAM). I need to dedupe 5.8 million records, but it's taking forever. Due to this scalability issue, I might need to shift to another library. I need a solution by Monday. Can anyone please help me with this?

Should I shift to PostgreSQL for this? I have monitored memory usage and around 40% of RAM is always free. So will shifting to PostgreSQL help in this case?

@fgregg It would be very helpful if you could comment on this.

sarbaniveersatech commented 2 years ago

@MokshaVora Yes, please use PostgreSQL; it is quite fast since the operations run in the database. Our code is running on PostgreSQL.

But I am still working on dedupe, as the accuracy is not as good as we need, and I need to understand how we can improve it. How is the block key formed? Only active learning is exposed, so we are struggling a bit to tune the model.
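
On the block-key question: the keys come from the blocking predicates that dedupe learns during training, and in dedupe 2.x you can apply them yourself through the fingerprinter to see what they produce. A minimal sketch (the settings path and record are placeholders):

```python
import dedupe

# Load a trained model (placeholder path); the learned blocking predicates are
# stored in this settings file.
with open('learned_settings', 'rb') as sf:
    deduper = dedupe.StaticDedupe(sf)

# The fingerprinter applies the learned predicates to records and yields
# (block_key, record_id) pairs; this is what the pgsql_big_dedupe_example writes
# into its blocking_map table. (Index predicates would additionally need
# deduper.fingerprinter.index(...) calls before this step.)
records = [(1, {'first_name': 'jane', 'last_name': 'doe'})]
for block_key, record_id in deduper.fingerprinter(records):
    print(block_key, record_id)
```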

Please connect on LinkedIn; I would like to understand dedupe's internal processing to improve accuracy/recall in our case. https://www.linkedin.com/in/sarbani-maiti-35b89111/