Open sarbaniAi opened 2 years ago
It's pretty hard to say, but your hardware seems like it should be more than enough for 40K records.
Do you know where in the process you get the memory error?
Following this thread. We are facing a similar scalability challenge with Dedupe.
We are now using the PostgreSQL approach. See: https://github.com/dedupeio/dedupe-examples/tree/master/pgsql_big_dedupe_example
Version used: 2.0.13
With 18K total records on 16 cores and 64 GB RAM, it takes about 20 minutes to run, including manual labelling, with no memory crash.
One issue: version 2.0.14 throws an error due to a compatibility problem (discussed here on other threads). 2.0.14 was also giving slow performance.
If you are running with more than 10K records, the PostgreSQL approach will give better performance. We are now targeting 3 million records.
About 100K records is probably the point where it makes sense to move to a database.
Also 2.0.14 was giving slow performance ..
Could you speak about what you were seeing with this?
Sure @fgregg. Background:
We have tried the following approaches so far:
Created a graph from the pairs returned by the pairs function, then used connected components to send only the connected pairs on for scoring. Even so, given our data we have a huge number of potential pairs (~87,000) in 100K records.
Turned on the option to process the pairs in memory.
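The connected-components step described above can be sketched in plain Python with a union-find over the candidate record-id pairs (in practice a graph library would do the same job; the pair format here is illustrative, not dedupe's actual output):

```python
from collections import defaultdict

def connected_components(edges):
    """Group record ids into connected components via union-find.
    Only records that share a component need to be scored against
    each other."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in edges:
        union(a, b)

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())

# Example: five records connected by three candidate pairs
pairs = [(1, 2), (2, 3), (4, 5)]
components = connected_components(pairs)  # → [{1, 2, 3}, {4, 5}]
```

This keeps memory per scoring batch bounded by the largest component, which is exactly why one huge component (as in the warning below) defeats the optimization.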
Next Test:
Based on your experience, please let me know if there are other options I can explore. Compute and memory are not an issue, and the ultimate goal is to scale the solution to 5 million plus records.
Warning (repeated six times):
Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0. A component contained 89927 elements.
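The warning above comes from the clustering step: when a connected component of candidate pairs grows too large, dedupe tries to break it up by dropping low-score edges. With a re-filter threshold of 0.0, it appears no edge is ever dropped, which would explain why the same 89927-element component keeps recurring. A minimal sketch of that kind of edge filtering (the edge format and function name are illustrative, not dedupe's internals):

```python
def refilter_component(edges, threshold):
    """Keep only edges whose pairwise score is at or above
    `threshold`. Dropping low-score edges can split one huge
    connected component into several smaller ones; with
    threshold 0.0 nothing is dropped and the component keeps
    its original size."""
    return [(a, b, score) for a, b, score in edges if score >= threshold]

edges = [("r1", "r2", 0.95), ("r2", "r3", 0.40), ("r3", "r4", 0.91)]
strong = refilter_component(edges, 0.5)  # drops the weak r2-r3 edge
```

Dropping the 0.40 edge here would split {r1, r2, r3, r4} into {r1, r2} and {r3, r4}.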
Today we observed some accuracy concerns (we are using the PostgreSQL approach). A lot of duplicates are tagged as unique. We are using first_name, last_name, gender, marital_status, ssn_last_4, race, ethnicity, phone, date_of_birth, and address for the dedupe logic. Many records are identical on all fields except address, or identical on everything except phone or ethnicity.
We expect to see those tagged as duplicates with a confidence score just under 1. There are records that do come back as duplicates with 98 or 99% confidence where the mismatch is on such fields, but a lot of similar records are still not tagged as duplicates.
I labelled such cases when I was manually training. I trained on ~100 pairs out of the 18K records.
Where in the code can we change this logic? I thought training would take care of it.
You'll need to get more training data.
We are seeing this issue consistently. For example, I want records with the same fname, lname, and date_of_birth, or the same fname, lname, and ssn, to be labelled as duplicates. In those cases I don't want to put any weight on address or phone number.
I am using fname, lname, ssn, date_of_birth, address, and phone as the fields for dedupe.
I trained on approximately 100 records out of 3K.
But the results show a lot of records with the same fname, lname, and date_of_birth, or the same fname, lname, and ssn, as unique. Some of them are identical on all fields; a few differ on phone or address.
Very few show up as duplicates. How do we fix this issue?
How can we put more weight on fname/lname/ssn or fname/lname/date_of_birth, and less on address/phone?
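Dedupe learns field weights from labelled pairs rather than exposing them directly, so one way to push the learner toward rules like "same fname, lname, date_of_birth implies duplicate" is to seed training with explicitly labelled pairs that match on those fields but differ on address or phone. A sketch under those assumptions (the record layout and helper function are hypothetical; in dedupe 2.x a labelled-examples dict of this `{'match': ..., 'distinct': ...}` shape can be passed to `Dedupe.mark_pairs()` before `train()`):

```python
def make_labeled_examples(records):
    """Pair up records that agree on fname, lname, and
    date_of_birth and label them as matches, so the learner
    sees explicit evidence that address/phone mismatches do
    not prevent a match. (Illustrative helper, not part of
    the dedupe API.)"""
    by_key = {}
    matches = []
    for rec in records:
        key = (rec["fname"], rec["lname"], rec["date_of_birth"])
        if key in by_key:
            matches.append((by_key[key], rec))
        else:
            by_key[key] = rec
    return {"match": matches, "distinct": []}

records = [
    {"fname": "ana", "lname": "lopez", "date_of_birth": "1980-02-01",
     "phone": "555-1111", "address": "12 Oak St"},
    {"fname": "ana", "lname": "lopez", "date_of_birth": "1980-02-01",
     "phone": "555-9999", "address": "98 Elm Ave"},  # phone/address differ
    {"fname": "ben", "lname": "cho", "date_of_birth": "1975-07-30",
     "phone": "555-2222", "address": "5 Pine Rd"},
]
examples = make_labeled_examples(records)  # one match, zero distinct
```

Adding a comparable set of hand-labelled distinct pairs (same address, different ssn, say) would keep the classes balanced.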
@abhijeetghotra, would it be possible to share the pairwise edges (the memmapped numpy array)? That should be completely free of any user-sensitive data.
I'm using an AWS SageMaker ml.r5.4xlarge instance (16 vCPU, 128 GiB RAM). I need to dedupe 5.8 million records, but it's taking forever. Due to this scalability issue, I might need to switch to another library. I need a solution by Monday. Can anyone please help me with this?
Should I switch to the PostgreSQL approach for this? I have monitored memory usage and around 40% of RAM is always free, so will switching to PostgreSQL help in this case?
@fgregg It would be very helpful if you could comment on this.
@MokshaVora yes, please use PostgreSQL; it is quite fast since the operations run in the database. Our code is running on PostgreSQL.
But I am still working on dedupe, as the accuracy is not coming out as good as we need, and I want to understand how we can improve it. How is the block key formed? Only active learning is exposed, so I'm struggling a bit to tune the model.
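On how block keys are formed: dedupe learns a set of blocking predicates during training; each predicate maps a record to one or more keys, and only records that share a key ever become candidate pairs. A toy sketch of that mechanism (these two predicates are illustrative stand-ins, not the ones dedupe would actually learn):

```python
def first_token(value):
    """Predicate: first whitespace-separated token of a field."""
    return value.split()[0] if value else ""

def dob_year(dob):
    """Predicate: year prefix of an ISO date, e.g. '1980-02-01' -> '1980'."""
    return dob[:4]

def block(records, predicates):
    """Map each record to (predicate-name, key) buckets; only
    buckets holding 2+ records produce candidate pairs."""
    blocks = {}
    for i, rec in enumerate(records):
        for name, field, pred in predicates:
            key = (name, pred(rec[field]))
            blocks.setdefault(key, []).append(i)
    return {k: v for k, v in blocks.items() if len(v) > 1}

records = [
    {"lname": "maiti roy", "date_of_birth": "1980-02-01"},
    {"lname": "maiti", "date_of_birth": "1991-06-12"},
    {"lname": "vora", "date_of_birth": "1991-01-01"},
]
predicates = [
    ("first_token_lname", "lname", first_token),
    ("dob_year", "date_of_birth", dob_year),
]
candidate_blocks = block(records, predicates)
# records 0 and 1 share a lname token; records 1 and 2 share a birth year
```

Records that share no key under any learned predicate are never compared at all, which is one reason true duplicates can be missed when training data is thin.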
Please connect on LinkedIn; I want to understand dedupe's internal processing to improve accuracy/recall in our case. https://www.linkedin.com/in/sarbani-maiti-35b89111/
I am using the Python dedupe library for a large dataset. Initially we tested with a 3K dataset, and it ran within a few minutes on an AWS SageMaker ml.r5.4xlarge instance (16 vCPU, 128 GiB RAM). The moment we increased the dataset to 40K, it crashed on the same machine.
Q1: Is this usual? Do we need to optimize anything in the library?
Q2: We are now trying the PostgreSQL-based approach. How much memory do we need to run the example (cores/memory/training time)? We are getting a memory error with 8 cores and 32 GB RAM.
We would appreciate help with this, as we are using dedupe for our production system and need to finalize the approach. This is a great library; we just need to resolve the memory error.
Thanks, Sarbani