dedupeio / dedupe-examples

Examples for using the dedupe library

dedupe.core.BlockingError when running the sql example #75

Closed mehrrsaa closed 6 years ago

mehrrsaa commented 6 years ago

Hi,

I am running dedupe on a dataset of 2 million rows, so I modified the SQL example to be able to handle it. My approach is described below.

The reason I modified the SQL snippets is that the GTID flag is forcibly enabled on my Google Cloud VM, and the server does not accept some of the SQL in the example script. This is a sample statement I changed (in case the error is caused by my SQL messing something up):

The original:

    c.execute("CREATE TABLE plural_key "
              "(block_key VARCHAR(200), "
              " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
              " PRIMARY KEY (block_id)) "
              "(SELECT MIN(block_key) FROM "
              " (SELECT block_key, "
              " GROUP_CONCAT(donor_id ORDER BY donor_id) AS block "
              " FROM blocking_map "
              " GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
              " GROUP BY block)")

My version (CC_id is the index column in my data):

    cursor1.execute("CREATE TABLE plural_key "
                    "(block_key VARCHAR(200), "
                    " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
                    " PRIMARY KEY (block_id)) "
                    "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
    cursor1.execute("INSERT INTO plural_key "
                    "(block_key)"
                    "(SELECT MIN(block_key) FROM "
                    " (SELECT block_key, "
                    " GROUP_CONCAT(CC_id ORDER BY CC_id) AS block "
                    " FROM blocking_map "
                    " GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
                    " GROUP BY block)")
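For context: a server that enforces GTID consistency rejects CREATE TABLE ... SELECT statements, which is the pattern the original snippet uses, hence the split into a separate CREATE TABLE plus INSERT ... SELECT above. A quick way to confirm the relevant settings (these are standard MySQL system variables; cursor1 is just the cursor from the snippet above):

    cursor1.execute("SHOW VARIABLES LIKE 'gtid_mode'")
    print(cursor1.fetchone())
    cursor1.execute("SHOW VARIABLES LIKE 'enforce_gtid_consistency'")
    print(cursor1.fetchone())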

My data is exactly the data I sampled from for training: NaN values are replaced with None and the column names are the same. Also, I have successfully done the training on a sample, reloaded the settings, and run the deduplication on the full data in another, smaller example.
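Roughly, the settings reload and record preparation look like this (a minimal sketch; the file name and the clean_row helper are illustrative, not the exact code from my script):

    import dedupe

    # reload the model trained on the sample
    with open('dedupe_settings', 'rb') as sf:
        deduper = dedupe.StaticDedupe(sf)

    # each record is a dict keyed by column name; missing values become None
    def clean_row(row):
        return {k: (None if v in ('', 'NaN') else v) for k, v in row.items()}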

So I am wondering what the issue could be. Any help with tracing it would be appreciated. I can provide more info or code snippets if necessary.

Thank you.

sontata commented 6 years ago

I set charset utf8 for each field in mysql_example, and now I have the same problem.

mehrrsaa commented 6 years ago

Hi, I am getting this error whether or not I include that. Were you running it successfully before adding those encodings? If so, why did you add them?

Anyway, I think something is getting messed up when preparing the blocking table after training, and I cannot pinpoint what; I have tried a lot of things. Hopefully the author of the example can give us a hint about what the problem could be.
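In case it helps with tracing, these are the kinds of quick checks I can run against the intermediate table (plain SQL over the blocking_map schema shown above; illustrative only):

    cursor1.execute("SELECT COUNT(*) FROM blocking_map")
    print(cursor1.fetchone())
    cursor1.execute("SELECT COUNT(*) FROM "
                    "(SELECT block_key FROM blocking_map "
                    " GROUP BY block_key HAVING COUNT(*) > 1) AS multi")
    print(cursor1.fetchone())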

mehrrsaa commented 6 years ago

#66 is also still open and similar to this issue.

sontata commented 6 years ago

I had a charset problem because of Windows. Now I am running mysql_example on Linux and I get dedupe.core.BlockingError again.

sontata commented 6 years ago

Hi! Change the SQL:

    cursor1.execute("CREATE TABLE plural_key "
                    "(block_key VARCHAR(200), "
                    " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
                    " PRIMARY KEY (block_id)) "
                    "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
    cursor1.execute("INSERT INTO plural_key "
                    "(block_key)"
                    "(SELECT MIN(block_key) as block_key FROM "
                    " (SELECT block_key, "
                    " GROUP_CONCAT(CC_id ORDER BY CC_id) AS block "
                    " FROM blocking_map "
                    " GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
                    " GROUP BY block)")

mehrrsaa commented 6 years ago

Hi, so just add "AS block_key" to my SQL? Why do you think that matters for this error?

sontata commented 6 years ago

Yes, just add it. Check that column at the moment of the error; it is empty. I do not get the error any more.
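A quick sanity check along those lines (illustrative only; plural_key and block_key are the names from the snippet above):

    cursor1.execute("SELECT COUNT(*) FROM plural_key "
                    "WHERE block_key IS NULL OR block_key = ''")
    print(cursor1.fetchone())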

mehrrsaa commented 6 years ago

Unfortunately this did not solve my issue.

sontata commented 6 years ago

Hmm, I fixed it and it works now. You should test the example at https://github.com/sontata/dedupe-examples/tree/fix_example/mysql_example

mehrrsaa commented 6 years ago

Thanks, I will compare. I am closing this issue since it now seems to be a problem only on my end.

jwieder commented 6 years ago

I experienced the issue described by @mehrrsaa in a CentOS 7 environment using MariaDB 10.11, a vanilla Python 3.5.1 software collection, and the latest version of the module (the issue occurred whether I compiled the module manually from GitHub or installed it with pip). The issue occurred for me without making any modifications to the example. To get the default example and dataset to work, I had to replace the default plural_key table creation statement with the one provided by @sontata (note that CC_id below is the column name from @mehrrsaa's data; the stock example's corresponding column is donor_id):

    c.execute("CREATE TABLE plural_key "
              "(block_key VARCHAR(200), "
              " block_id INTEGER UNSIGNED AUTO_INCREMENT, "
              " PRIMARY KEY (block_id)) "
              "CHARACTER SET utf8 COLLATE utf8_unicode_ci")
    c.execute("INSERT INTO plural_key "
              "(block_key)"
              "(SELECT MIN(block_key) as block_key FROM "
              " (SELECT block_key, "
              " GROUP_CONCAT(CC_id ORDER BY CC_id) AS block "
              " FROM blocking_map "
              " GROUP BY block_key HAVING COUNT(*) > 1) AS blocks "
              " GROUP BY block)")

I also had to create the contributions database with the correct default collation specified (this is not mentioned in the installation instructions or the README):

    CREATE DATABASE contributions CHARACTER SET utf8 COLLATE utf8_unicode_ci;

The statement above specifies utf8_unicode_ci; however, the SQL example creates two more tables without specifying a collation. Making these two changes allowed me to resolve my outstanding issues.
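For illustration, the same explicit character set and collation can be appended to the other CREATE TABLE statements in the script; entity_map is shown here only as an example of the pattern (its column list is from memory of the stock example, so check your copy for the exact tables and schemas):

    c.execute("CREATE TABLE entity_map "
              "(donor_id INTEGER, canon_id INTEGER, "
              " cluster_score FLOAT, PRIMARY KEY(donor_id)) "
              "CHARACTER SET utf8 COLLATE utf8_unicode_ci")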

It was pointed out to me that my version of MariaDB enables innodb_strict_mode by default (this began in 10.2.2), which is possibly why I ran into casting-related issues with the vanilla example that others did not. Disabling strict mode was not an option for me, but it might make things easier for others encountering this problem.
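For reference, this is how one might check the setting and relax it for the current session only, if that is acceptable in a given environment (a standard MariaDB/MySQL system variable, nothing specific to the example):

    c.execute("SHOW VARIABLES LIKE 'innodb_strict_mode'")
    print(c.fetchone())
    # per-session only; the global default is left untouched
    c.execute("SET SESSION innodb_strict_mode = OFF")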

I also installed the three optional variable libraries specified here for implementing probablepeople/usaddress-style normalization: https://docs.dedupe.io/en/latest/Variable-definition.html#optional-variables

reloaded001 commented 5 years ago

OK, I have had some success; I am not entirely sure whether this is how it is supposed to be done. I found the issue to be in this piece of code:

    clustered_dupes = deduper.matchBlocks(candidates_gen(c),
                                          threshold=0.5)

This passes the results of the query to candidates_gen(result_set). I checked what the datatype of c was, and it was <MySQLdb.cursors.SSDictCursor object at 0x000001D0DC988688>.

I then did a fetchall on c and assigned it to a variable like this:

    resultq = c.fetchall()

Then I changed the original from this:

    clustered_dupes = deduper.matchBlocks(candidates_gen(c),
                                          threshold=0.5)

to this:

    clustered_dupes = deduper.matchBlocks(candidates_gen(resultq),
                                          threshold=0.5)

and it worked, or at least it seems that it did.

So, in summary: instead of passing candidates_gen() a cursor object, I passed it a sequence of dictionaries, and I think it worked.
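If loading everything with fetchall() ever becomes a memory problem on a very large result set, a middle ground could be a small batching generator (a hypothetical helper, assuming candidates_gen() only iterates over whatever it is given):

    def rows_in_batches(cursor, batch_size=10000):
        # pull rows from the server-side cursor in chunks rather than all at once
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            for row in batch:
                yield row

    clustered_dupes = deduper.matchBlocks(candidates_gen(rows_in_batches(c)),
                                          threshold=0.5)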

Please let me know if you are able to test this.

Many thanks.