funginstitute / disambiguator

Over-consolidation of Inventor ID #3

Open laironald opened 11 years ago

laironald commented 11 years ago

I ran a simple check on the inventor names grouped under each inventor ID. Each ID gets a flag, nstr:

0 = a middle name conflict is detected
1 = a name with a middle name is matched against a name without one
2 = the names contain a middle name and the middle initials match
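A minimal Python sketch of how a check like this could look. It is an illustration under stated assumptions, not the script that produced the counts in this comment. The function names (`middle_initial`, `nstr`, `parse_record`) are hypothetical, and so are the parsing rules: the first token is taken as the given name, the last token as the surname, anything in between as the middle name, and a conflict takes priority over a missing middle name.

```python
def middle_initial(name):
    """Return the first letter of the middle name, or None if there is none.

    Assumption: first token = given name, last token = surname,
    anything in between = middle name. Multi-word surnames would be
    misread by this simple rule.
    """
    tokens = name.split()
    if len(tokens) < 3:
        return None
    return tokens[1][0]


def nstr(names):
    """Classify the names grouped under one inventor ID.

    0 = at least two different middle initials (conflict)
    1 = no conflict, but at least one name has no middle name
    2 = every name has a middle name and all initials agree
    """
    initials = [middle_initial(n) for n in names]
    present = {i for i in initials if i is not None}
    if len(present) > 1:
        return 0  # assumption: a conflict outranks a missing middle name
    if None in initials:
        return 1
    return 2


def parse_record(line):
    """Split an 'inv_id|#patents|NAME,NAME,...' record into its parts."""
    inv_id, n_patents, names = line.split("|")
    return inv_id, int(n_patents), names.split(",")


# Example with one record in the pipe-delimited form used in this issue.
inv_id, n_patents, names = parse_record(
    "03858572-2|31|JOHN F DYE,JOHN DYE,JOHN D DYE"
)
flag = nstr(names)  # F and D disagree, so this ID is flagged as a conflict
```

Comparing only the first letter keeps `HENRY J GYSLING` and `HENRY JAMES GYSLING` compatible, while `HENRY L GYSLING` still counts as a conflict.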

nstr  invs       patents    avgpats (= patents/invs)
0     8,644      176,307    20.39
1     1,611,032  6,279,199  3.89
2     1,480,296  3,955,163  2.67

For example, the nstr=0 group includes the following (first 10 entries):

(inv_id, #patents, unique names clumped together)
03858572-2|31|JOHN F DYE,JOHN DYE,JOHN D DYE
03858760-1|45|ANTONIN GONCALVES,ANTONIN L GONCALVES,ANTONIN C GONCALVES
03858787-3|19|ROGER M FLOYD,ROGER N FLOYD
03859063-2|8|STEVEN I TAUB,STEVEN L TAUB
03859092-1|42|HENRY J GYSLING,HENRY L GYSLING,HENRY JAMES GYSLING
03859097-1|4|FREDRICK L HAMB,FREDERICK L HAMB,FREDERICK T HAMB,FREDERICK D HAMB
03859113-2|18|WILLIAM C STUMPHAUZER,WILLIAM S STUMPHAUZER
03859119-1|316|JAMES C FLETCHER,JAMES ADMINISTR FLETCHER,JAMES CORVIN FLETCHER,J CLINT FLETCHER,J CLINTON FLETCHER
03859298-1|72|JOHN H SELLSTEDT,JOHN H SELLSTED,JOHN M SELLSTEDT
03859356-1|109|WILLIAM J HOULIHAN,WILLIAM H HOULIHAN

As you can see here, the first record --

While this is a relatively small share of all inventors identified (8,644 of about 3.1 million IDs), the avgpats for these individuals is extremely high compared to the others. I've run into them when building networks, and they produce some strange networks! That said, visually inspecting the data also suggests some interesting blocking mechanisms for further disambiguation, which I would love to share. I think the more we expose these results visually via APIs, the more obvious some data issues will become.

doolin commented 11 years ago

Please post the code which exposed the bug so we can reproduce it. Thanks.