funginstitute / disambiguator


Incorrect assignment of inventor IDs #2

Open doolin opened 11 years ago

doolin commented 11 years ago

I've been working with the Harvard Patent Dataverse 2010 datasets for quite a while now and have come across an issue with unique inventor identification for records whose assignee numbers start with A or H (e.g. H000000000158 for Shell Oil Company instead of the regular 10266734). The algorithm seems to assign different inventor IDs to records with such assignee numbers, even though the other fields of the record are very similar or identical to those of records listing the 'regular' assignee number.

Here's an example for one of Shell's key inventors:

HAROLD J VINEGAR BELLAIRE US 7631690 SHELL OIL COMPANY 10266734 166 04359687-1 2009
HAROLD J VINEGAR BELLAIRE US 7635023 SHELL OIL COMPANY 10266734 166 04359687-1 2009
HAROLD J VINEGAR BELLAIRE US 7635025 SHELL OIL COMPANY 10266734 166 04359687-1 2009

As you can see, these are OK: the inventor is correctly assigned Invnum 04359687-1. However, the following records each receive a different Invnum, even though the other data fields show that this is of course the same inventor:

HAROLD J VINEGAR BELLAIRE US 7640980 SHELL OIL COMPANY H000000000158 166-268/166-302/166-369/405-52 07640980-0 2010
HAROLD J VINEGAR BELLAIRE US 7735935 SHELL OIL COMPANY H000000000158 299-5/166-2721/166-302/299-4 07735935-0 2010
HAROLD J VINEGAR BELLAIRE US 7681647 SHELL OIL COMPANY H000000000158 166-302/166-369 07681647-2 2010

For larger selections of data, this leads to many missing connections, so the resulting networks appear less connected and less dense than they actually are. So far, I've manually corrected the Invnums for these records, but of course this is not the way to go for selections containing thousands of records ;-)
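In case it helps with reproducing this, here is a rough Python sketch of how affected records could be flagged automatically. It is not the disambiguator's algorithm, just a crude check: it groups records by name, city, country and assignee name, and reports every group that carries more than one Invnum. The field names (`name`, `asgnum`, `invnum`, and so on) and the helper names are my own assumptions rather than the dataset's actual column names, and the matching key will produce false positives for common names.

```python
from collections import defaultdict, namedtuple

# Hypothetical field names; the real dataset columns may be named differently.
Record = namedtuple(
    "Record", "name city country patent assignee asgnum classes invnum year"
)

# Records copied from the examples in this issue.
records = [
    Record("HAROLD J VINEGAR", "BELLAIRE", "US", "7631690", "SHELL OIL COMPANY",
           "10266734", "166", "04359687-1", 2009),
    Record("HAROLD J VINEGAR", "BELLAIRE", "US", "7635023", "SHELL OIL COMPANY",
           "10266734", "166", "04359687-1", 2009),
    Record("HAROLD J VINEGAR", "BELLAIRE", "US", "7640980", "SHELL OIL COMPANY",
           "H000000000158", "166-268/166-302/166-369/405-52", "07640980-0", 2010),
    Record("HAROLD J VINEGAR", "BELLAIRE", "US", "7681647", "SHELL OIL COMPANY",
           "H000000000158", "166-302/166-369", "07681647-2", 2010),
]


def has_letter_prefix(asgnum):
    """True for assignee numbers starting with A or H (the affected kind)."""
    return asgnum[:1] in ("A", "H")


def find_split_inventors(recs):
    """Return {(name, city, country, assignee): sorted Invnums} for every
    group of otherwise matching records that has more than one Invnum."""
    invnums = defaultdict(set)
    for r in recs:
        invnums[(r.name, r.city, r.country, r.assignee)].add(r.invnum)
    return {key: sorted(ids) for key, ids in invnums.items() if len(ids) > 1}


split = find_split_inventors(records)
for key, ids in split.items():
    print(key, ids)

# Patents whose record carries a letter-prefixed assignee number.
suspect_patents = [r.patent for r in records if has_letter_prefix(r.asgnum)]
print(suspect_patents)
```

On a full selection, the groups returned by `find_split_inventors` would only be candidates for manual review, not confirmed duplicates.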

Would it be possible to address this issue in the next release of the datasets? Please let me know if there's any other info I can provide to further clarify this issue.

Thanks,

André

doolin commented 11 years ago

Cross-listed with https://github.com/funginstitute/patentprocessor/issues/2 because it's not clear where the problem arises.