funginstitute / patentprocessor

BSD 2-Clause "Simplified" License

Handling hyphenated names #10

Closed doolin closed 11 years ago

doolin commented 11 years ago

Is KIN JOE the same person as KINJOE?

KIN JOE|SHAM||SHOREVIEW|MN|US|07783348-1 http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&f=G&l=50&d=PTXT&S1=7783348.PN.&OS=PN/7783348&RS=PN/7783348

KINJOE|SHAM|||||D0656308-0 http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&f=G&l=50&d=PTXT&S1=D656308.PN.&OS=PN/D656308&RS=PN/D656308

Both patents show Kin-Joe as inventor.

gtfierro commented 11 years ago

Could this be a problem in the downloaded ZIP files? I ran a couple tests with hyphenated strings as the inventor's first and last name set in the XML file and the parser seemed to preserve them. I'm running the tables through clean.py and consolidate.py to see if the hyphens disappear at any point in the process.

gtfierro commented 11 years ago

It struck me that this is also similar to issue #9, in that both concern hyphens disappearing.

gtfierro commented 11 years ago

The hyphen is being dropped in the clean.py script in the call to the lib/fwork.ascit function, specifically line 252. It removes everything from the string that isn't [A-Za-z0-9 ].
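To illustrate the effect described above, here is a minimal sketch of a filter that keeps only [A-Za-z0-9 ]. The function name strip_to_alnum is hypothetical and this is not the actual lib/fwork.ascit code, only a stand-in that applies the same character class.

```python
import re

# Matches every character that is NOT an ASCII letter, digit, or space.
_NOT_ALNUM_OR_SPACE = re.compile(r"[^A-Za-z0-9 ]")


def strip_to_alnum(text):
    """Hypothetical stand-in: drop everything outside [A-Za-z0-9 ]."""
    return _NOT_ALNUM_OR_SPACE.sub("", text)


if __name__ == "__main__":
    # The hyphen is deleted outright, so the two name parts are fused.
    print(strip_to_alnum("Kin-Joe"))
    # A name that already contains a space passes through unchanged.
    print(strip_to_alnum("Kin Joe"))
```

With a filter like this, a hyphenated first name and a space-separated one end up as different strings, which is the mismatch this issue is about.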

doolin commented 11 years ago

IIRC, there is more than one ascit function in the code base.

On Wed, Feb 13, 2013 at 3:45 PM, Gabe Fierro notifications@github.com wrote:

The hyphen is being dropped in the lib/fwork.ascit function (https://github.com/funginstitute/patentprocessor/blob/master/lib/fwork.py#L241-L266), specifically line 252 (https://github.com/funginstitute/patentprocessor/blob/master/lib/fwork.py#L252)


primus inter parse

gtfierro commented 11 years ago

I know there's more than one uniasc function. A grep search of the latest pull of patentprocessor reveals only one ascit function, in lib/fwork.py.

doolin commented 11 years ago

Then we have the conundrum of why one name is getting filtered through ascit, and the other, apparently not.

gtfierro commented 11 years ago

Looks like ascit gets applied to all the relevant fields: https://github.com/funginstitute/patentprocessor/blob/master/lib/fwork.py#L241-L266

My only guess is that one XML document has the hyphen and the other doesn't (which is where the space is coming from?)

doolin commented 11 years ago

We'll need to examine the actual XML documents then.

It might be faster to compute which weekly file a patent is in, then parse the particular patent out of that file.

Or maybe grep it.
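The first option, computing the weekly file, could look roughly like the sketch below. It assumes the patent's grant date is already known and that the weekly files are named ipgYYMMDD.xml after that date, a pattern inferred only from the two filenames in the ack output later in this thread. The function name weekly_grant_file is hypothetical.

```python
from datetime import date


def weekly_grant_file(grant_date):
    """Return the assumed weekly XML filename for a grant date.

    Assumption: files follow the ipgYYMMDD.xml pattern seen in this thread.
    """
    return f"ipg{grant_date:%y%m%d}.xml"


if __name__ == "__main__":
    # Dates implied by the two filenames that show up in the ack output.
    print(weekly_grant_file(date(2010, 8, 24)))
    print(weekly_grant_file(date(2012, 3, 27)))
```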


Hope you aren't doing anything important on the server; I'm grepping for patent numbers right now.

gtfierro commented 11 years ago

Here are the links to the two patents as they are on Google:

And here are the related zip files:

I found a good way of finding them w/o using the couch database; I'll script it up and push it some time today.

doolin commented 11 years ago

Yeah, it's probably easier just to extract as needed.

gtfierro commented 11 years ago

Looks like they're both hyphenated...

$ ack -i "kin-joe"
ipg100824.xml
5039107:<first-name>Kin-Joe</first-name>

ipg120327.xml
8485:<first-name>Kin-Joe</first-name>

$ ack -i "kin joe"
$

doolin commented 11 years ago

That's actually good, because it means it's a bug or a mis-design on our end, hence, fixable.

Not like citations, where the web page is updated but the xml is not.

The test for this fix needs to be committed before or with the fix.

gtfierro commented 11 years ago

I can't reproduce the error. I ran the two downloaded XML files through the preprocessor and ended up with:

sqlite> select * from invpat where (Firstname = "KINJOE" or Firstname = "KIN JOE");
KINJOE|KINJOE|SHAM||SHOREVIEW|MN|US|55126|45.080064|-93.137722|0|D0656308|2010|2010|2012|2010-11-18|ORTHOCOR MEDICAL INC|H000000000896|D3-2031||D0656308-0|D0656308-0|D0656308-0
KINJOE|KINJOE|SHAM||SHOREVIEW|MN|US|55126|45.080064|-93.137722|1|07783348|2008|2008|2010|2008-04-16|ORTHOCOR MEDICAL INC|H000000000896|607-3/602-2/600-15||07783348-1|07783348-1|07783348-1

doolin commented 11 years ago

Houston, we have a conundrum.

On Fri, Feb 15, 2013 at 1:01 PM, Gabe Fierro notifications@github.com wrote:

I can't reproduce the error...I ran the two downloaded XML files through the preprocessor and ended up with

sqlite> select * from invpat where (Firstname = "KINJOE" or Firstname = "KIN JOE");
KINJOE|KINJOE|SHAM||SHOREVIEW|MN|US|55126|45.080064|-93.137722|0|D0656308|2010|2010|2012|2010-11-18|ORTHOCOR MEDICAL INC|H000000000896|D3-2031||D0656308-0|D0656308-0|D0656308-0
KINJOE|KINJOE|SHAM||SHOREVIEW|MN|US|55126|45.080064|-93.137722|1|07783348|2008|2008|2010|2008-04-16|ORTHOCOR MEDICAL INC|H000000000896|607-3/602-2/600-15||07783348-1|07783348-1|07783348-1


gtfierro commented 11 years ago

In this email from Jill, she's looking at a file called full.sqlite3. Where does this file get generated? It might be worth regenerating that file from scratch to see if we still have the error. I spent most of yesterday looking through code and running one-off tests, and I still cannot reproduce the error where we get "KIN JOE" and "KINJOE".

Okay, it's not looking so good in full.sqlite3 in jan292013: some rows have location data and others don't (for the same inventor):

sqlite> select firstname, lastname, street, city, state, country
   ...> from invpat
   ...> where ((firstname = "KINJOE" or firstname = "KIN JOE") and lastname = "SHAM")
   ...> or (firstname = "ALEXANDER" and lastname = "GRUBER")
   ...> or (firstname = "OLIVER" and lastname = "RENELT");

Firstname|Lastname|Street|City|State|Country
OLIVER|RENELT||HAMBURG||DE
ALEXANDER|GRUBER||SANKT GEORGEN||DE
KIN JOE|SHAM||SHOREVIEW|MN|US
KINJOE|SHAM||||
OLIVER|RENELT||||
ALEXANDER|GRUBER||||
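One way to surface inventors whose rows disagree on location is sketched below. It loads the six rows shown above into an in-memory SQLite table and groups them by a name key with spaces removed. Two things are assumptions here: that collapsing spaces is an acceptable way to pair "KIN JOE" with "KINJOE", and that missing fields are stored as empty strings or NULL. The real invpat table has many more columns than this reduced one.

```python
import sqlite3

# The six rows from the query output above (reduced to the selected columns).
ROWS = [
    ("OLIVER", "RENELT", "", "HAMBURG", "", "DE"),
    ("ALEXANDER", "GRUBER", "", "SANKT GEORGEN", "", "DE"),
    ("KIN JOE", "SHAM", "", "SHOREVIEW", "MN", "US"),
    ("KINJOE", "SHAM", "", "", "", ""),
    ("OLIVER", "RENELT", "", "", "", ""),
    ("ALEXANDER", "GRUBER", "", "", "", ""),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invpat "
    "(Firstname TEXT, Lastname TEXT, Street TEXT, City TEXT, State TEXT, Country TEXT)"
)
conn.executemany("INSERT INTO invpat VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# Names that have at least one row with a city and at least one without.
QUERY = """
SELECT REPLACE(Firstname, ' ', ''), Lastname,
       SUM(COALESCE(City, '') = ''),
       SUM(COALESCE(City, '') <> '')
FROM invpat
GROUP BY REPLACE(Firstname, ' ', ''), Lastname
HAVING SUM(COALESCE(City, '') = '') > 0
   AND SUM(COALESCE(City, '') <> '') > 0
ORDER BY Lastname
"""
mixed = conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for name_key, lastname, without_city, with_city in mixed:
        print(name_key, lastname, without_city, with_city)
```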

doolin commented 11 years ago

full.sqlite is typically the name I apply to an invpat db which is ready for disambiguation. As for which full.sqlite3 she is using, I don't know.

This db does get regenerated for every new disambiguation. It's partially scripted.

gtfierro commented 11 years ago

I added a small unittest in 6ffd255 to make sure that running the ascit function with default parameters removes the hyphen in strings. I'm closing the issue on the assertion that we will be ignoring hyphens in names; having the hyphen in the names would probably help the disambiguator and is technically more "correct" in terms of parsing, but I'm not sure whether including the hyphen would break items further along in the toolchain. In fact, the ascit function takes an optional flag strict that if set to False seemingly doesn't remove periods or hyphens or other punctuation marks (e.g. those that might occur with initials). Additionally, I am unable to reproduce the original error in this issue. We can reopen this at a later date if we decide we want to handle punctuation in names differently.

Additionally, one of the KIN-JOE files will be available at test/fixtures/xml/ipg100824-hyphenated.xml
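The behaviour the unit test pins down can be sketched roughly as follows. This is not the test from 6ffd255 and not the real ascit: ascit_sketch is a hypothetical stand-in whose strict flag mirrors the description above, and the non-strict branch simply returns the input unchanged, which is an assumption.

```python
import unittest


def ascit_sketch(text, strict=True):
    """Hypothetical stand-in for lib/fwork.ascit (not the real code)."""
    if not strict:
        # Assumption: non-strict mode leaves punctuation alone.
        return text
    # Strict mode keeps only ASCII letters, digits, and spaces.
    return "".join(
        ch for ch in text if (ch.isascii() and ch.isalnum()) or ch == " "
    )


class TestHyphenHandling(unittest.TestCase):
    def test_default_removes_hyphen(self):
        self.assertEqual(ascit_sketch("Kin-Joe"), "KinJoe")

    def test_non_strict_keeps_hyphen(self):
        self.assertEqual(ascit_sketch("Kin-Joe", strict=False), "Kin-Joe")
```

Saved as a module, this can be run with python -m unittest. If hyphens are later preserved by default, the first test is the one that would need to change.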

doolin commented 11 years ago

Excellent, thanks.