Closed doolin closed 11 years ago
Could this be a problem in the downloaded ZIP files? I ran a couple tests with hyphenated strings as the inventor's first and last name set in the XML file and the parser seemed to preserve them. I'm running the tables through clean.py and consolidate.py to see if the hyphens disappear at any point in the process.
It struck me that this is also similar to issue #9, to the extent that they both concern hyphens disappearing.
The hyphen is being dropped in the clean.py script in the call to the lib/fwork.ascit function, specifically line 252. It removes everything from the string that isn't [A-Za-z0-9 ].
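To make the effect of that character filter concrete, here is a minimal sketch. The helper name `strip_non_alnum` is hypothetical and this is not the actual lib/fwork.ascit code; it only assumes the behaviour described above (drop anything outside [A-Za-z0-9 ]).

```python
import re

# Hypothetical stand-in for the filtering step described above.
# This is NOT the real lib/fwork.ascit implementation.
_NOT_ALNUM_SPACE = re.compile(r"[^A-Za-z0-9 ]")


def strip_non_alnum(s):
    """Drop every character that is not an ASCII letter, digit, or space."""
    return _NOT_ALNUM_SPACE.sub("", s)


print(strip_non_alnum("Kin-Joe"))  # KinJoe
print(strip_non_alnum("Kin Joe"))  # Kin Joe
```

Under this assumption a hyphenated name can only collapse to "KinJoe"; the filter by itself never turns a hyphen into a space.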
IIRC, there is more than one ascit function in the code base.
I know there's more than one uniasc function. A grep search of the latest pull of patentprocessor only reveals one ascit function, in lib/fwork.py
Then we have the conundrum of why one name is getting filtered through ascit, and the other, apparently not.
Looks like ascit gets applied to all the relevant fields: https://github.com/funginstitute/patentprocessor/blob/master/lib/fwork.py#L241-L266
My only guess is that one XML document has the hyphen and the other doesn't (which is where the space is coming from?)
We'll need to examine the actual XML documents then.
It might be faster to compute which weekly file a patent is in, then parse the particular patent out of that file.
Or maybe grep it.
Hope you aren't doing anything important on the server, I'm grepping for patent numbers right now.
Here are the links to the two patents as they are on Google:
And here are the related zip files:
I found a good way of finding them w/o using the couch database; I'll script it up and push it some time today.
Yeah, it's probably easier just to extract as needed.
Looks like they're both hyphenated...
$ ack -i "kin-joe"
ipg100824.xml
5039107:<first-name>Kin-Joe</first-name>
ipg120327.xml
8485:<first-name>Kin-Joe</first-name>
$ ack -i "kin joe"
$
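For anyone without ack installed, roughly the same check can be done from Python. This is only a sketch: the function name `find_lines` is made up for illustration, and it assumes the weekly XML files are already unzipped and readable as text.

```python
import re
from pathlib import Path


def find_lines(paths, pattern):
    """Yield (filename, line_number, stripped_line) for case-insensitive matches."""
    rx = re.compile(pattern, re.IGNORECASE)
    for path in paths:
        # errors="replace" keeps the scan going if a file has stray bytes.
        with open(path, encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if rx.search(line):
                    yield Path(path).name, lineno, line.strip()


# Example call (file names taken from the ack output above):
#   for hit in find_lines(["ipg100824.xml", "ipg120327.xml"], r"kin[- ]joe"):
#       print(hit)
```

The pattern `kin[- ]joe` covers both the hyphenated and the space-separated spelling in one pass.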
That's actually good, because it means it's a bug or a mis-design on our end, hence, fixable.
Not like citations, where the web page is updated but the xml is not.
The test for this fix needs to be committed before or with the fix.
I can't reproduce the error...I ran the two downloaded XML files through the preprocessor and ended up with
sqlite> select * from invpat where (Firstname = "KINJOE" or Firstname = "KIN JOE");
KINJOE|KINJOE|SHAM||SHOREVIEW|MN|US|55126|45.080064|-93.137722|0|D0656308|2010|2010|2012|2010-11-18|ORTHOCOR MEDICAL INC|H000000000896|D3-2031||D0656308-0|D0656308-0|D0656308-0
KINJOE|KINJOE|SHAM||SHOREVIEW|MN|US|55126|45.080064|-93.137722|1|07783348|2008|2008|2010|2008-04-16|ORTHOCOR MEDICAL INC|H000000000896|607-3/602-2/600-15||07783348-1|07783348-1|07783348-1
Houston, we have a conundrum.
In this email from Jill, she's looking at a file called full.sqlite3. Where does this file get generated? It might be worth it to regenerate that file from scratch and see if we still have the error. I spent most of yesterday looking through code and running one-off tests, and I still cannot reproduce the error where we get "KIN JOE" and "KINJOE".
Okay, it's not looking so good in full.sqlite3 in jan292013: some rows have location data and others don't (for the same inventor):
sqlite> select firstname, lastname, street, city, state, country
   ...> from invpat
   ...> where ((firstname = "KINJOE" or firstname = "KIN JOE") and lastname = "SHAM")
   ...> or (firstname = "ALEXANDER" and lastname = "GRUBER")
   ...> or (firstname = "OLIVER" and lastname = "RENELT");
Firstname|Lastname|Street|City|State|Country
OLIVER|RENELT||HAMBURG||DE
ALEXANDER|GRUBER||SANKT GEORGEN||DE
KIN JOE|SHAM||SHOREVIEW|MN|US
KINJOE|SHAM||||
OLIVER|RENELT||||
ALEXANDER|GRUBER||||
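A quick way to list every inventor affected by this pattern, rather than checking names one at a time, might look like the sketch below. It uses an in-memory table seeded with the six rows above, keeps only a subset of the columns, and assumes the missing location fields are empty strings rather than NULL; the real invpat schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: only the columns needed for the check.
conn.execute(
    "CREATE TABLE invpat (Firstname TEXT, Lastname TEXT, City TEXT, State TEXT, Country TEXT)"
)
rows = [
    ("OLIVER", "RENELT", "HAMBURG", "", "DE"),
    ("ALEXANDER", "GRUBER", "SANKT GEORGEN", "", "DE"),
    ("KIN JOE", "SHAM", "SHOREVIEW", "MN", "US"),
    ("KINJOE", "SHAM", "", "", ""),
    ("OLIVER", "RENELT", "", "", ""),
    ("ALEXANDER", "GRUBER", "", "", ""),
]
conn.executemany("INSERT INTO invpat VALUES (?, ?, ?, ?, ?)", rows)

# Group on the first name with spaces removed, so "KIN JOE" and "KINJOE"
# land in the same bucket, then keep groups that mix located and unlocated rows.
query = """
SELECT REPLACE(Firstname, ' ', '') AS name_key, Lastname,
       SUM(City <> '') AS with_city,
       SUM(City = '')  AS without_city
FROM invpat
GROUP BY name_key, Lastname
HAVING with_city > 0 AND without_city > 0
ORDER BY Lastname
"""
mixed = conn.execute(query).fetchall()
for row in mixed:
    print(row)
```

On the sample rows all three inventors come back, each with one located and one unlocated record.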
full.sqlite is typically the name I apply to an invpat db which is ready for disambiguation. As for which full.sqlite3 she is using, don't know.
This db does get regenerated for every new disambiguation. It's partially scripted.
I added a small unittest in 6ffd255 to make sure that running the ascit function with default parameters removes the hyphen in strings. I'm closing the issue on the assertion that we will be ignoring hyphens in names; having the hyphen in the names would probably help the disambiguator and is technically more "correct" in terms of parsing, but I'm not sure whether including the hyphen would break items further along in the toolchain. In fact, the ascit function takes an optional flag strict that if set to False seemingly doesn't remove periods or hyphens or other punctuation marks (e.g. those that might occur with initials). Additionally, I am unable to reproduce the original error in this issue. We can reopen this at a later date if we decide we want to handle punctuation in names differently.
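For readers who want to see the shape of such a test, here is a rough sketch. It is not the test from 6ffd255, and `ascit_like` is a hypothetical stand-in, not lib/fwork.ascit: the strict branch follows the [A-Za-z0-9 ] filter discussed earlier, while the non-strict branch (keep periods and hyphens as well) is only a guess at what strict=False does.

```python
import re
import unittest


def ascit_like(s, strict=True):
    """Hypothetical stand-in for ascit; the strict=False behaviour is an assumption."""
    if strict:
        # Keep only ASCII letters, digits, and spaces.
        return re.sub(r"[^A-Za-z0-9 ]", "", s)
    # Assumed non-strict behaviour: additionally keep periods and hyphens.
    return re.sub(r"[^A-Za-z0-9 .\-]", "", s)


class TestHyphenHandling(unittest.TestCase):
    def test_default_removes_hyphen(self):
        self.assertEqual(ascit_like("Kin-Joe"), "KinJoe")

    def test_non_strict_keeps_hyphen_and_period(self):
        self.assertEqual(ascit_like("Kin-Joe J.", strict=False), "Kin-Joe J.")
```

Saved as a module, this can be run with `python -m unittest`. If the hyphen policy changes later, the first test is the one that should be updated along with the fix.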
Additionally, one of the KIN-JOE files will be available at test/fixtures/xml/ipg100824-hyphenated.xml
Excellent, thanks.
Is KIN JOE the same person as KINJOE?
KIN JOE|SHAM||SHOREVIEW|MN|US|07783348-1 http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&f=G&l=50&d=PTXT&S1=7783348.PN.&OS=PN/7783348&RS=PN/7783348
KINJOE|SHAM|||||D0656308-0 http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&f=G&l=50&d=PTXT&S1=D656308.PN.&OS=PN/D656308&RS=PN/D656308
Both patents show Kin-Joe as inventor.