funginstitute / patentprocessor

BSD 2-Clause "Simplified" License

patent parsing issues (i believe it is related to sub-class) #27

Closed laironald closed 11 years ago

laironald commented 11 years ago
-rw-r--r-- 1 root     root      367210 Jul 11 19:41 ipg050308.xml
-rw-r--r-- 1 root     root      161166 Jul 11 19:42 ipg050315.xml
-rw-r--r-- 1 root     root      399011 Jul 11 19:44 ipg050322.xml
-rw-r--r-- 1 root     root      234760 Jul 11 19:46 ipg050405.xml

The files above appear to have parsing issues. Here is an error that shows up:

sqlalchemy.exc.IntegrityError: (IntegrityError) (1062, "Duplicate entry '200/2,MERHOWINDUSTRI' for key 'PRIMARY'") 'INSERT INTO subclass (id, title, text) VALUES (%s, %s, %s)' ('200/2,MERHOWINDUSTRIES-ADD.', None, None)

This is an awkward parse for many reasons. For example, '200/2,MERHOWINDUSTRIES-ADD.' doesn't look like a subclass_id key.

gtfierro commented 11 years ago

So it's an awkward parse, but the error itself is because it's a duplicate entry?

I'll look into fixing the parse. Is the error an issue here?
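
To illustrate the duplicate-entry question, here is a minimal sketch of how a repeated primary key triggers the error and how an insert could skip ids that already exist. It uses the standard-library sqlite3 module rather than the MySQL/SQLAlchemy setup from the traceback, and the helper names (make_db, plain_insert, insert_if_new) are hypothetical, not part of patentprocessor. Only the table and column names come from the INSERT statement above.

```python
import sqlite3

def make_db():
    """Create an in-memory stand-in for the subclass table (columns from the INSERT above)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE subclass (id TEXT PRIMARY KEY, title TEXT, text TEXT)")
    return conn

def plain_insert(conn, subclass_id):
    """Unconditional insert: raises sqlite3.IntegrityError if the id already exists."""
    conn.execute(
        "INSERT INTO subclass (id, title, text) VALUES (?, ?, ?)",
        (subclass_id, None, None),
    )

def insert_if_new(conn, subclass_id):
    """Insert only when the id is not present yet; return True if a row was added."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO subclass (id, title, text) VALUES (?, ?, ?)",
        (subclass_id, None, None),
    )
    return cur.rowcount == 1
```

Skipping duplicates like this would only silence the IntegrityError. The malformed id would still be stored once, so the parse itself still needs fixing.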

gtfierro commented 11 years ago

This seems like a prevalent issue:

sqlite> select * from class where SubClass like '%MER%' limit 10;
6863030|0|200|2,MERHOWINDUSTRIES-ADD.
6863031|0|200|2,MERHOWINDUSTRIES-ADD.
6863032|0|200|2,MERHOWINDUSTRIES-ADD.
6863033|0|200|2,MERHOWINDUSTRIES-ADD.
6863034|0|200|2,MERHOWINDUSTRIES-ADD.
6863035|0|200|2,MERHOWINDUSTRIES-ADD.
6863036|0|200|2,MERHOWINDUSTRIES-ADD.
6863037|0|200|2,MERHOWINDUSTRIES-ADD.
6863038|0|200|2,MERHOWINDUSTRIES-ADD.
6863039|0|200|2,MERHOWINDUSTRIES-ADD.
sqlite> select count(*) from class where SubClass like '%MER%';
5213
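
A rough way to find every malformed row, not only the ones containing 'MER', would be to test each SubClass value against a pattern. This is a sketch under assumptions: the regular expression is a loose heuristic (digits, optional decimal part, optional letter suffix) and does not cover every real subclass form, and the Patent and Class column names are guesses. Only the table name class and the SubClass column appear in the query above.

```python
import re
import sqlite3

# Heuristic only (an assumption): "2", "61.45", "61.45R" pass; anything else is flagged.
SUBCLASS_RE = re.compile(r"\d{1,3}(\.\d{1,3})?[A-Z]{0,2}")

def looks_like_subclass(value):
    """Return True if the whole string matches the heuristic subclass pattern."""
    return SUBCLASS_RE.fullmatch(value) is not None

def suspicious_rows(conn):
    """Return (Patent, Class, SubClass) rows whose SubClass fails the heuristic."""
    # Column names Patent and Class are hypothetical; adjust to the real schema.
    rows = conn.execute("SELECT Patent, Class, SubClass FROM class").fetchall()
    return [row for row in rows if not looks_like_subclass(row[2])]

def make_demo_db():
    """Build a tiny in-memory table with one bad row from the output above and one plausible row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE class (Patent TEXT, Class TEXT, SubClass TEXT)")
    conn.executemany(
        "INSERT INTO class VALUES (?, ?, ?)",
        [("6863030", "200", "2,MERHOWINDUSTRIES-ADD."), ("6863031", "200", "61.45R")],
    )
    return conn
```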

gtfierro commented 11 years ago

Based on the strings, it looks like some of the <othercit> tag contents are being pulled into classes. Still looking into it.

laironald commented 11 years ago

cool. yeah that other file is suspiciously wrong i would say. :D

gtfierro commented 11 years ago

When the parser asks for a tag nested inside another tag, it's getting some stuff from <othercit>, though I'm not sure why. This might be a bug in the XML driver itself.
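
One way to guard against text leaking in from unrelated elements is to collect character data only while the full parent/child path matches. The sketch below does this with the standard-library xml.sax module. It is not the project's actual parser, and the tag names classification-national and main-classification are assumptions about the grant XML layout. Only othercit is named in this thread.

```python
import xml.sax

class ClassificationHandler(xml.sax.ContentHandler):
    """Collect text only from <main-classification> directly inside <classification-national>."""

    # Assumed tag names; the thread does not say which tags the parser asks for.
    TARGET = ("classification-national", "main-classification")

    def __init__(self):
        super().__init__()
        self.path = []             # stack of currently open element names
        self.buffer = []           # text pieces for the element being collected
        self.classifications = []  # finished values

    def _in_target(self):
        return tuple(self.path[-2:]) == self.TARGET

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_target():
            self.buffer = []       # start a fresh buffer for each target element

    def characters(self, content):
        # Text from any other element (for example <othercit>) is ignored here.
        if self._in_target():
            self.buffer.append(content)

    def endElement(self, name):
        if self._in_target():
            self.classifications.append("".join(self.buffer).strip())
        self.path.pop()

def parse_classifications(xml_text):
    """Parse an XML string and return the collected classification strings."""
    handler = ClassificationHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.classifications
```

If a path-scoped handler like this still returned citation text, that would point toward the XML driver. If it did not, the lookup in the parser is probably too broad.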