PatentsView / PatentsView-DB

Parsing of cited patent numbers is incorrect for design patents #3

Closed crew102 closed 7 years ago

crew102 commented 7 years ago

Greetings -

Some of the cited patent numbers are incorrect in the PatentsView bulk data files (and also incorrect in the data returned by the PatentsView API, I assume). For example, here is a line from uspatentcitation.tsv:

uuid patent_id citation_id date name kind country category sequence
00038yk20qeck77def7hdih38 8833823 668832 2012-10-01 Auf der Maur S US cited by applicant 63

When we look up the corresponding patent on uspto.gov, we see that citation_id should be "D668832", not "668832" (i.e., the leading "D" is getting trimmed). The raw XML from ipgb20140916_wk37.zip confirms that the number should start with a "D":

<patcit num="00064">
<document-id>
<country>US</country>
<doc-number>D668832</doc-number>
<kind>S</kind>
<name>Auf der Maur</name>
<date>20121000</date>
</document-id>
</patcit>
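
For anyone who wants to check this themselves, here is a minimal sketch that reads the doc-number and kind out of the fragment above using Python's standard library. The function name read_citation is only illustrative and is not taken from the PatentsView parsers.

```python
import xml.etree.ElementTree as ET

# The <patcit> fragment quoted above, copied verbatim.
PATCIT_XML = """<patcit num="00064">
<document-id>
<country>US</country>
<doc-number>D668832</doc-number>
<kind>S</kind>
<name>Auf der Maur</name>
<date>20121000</date>
</document-id>
</patcit>"""


def read_citation(patcit_xml):
    """Return country, doc number, and kind from one <patcit> element.

    Hypothetical helper for illustration; not the repository's code.
    """
    root = ET.fromstring(patcit_xml)
    doc = root.find("document-id")
    return {
        "country": doc.findtext("country"),
        "doc_number": doc.findtext("doc-number"),
        "kind": doc.findtext("kind"),
    }


citation = read_citation(PATCIT_XML)
print(citation["doc_number"], citation["kind"])
```

The doc-number comes back with its leading "D" intact, and kind "S" is the USPTO kind code for a design patent, so the prefix is lost somewhere after the XML is read.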

The issue occurs in the code linked below. The parser assumes that any patent number starting with an uppercase letter is a patent application. A number like "D668832", treated as an application, falls through to the else statement, where only its digit characters are kept. https://github.com/CSSIP-AIR/PatentsView-DB/blob/feab6b9ee827aba952a9d0cde8658897528a3d34/Scripts/Raw_Data_Parsers/uspto_parsers/parser_2005_new_fields_g.py#L626-L639
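
One possible direction for a fix, as a rough sketch only: keep the letter prefix when it marks a granted non-utility patent, and fall back to the digits-only behavior otherwise. The function name normalize_citation_number and the prefix list are my assumptions, not the logic in parser_2005_new_fields_g.py, and the list would need checking against the prefixes that actually appear in the raw data.

```python
import re

# Assumed prefixes for granted patents that are not utility patents
# (design, plant, reissue). Hypothetical list; verify before relying on it.
GRANT_PREFIXES = ("D", "PP", "RE")


def normalize_citation_number(doc_number):
    """Keep a known grant prefix; otherwise keep only the digits.

    Illustrative helper, not the repository's implementation.
    """
    cleaned = doc_number.strip().upper()
    match = re.fullmatch(r"([A-Z]+)(\d+)", cleaned)
    if match and match.group(1) in GRANT_PREFIXES:
        # Design/plant/reissue number: the prefix is part of the identifier.
        return match.group(1) + match.group(2)
    # Fallback mirrors the digits-only behavior described above.
    return "".join(ch for ch in cleaned if ch.isdigit())


print(normalize_citation_number("D668832"))
print(normalize_citation_number("8833823"))
```

With this approach "D668832" keeps its prefix, while a plain utility number such as "8833823" passes through unchanged.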

Everst commented 7 years ago

Ah, that might be a bug indeed. We will look into that. Thanks Chris!

On Sep 7, 2017, at 9:07 PM, Chris Baker notifications@github.com wrote:
