funginstitute / patentprocessor

BSD 2-Clause "Simplified" License

Parsing Classes (sequence and values) #13

Open laironald opened 11 years ago

laironald commented 11 years ago

does the parsing of the classes strip away the "." and other punctuation? when i compare patent # 8087209 (ipg120103) with the USPTO equivalent, I see differences.

The parser returns [[u'52', u'7168'], [u'52', u'7161'], [u'52', u'463'], [u'52', u'464'], [u'52', u'2881']]

On the USPTO site, I see 52/716.8. Also, do we know why we see things in this order? The order differs from what is on the USPTO website.

gtfierro commented 11 years ago

Is this related to #8?

laironald commented 11 years ago

not related


gtfierro commented 11 years ago

Looking in the XML for ipg120103, I can see that the main-classification tag that indicates the US class is the following:

<main-classification> 527168</main-classification> 

According to the current USPTO XML schema 4.2, the first 3 characters are the class and the remaining characters are the subclass. Counting the leading space, this gives us class = 52 and subclass = 7168. Again according to the USPTO XML schema 4.2, the first 3 digits of the subclass are to the left of the decimal point, giving us subclass = 716.8, so it's definitely possible to parse out 52/716.8. I believe the other classes are from the US classifications of the cited patents.

Should we extract the classes in this way?
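As a minimal sketch of what that could look like, applied to the pairs the parser returns today: the helper name `add_subclass_decimal` is made up for illustration and is not part of patentprocessor. It assumes the whole part of every subclass fills all three positions, which holds for the five values above but not in general.

```python
def add_subclass_decimal(subclass):
    """Insert the assumed decimal point after the first three characters.

    Hypothetical helper: only valid when the subclass has a full
    three-character whole part (no leading spaces were stripped).
    """
    if len(subclass) > 3:
        return subclass[:3] + "." + subclass[3:]
    return subclass


# Pairs currently returned by the parser for patent 8087209.
parsed = [[u'52', u'7168'], [u'52', u'7161'], [u'52', u'463'],
          [u'52', u'464'], [u'52', u'2881']]

formatted = [cls + "/" + add_subclass_decimal(sub) for cls, sub in parsed]
print(formatted)
# ['52/716.8', '52/716.1', '52/463', '52/464', '52/288.1']
```

The first entry comes out as 52/716.8, which matches what the USPTO site shows.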

laironald commented 11 years ago

gotcha. so if we see >3 digits in the subclass, then we can add a period. what happens in other cases?

gtfierro commented 11 years ago

From the documentation

Table 6 - U.S. Patent Classifications

Class – A 3-position alphanumeric field right justified with leading spaces.

  • Design Patents – The first position will contain a “D”. Positions 2 and 3, right justified, with a leading space when required for a single digit class.
  • Plant Patents – Positions 1-3 will contain a “PLT”
  • All Other Patents – Three alphanumeric positions, right justified, with leading spaces

Sub-Class – Three alphanumeric positions, right justified with leading spaces, and, if present, one to three positions to the right of the decimal point (assumed decimal in the Red Book XML), left justified.

Note: An unstructured US classification would identify a sub-class as a range with the sub-class range being separated by a hyphen “-”. A digest entry as a sub-class would appear as follows: Three positions containing “DIG”, followed by one to three alphanumeric positions, left justified.
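To make the cases in that excerpt concrete, here is a rough sketch that only labels which case a raw string appears to fall under. The function name `classify_entry` and all sample strings are invented for illustration. It also assumes that a leading "D" always means a design class, which the excerpt does not strictly guarantee.

```python
def classify_entry(raw):
    """Label a raw classification string with the Table 6 case it seems to match.

    Hypothetical helper; `raw` must be the unstripped field contents.
    """
    cls, sub = raw[:3], raw[3:]
    if cls == "PLT":
        kind = "plant"
    elif cls.startswith("D"):
        kind = "design"  # assumption: only design classes start with "D"
    else:
        kind = "other"
    return {
        "kind": kind,
        "class": cls,
        "subclass": sub,
        "is_digest": sub.startswith("DIG"),  # digest entry as a sub-class
        "is_range": "-" in sub,              # unstructured sub-class range
    }


# Made-up sample strings, one per case.
for sample in ["D 7300", "PLT263", " 52DIG1", " 52100-105", " 527168"]:
    print(repr(sample), classify_entry(sample)["kind"])
```

How each case should then be rendered (for example, whether a digest gets a period) is a separate decision that this sketch does not make.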

laironald commented 11 years ago

right this stuff is so confusing. it's like creating structure within a small field because the peeps at that USPTO team didn't want to think about creating new tags. i think we can definitely add value by applying those rules as most people wouldn't bother with this... what do you think? (i know it's painful)


doolin commented 11 years ago

BNF -> PEG (if possible) -> Test drive.

http://fdik.org/pyPEG/

gtfierro commented 11 years ago

I don't think it's really that complicated; we just have to decide how we want to transform the strings.

The basic form is <class>/<sub-class>.<more-sub-class>. This is simple enough for Design Patents. For Plant patents, the first 3 characters are PLT, which seems to function as a class (http://www.uspto.gov/web/offices/ac/ido/oeip/taf/def/plt.htm).

I think if we break up the class strings as:

  • class: string[:3]
  • subclass: string[3:6]
  • moresubclass: string[6:]

and don't strip the spaces, we should be fine.

laironald commented 11 years ago

hey gabe. what does this data look like in DVN? to whatever extent we might want to match that, so it's compatible.

gtfierro commented 11 years ago

From what I can see, all the rows in /data/patentdata/DVNFIXED/class.sqlite3 look like

Patent | Prim | Class | Subclass
03930270 | 1 | 360 | 130.24 

so because the current code doesn't handle the subclass decimals, if we handle that, then we should be great in terms of backwards compatibility.
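A minimal sketch of getting from the raw field to a DVN-style Class/Subclass pair, using the string[:3] / string[3:6] / string[6:] offsets proposed earlier in the thread. The function name `split_fixed_width` is hypothetical, and the raw strings below are assumed encodings rather than values copied from the XML or from class.sqlite3.

```python
def split_fixed_width(raw):
    """Slice at the fixed offsets first, and strip spaces only afterwards.

    Hypothetical helper; returns a (class, subclass) pair.
    """
    cls, sub, more = raw[:3], raw[3:6], raw[6:]
    subclass = sub.strip()
    if more.strip():
        # The decimal point is assumed in the XML, so add it back here.
        subclass += "." + more.strip()
    return cls.strip(), subclass


# Assumed raw encoding of 360/130.24, the row shown above.
print(split_fixed_width("36013024"))         # ('360', '130.24')

# Assumed raw encoding of 52/12.5: the spaces carry the field boundaries.
print(split_fixed_width(" 52 125"))          # ('52', '12.5')

# Stripping before slicing shifts the offsets and gives a different answer.
print(split_fixed_width(" 52 125".strip()))  # ('52', '125')
```

The last call is why the spaces should stay in place until after slicing: once they are gone, 12.5 and 125 can no longer be told apart.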