USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

46mb single patent file #14

Closed patricknee closed 7 years ago

patricknee commented 7 years ago

After being written into an individual JSON file by TransformerCli, this file is 42mb (other files are ~100k). It is filled with the following section of the JSON file (trimmed).

The Google copy of the patent looks "normal", without this type of data. Is this an error?

Year: 2010 file: ipg100105.zip US7640662B2.json

```
 {
            "type":"main",
            "raw":"2989003-890054",
            "normalized":"029/100000000,100001000,100002000,100003000,100004000,100005000,100006000,100007000,100008000,100009000,100010000,100011000,100012000,100013000,100014000,100015000,100016000,100017000,100018000,100019000,100020000,100021000,100022000,100023000,100024000,100025000,100026000,100027000,100028000,100029000,100030000,10003100....trimmed......00,995450000,995460000,995470000,995480000,995490000,995500000,995510000,995520000,995530000,995540000,995550000,995560000,995570000,995580000,995590000,995600000,995610000,995620000,asdf",
            "facets":[
                "1/029/029287426000",
                "1/029/029267810000",
                "1/029/029148404000",
                "1/029/029484454000",
                "1/029/029365048000",
                "1/029/029345432000",

....trimmed....
```

bgfeldm commented 7 years ago

I know what this is: when there is a classification range, I expand it into every value within the number range. This has two problems: 1) not every expanded value is an actual defined classification, even though the expansion aids searching; 2) large ranges produce a very large number of entries.

I should probably limit the range to something small, and in the future come up with a better way of handling ranges. An actual lookup would be beneficial, but right now I am limiting it to what is in the document.
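
To illustrate why a wide range inflates the output, here is a minimal sketch of a naive range expansion. It is not the code in UspcClassification; the class `RangeExpansionSketch`, its method names, and the output format (six-digit subclass followed by `000`) are assumptions based only on the trimmed JSON above, where the subclass values appear to run from 100000 to 995620.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the project's actual implementation.
class RangeExpansionSketch {

    // Number of entries an inclusive range [start, end] expands to.
    static long expandedCount(int start, int end) {
        if (end < start) {
            throw new IllegalArgumentException("end must not be less than start");
        }
        return (long) end - start + 1;
    }

    // Naive expansion: one entry per integer in the inclusive range.
    // The "mainClass/subclass000" layout is assumed from the trimmed output.
    static List<String> expand(String mainClass, int start, int end) {
        List<String> out = new ArrayList<>();
        for (int sub = start; sub <= end; sub++) {
            out.add(String.format("%s/%06d000", mainClass, sub));
        }
        return out;
    }
}
```

Under these assumptions, `RangeExpansionSketch.expandedCount(100000, 995620)` returns 895621, so a single range would produce hundreds of thousands of entries, each of which is then written out again as a facet.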

bgfeldm commented 7 years ago

Looking at the number range above, I should be handling the range better.

The code that matches and expands the range is here: gov.uspto.patent.model.classification.UspcClassification, lines 233-258

I will further look into this.

bgfeldm commented 7 years ago

I have checked in code which will throw a ParseException for any range over 99.
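
A guard of this kind could look roughly like the sketch below. It is not the committed code: the class `RangeLimitSketch`, the method `checkRange`, and the constant `MAX_RANGE` are hypothetical, `java.text.ParseException` stands in for whichever ParseException the project uses, and "over 99" is interpreted here as a span (end minus start) greater than 99.

```java
import java.text.ParseException;

// Hypothetical illustration only; names and exception type are assumptions.
class RangeLimitSketch {

    // Largest span (end - start) that will still be expanded; assumed meaning of "over 99".
    static final int MAX_RANGE = 99;

    // Rejects inverted ranges and ranges wider than MAX_RANGE before any expansion happens.
    static void checkRange(int start, int end) throws ParseException {
        long span = (long) end - start;
        if (span < 0 || span > MAX_RANGE) {
            throw new ParseException(
                    "Classification range rejected: " + start + "-" + end, 0);
        }
    }
}
```

Checking the span before expanding means an oversized range fails fast instead of allocating a very large list first.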

patricknee commented 7 years ago

OK, I will do a pull and run through my spot test of the Hits of the 80s, 90s, 00s, and 10s.

patricknee commented 7 years ago

This appears resolved.