USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

Encoding used while json serializing #68

Closed aosingh closed 6 years ago

aosingh commented 6 years ago

For example, consider the following inventors. Each last name contains a hexadecimal character reference (&#xf8;) for the character ø.

<inventor sequence="004" designation="us-only">
    <addressbook>
        <last-name>N&#xf8;rskov</last-name>
        <first-name>Jens Kehlet</first-name>
        <address>
            <city>Naerum</city>
            <country>DK</country>
        </address>
    </addressbook>
</inventor>
<inventor sequence="005" designation="us-only">
    <addressbook>
        <last-name>S&#xf8;rensen</last-name>
        <first-name>Rasmus Zink</first-name>
        <address>
            <city>Vedb&#xe6;k</city>
            <country>DK</country>
        </address>
    </addressbook>
</inventor>

They get serialized to the following JSON, where each non-ASCII character (ø, æ) has been replaced by a ?:

{
            "sequence":"",
            "name":{
                "type":"person",
                "raw":"N?rskov, Jens Kehlet",
                "prefix":"",
                "firstName":"Jens Kehlet",
                "middleName":"",
                "lastName":"N?rskov",
                "suffix":"",
                "abbreviated":"N?rskov, J.",
                "synonyms":[
                ]
            },
            "address":{
                "street":"",
                "city":"Naerum",
                "state":"",
                "zipCode":"",
                "country":"DK",
                "email":"",
                "fax":"",
                "phone":""
            },
            "residency":"",
            "nationality":""
        },
        {
            "sequence":"",
            "name":{
                "type":"person",
                "raw":"S?rensen, Rasmus Zink",
                "prefix":"",
                "firstName":"Rasmus Zink",
                "middleName":"",
                "lastName":"S?rensen",
                "suffix":"",
                "abbreviated":"S?rensen, R.",
                "synonyms":[
                ]
            },
            "address":{
                "street":"",
                "city":"Vedb?k",
                "state":"",
                "zipCode":"",
                "country":"DK",
                "email":"",
                "fax":"",
                "phone":""
            },
            "residency":"",
            "nationality":""
        }
}
bgfeldm commented 6 years ago

I updated the code to force TransformerCLI output to the Unicode character set.

Also, you can force Java to default to Unicode output by setting the file.encoding system property, for example:

java -Dfile.encoding=UTF-8

Let me know if this fixes the problem.
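
For context, a ? typically appears when text is written through a charset that cannot represent the character. The following is a minimal sketch of that behaviour, assuming the output goes through a java.io.Writer; EncodingDemo and roundTrip are hypothetical names used for illustration and are not taken from PatentPublicData:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Hypothetical helper: writes text through a Writer bound to an explicit
    // charset, then decodes the resulting bytes with the same charset.
    public static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // \u00f8 is the character ø, escaped so the source file encoding does not matter.
        String lastName = "N\u00f8rskov";

        // US-ASCII cannot represent ø, so the encoder substitutes '?'.
        System.out.println(roundTrip(lastName, StandardCharsets.US_ASCII));

        // UTF-8 can represent ø, so the text survives unchanged.
        System.out.println(roundTrip(lastName, StandardCharsets.UTF_8).equals(lastName));
    }
}
```

Passing the charset explicitly to the writer keeps the output independent of the JVM default, which is the same effect the -Dfile.encoding=UTF-8 option achieves from the command line.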

aosingh commented 6 years ago

Sorry for the late feedback, but adding

java -Dfile.encoding=UTF-8

to the parser script solved the issue.
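
To confirm that the option is actually picked up by the JVM that runs the parser, one could print the default charset. This is only a sketch; DefaultCharsetCheck and defaultCharsetName are hypothetical names and not part of the project:

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    // Returns the name of the charset the JVM uses when none is given explicitly.
    public static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```

Run as java -Dfile.encoding=UTF-8 DefaultCharsetCheck, this should report UTF-8 for the default charset.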