lucboruta commented 6 years ago

Brace codes

I noticed the use of "brace codes" in XML data, e.g. {acute over (e)} instead of é. They are used to encode non-ASCII characters, mostly diacritics and mathematical symbols, and they are visible in both PatFT/AppFT (e.g. here) and Google Patents (e.g. here and there).

There's a significant amount of variation in the way the codes are spelled out (insertions, deletions, or substitutions, maybe OCR artifacts?), e.g. {circumflex over (e)} occurring as {circumflexiover (e)} or {circumflexioveri(e)}, or {square root over (n)} occurring as {squaruaroot over (n)}. I've had good results using Damerau-Levenshtein distance to match codes to their canonical form.
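As an illustration of that kind of matching (a minimal sketch, not the code from the fork), the snippet below uses the optimal string alignment distance, a restricted variant of Damerau-Levenshtein, to map a misspelled operator back to a canonical one. The list of canonical operators, the regular expression, and the threshold of 2 are assumptions chosen to fit the examples above.

```python
import re


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Assumption: a tiny canonical list built from the examples in this thread.
CANONICAL_OPERATORS = ["acute over", "circumflex over", "square root over"]

# Assumption: a brace code is "{<operator>(<argument>)}".
BRACE_CODE = re.compile(r"\{([^{}()]*)\(([^{}()]*)\)\}")


def canonicalize(code: str, max_distance: int = 2) -> str:
    """Return the code with its operator replaced by the closest canonical one."""
    match = BRACE_CODE.fullmatch(code)
    if match is None:
        return code
    operator, argument = match.group(1).strip(), match.group(2)
    best = min(CANONICAL_OPERATORS, key=lambda c: osa_distance(operator, c))
    if osa_distance(operator, best) > max_distance:
        return code  # too far from anything known: leave untouched
    return "{%s (%s)}" % (best, argument)


print(canonicalize("{circumflexioveri(e)}"))
print(canonicalize("{squaruaroot over (n)}"))
```

A low threshold keeps the matcher conservative: codes that are not close to any known operator are returned unchanged rather than guessed at.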

Dot codes

Green Book documents use "dot codes", e.g. .alpha. to encode α. @bgfeldm developed gov.uspto.patent.doc.greenbook.DotCodes, but the class isn't referenced in the latest version of the parsers/readers.

Integration

I've forked the repo in https://github.com/thunken/PatentPublicData to implement these changes myself, making sure the changes are backward compatible (i.e. normalization is disabled by default). Dot code normalization is already done thanks to Brian's class, and I've started moving our code for brace codes to this codebase. I'll open a PR shortly.

bgfeldm commented 6 years ago

I just want to note that in the legal space of patents, it is sometimes best to keep things just as they are written. Patent examiners and lawyers only really trust the original application image, which has to remain pixel-for-pixel perfect (even scaling is discouraged), and the text version needs to be as close to it as possible.

Dot codes

Dot code conversion has a small potential for changing text that is not a dot code. Dot codes also largely represent mathematical symbols, which currently offer little improvement for search systems. We could disable conversion of any dot codes that are likely to be confused with ordinary text, such as ".En.".
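A rough sketch of what such selective conversion could look like, written in Python for brevity rather than as a description of the existing DotCodes class. Only the .alpha. mapping and the ".En." example come from this thread; the other table entries, the regular expression, and the function name are assumptions.

```python
import re

# Assumption: a small illustrative table. Only .alpha. appears in this thread;
# .beta. and .mu. are assumed to follow the same naming pattern.
DOT_CODES = {"alpha": "α", "beta": "β", "mu": "μ"}

# Codes that are never converted because they are easily confused with
# ordinary text. ".En." is the example given above.
DISABLED = frozenset({"En"})

# Assumption: a dot code is a run of ASCII letters between two periods.
DOT_CODE = re.compile(r"\.([A-Za-z]+)\.")


def normalize_dot_codes(text, table=DOT_CODES, disabled=DISABLED):
    """Replace known dot codes, leaving disabled and unknown ones untouched."""
    def replace(match):
        name = match.group(1)
        if name in disabled or name not in table:
            return match.group(0)
        return table[name]
    return DOT_CODE.sub(replace, text)


print(normalize_dot_codes("an .alpha.-particle, e.g. as in .En. 5"))
```

Anything that merely looks like a dot code, such as the ".g." inside "e.g.", is left alone because it is not in the table.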

Brace Codes

I quickly created a class to replace brace codes with their Unicode equivalents. The largest problem, and also the main reason brace codes exist, is that some characters are not represented in Unicode. Brace codes within mathematical equations are the hardest to process, since they are often used for constants and variables and many have no Unicode equivalent. Unicode largely covers the characters used in human languages, not the full range of characters used in mathematics. It may therefore be useful to convert brace codes only in names of people and companies and in titles of non-patent literature. The remaining question is why, if those characters are highly likely to have a Unicode equivalent, the applicant did not use the Unicode character when filing fields such as the inventor name. But since brace codes occur frequently enough in the data, conversion is useful.
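To make the idea concrete, here is a minimal Python sketch (not the class mentioned above) that converts "X over (c)" brace codes by appending a combining mark and composing the result with NFC normalization. The acute and circumflex codes appear in this thread; the grave and tilde entries, the regular expression, and the function name are assumptions. Codes with no entry in the table, such as {square root over (n)}, are deliberately left as they are.

```python
import re
import unicodedata

# Diacritic name -> Unicode combining mark.
COMBINING_MARKS = {
    "acute": "\u0301",       # mentioned in this thread
    "circumflex": "\u0302",  # mentioned in this thread
    "grave": "\u0300",       # assumption
    "tilde": "\u0303",       # assumption
}

# Assumption: codes have the shape "{<name> over (<base>)}".
BRACE_CODE = re.compile(r"\{([a-z ]+?) over \(([^()]*)\)\}")


def brace_codes_to_unicode(text):
    """Convert known diacritic brace codes, leaving everything else untouched."""
    def replace(match):
        mark = COMBINING_MARKS.get(match.group(1))
        base = match.group(2)
        if mark is None or len(base) != 1:
            return match.group(0)  # unknown code or multi-character base
        # NFC yields the precomposed character when Unicode defines one.
        return unicodedata.normalize("NFC", base + mark)
    return BRACE_CODE.sub(replace, text)


print(brace_codes_to_unicode("Ren{acute over (e)} and {square root over (n)}"))
```

When Unicode has no precomposed character for a base and mark pair, the result stays a base letter followed by a combining mark, which is still valid Unicode but may render differently across fonts.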

bgfeldm commented 6 years ago

I came across the following documentation, which discusses how to use accents with mathematics and Unicode: https://unicode.org/reports/tr25/

lucboruta commented 6 years ago

> I just want to note that in the legal space of Patents sometimes it is best to keep things just as they are written. Patent examiners and lawyers only really trust the original application image, which has to remain pixel to pixel perfect even scaling is discouraged, and the text version needs to be as close as possible.

I agree with this point, but dot codes and brace codes are non-standard encodings, and they make it hard to cross-reference USPTO data with other sources.

So while patent examiners and lawyers rely on the original application images, discovery systems and other data mining applications need the data to be normalized, at least for named entities (persons, organizations, locations). Persistent identifiers would solve many such problems, but we need workarounds for existing data.
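One way to reconcile both needs is to make normalization opt-in and restrict it to named-entity fields. The sketch below is only an illustration in Python: the record layout, the field names, and the `normalize_record` helper are hypothetical and do not reflect the parser's actual data model.

```python
# Hypothetical field names for named entities; the real parser's model differs.
NAMED_ENTITY_FIELDS = ("inventor", "assignee")


def normalize_record(record, normalizer, enabled=False, fields=NAMED_ENTITY_FIELDS):
    """Apply `normalizer` to the selected fields only, and only when enabled.

    With enabled=False (the default) the record is returned as an unchanged
    copy, so existing consumers keep seeing the text exactly as published.
    """
    if not enabled:
        return dict(record)
    return {
        key: normalizer(value) if key in fields else value
        for key, value in record.items()
    }


# Stand-in normalizer for the example; a real one would handle all known codes.
def demo_normalizer(text):
    return text.replace("{acute over (e)}", "é")


record = {
    "inventor": "Ren{acute over (e)} Dupont",
    "claims": "a value {acute over (e)} as defined in claim 1",
}
print(normalize_record(record, demo_normalizer, enabled=True))
```

The claims text stays exactly as published, while the inventor name becomes easier to match against other sources.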

I noticed that you have started pushing code that converts brace codes into Unicode, so I will hold off on submitting a PR. I pasted the class I had written into https://gist.github.com/lucboruta/9336cfd4e2f2cfe7d5391aae9e74382d, including the list of diacritics and symbols that I had found in the XML files. I hope you will find it useful.

USPTO / PatentPublicData

Optional normalization of dot codes and brace codes #59
