USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

`mvn clean package` fails with `unmappable character` errors #100

Open sotnikov-s opened 4 years ago

sotnikov-s commented 4 years ago

Thanks for the quick fix of https://github.com/USPTO/PatentPublicData/issues/99. I pulled the new version from master and tried to rebuild the project by running `mvn clean package -DskipTests=true`, but got a number of errors like:

java/gov/uspto/patent/doc/xml/BraceCode.java:[131,35] unmappable character (0xA0) for encoding UTF-8

The full build output is attached: output.pdf. Are you encountering the same problem?

jvd10 commented 4 years ago

I think it just means that the file encoding is wrong. On macOS:

file --mime-encoding BraceCode.java
BraceCode.java: iso-8859-1

Fix with:

iconv -f iso-8859-1 -t utf-8 < BraceCode.java > BraceCode.java

jindrichmynarz commented 4 years ago

Note that the shell truncates the output file before iconv reads anything, so redirecting the output to the same path as the input results in an empty file. I'm currently using this script to fix the source code:

#!/usr/bin/env bash

set -eo pipefail
shopt -s failglob

FILES="PatentDocument/src/main/java/gov/uspto/patent/doc/xml/BraceCode.java
PatentDocument/src/main/java/gov/uspto/patent/model/NplCitation.java
PatentDocument/src/main/java/gov/uspto/patent/model/DocumentId.java
PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/DotCodes.java
PatentDocument/src/main/java/gov/uspto/patent/doc/pap/FormattedText.java
PatentDocument/src/main/java/gov/uspto/patent/doc/sgml/FormattedText.java"

TMPFILE=$(mktemp)
for FILE in ${FILES}
do
  # Convert into a temporary file first, then replace the original
  iconv -f iso-8859-1 -t utf-8 < "${FILE}" > "${TMPFILE}" &&
  mv -f "${TMPFILE}" "${FILE}"
done

# Fix incorrect type signature for constructor
# (sed -i '' is the BSD/macOS form; GNU sed takes -i without the empty argument)
sed -i '' 's/, "<XX>"//g' PatentDocument/src/main/java/gov/uspto/tm/doc/brs/TmBrs.java
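
The file list above is hardcoded, so it has to be updated whenever another mis-encoded file appears. As a rough sketch of an alternative, the functions below look for `.java` files that do not decode as UTF-8 and convert only those. The function names `is_utf8`, `to_utf8` and `convert_tree` are hypothetical, and the sketch assumes that every non-UTF-8 file is really ISO-8859-1 and that no file name contains a newline; neither assumption is checked.

```shell
#!/bin/sh
# Sketch only: assumes every file that is not valid UTF-8 is ISO-8859-1.

# is_utf8 FILE: succeeds if FILE decodes cleanly as UTF-8.
is_utf8() {
  iconv -f utf-8 -t utf-8 < "$1" > /dev/null 2>&1
}

# to_utf8 FILE: convert FILE from ISO-8859-1 to UTF-8 through a temporary
# file, so the input is never truncated before iconv has read it.
# Note: after mv, FILE carries the temporary file's permissions.
to_utf8() {
  to_utf8_tmp=$(mktemp) || return 1
  if iconv -f iso-8859-1 -t utf-8 < "$1" > "$to_utf8_tmp"; then
    mv -f "$to_utf8_tmp" "$1"
  else
    rm -f "$to_utf8_tmp"
    return 1
  fi
}

# convert_tree DIR: convert every non-UTF-8 .java file under DIR.
# Assumes file names contain no newline characters.
convert_tree() {
  find "$1" -type f -name '*.java' | while IFS= read -r f; do
    if ! is_utf8 "$f"; then
      echo "converting $f"
      to_utf8 "$f"
    fi
  done
}
```

With these definitions loaded, `convert_tree PatentDocument/src/main/java` would cover the same directory tree as the list above. Files that are already valid UTF-8 are left untouched, so running it twice should not double-encode anything.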

sotnikov-s commented 4 years ago

That was nice: the encoding error is resolved, but another one occurred:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project PatentDocument: Compilation failure: Compilation failure:
[ERROR] PatentPublicData/PatentDocument/src/main/java/gov/uspto/tm/doc/brs/TmBrs.java:[32,29] cannot find symbol
[ERROR]   symbol:   class DateUtil
[ERROR]   location: package gov.uspto.common.text
[ERROR] PatentPublicData/PatentDocument/src/main/java/gov/uspto/tm/doc/brs/TmBrs.java:[228,32] cannot find symbol
[ERROR]   symbol:   variable DateUtil
[ERROR]   location: class gov.uspto.tm.doc.brs.TmBrs

It seems that TmBrs.java imports a nonexistent class: neither DateUtil nor the toDateTimeISO method it calls appears anywhere else in the project.
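
A check like this can be repeated with a recursive grep. The sketch below is only an illustration: `symbol_defined` is a hypothetical helper, and it relies on the `-r` and `--include` options found in GNU and BSD grep rather than on anything POSIX guarantees. It looks for a class, interface or enum declaration with the given name, so a bare import does not count as a definition.

```shell
#!/bin/sh
# symbol_defined NAME DIR: succeeds if some .java file under DIR declares
# a class, interface or enum called NAME (imports and usages are ignored).
symbol_defined() {
  grep -rEq --include='*.java' \
    "(class|interface|enum)[[:space:]]+$1([^[:alnum:]_]|\$)" "$2"
}
```

For example, `symbol_defined DateUtil . || echo "DateUtil is not declared in this tree"` run from the repository root would report the situation described above.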

bgfeldm commented 4 years ago

Sorry about the missing DateUtil Java class; it's now checked in. My IDE doesn't seem to be bothered by any character encoding issues, but I will continue to look into it.

sotnikov-s commented 4 years ago

Thanks, with the above-mentioned script and the newest version the build succeeds.