Closed patricknee closed 7 years ago
Putting try/catch inside DescriptionFigures.read() allows processing to proceed when Matcher.group seems to have an index error. This does not resolve the error; it results in a figure being skipped in the figures collection.
package gov.uspto.patent.doc.greenbook.items;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.dom4j.Node;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import gov.uspto.parser.dom4j.ItemReader;
import gov.uspto.patent.model.Figure;

public class DescriptionFigures extends ItemReader<List<Figure>> {
    private static final Logger LOGGER = LoggerFactory.getLogger(DescriptionFigures.class);

    private static final Pattern PATENT_FIG = Pattern.compile("^(FIG\\.? \\(?\\d{1,3}[A-Za-z]?\\)?)\\b");
    private static final Pattern PATENT_FIGS = Pattern.compile("^(FIGS\\.? \\d{1,3}\\s?\\(?[A-Za-z]?\\)?(?:(?:\\s?\\-\\s?|, | and | to | through )\\d{0,3}\\(?[A-Za-z]?\\)?)+)\\b");

    public DescriptionFigures(Node itemNode) {
        super(itemNode);
    }

    @Override
    public List<Figure> read() {
        List<Figure> figures = new ArrayList<Figure>();

        @SuppressWarnings("unchecked")
        List<Node> paragraphNodes = itemNode.selectNodes("PAR");
        for (Node paragraphN : paragraphNodes) {
            String ptext = paragraphN.getText();

            Matcher matchFig = PATENT_FIG.matcher(ptext);
            if (matchFig.lookingAt()) {
                // Workaround: catch the failure and skip this figure instead of crashing.
                try {
                    String id = matchFig.group(1);
                    String text = paragraphN.getText().substring(matchFig.end() + 1);
                    Figure fig = new Figure(id, text);
                    figures.add(fig);
                } catch (Exception e) {
                    LOGGER.warn("Unable to Parse Patent Figure ID: '" + ptext + "'", e);
                }
            } else {
                Matcher matchFigs = PATENT_FIGS.matcher(ptext);
                if (matchFigs.lookingAt()) {
                    String id = matchFigs.group(1);
                    String text = paragraphN.getText().substring(matchFigs.end() + 1);
                    Figure fig = new Figure(id, text);
                    figures.add(fig);
                } else {
                    // String.matches() must match the whole string, so use startsWith() here.
                    if (ptext.startsWith("FIG")) {
                        LOGGER.warn("Unable to Parse Patent Figure ID: '" + paragraphN.getText() + "'");
                    }
                }
            }
        }

        return figures;
    }
}
Also put try/catch in NamePerson.getAbbreviatedName:
Also added LOGGER references to allow reporting the error.
This loses any first name information, but allows processing to proceed.
Examples showing up in the log appear to be blank text first names (firstName=" "), but I haven't looked further.
public String getAbbreviatedName() {
    if (Strings.isNullOrEmpty(firstName)) {
        try
        {
            return lastName + ", " + firstName.substring(0, 1) + '.';
        }
        catch (Exception e)
        {
            LOGGER.warn("Error parsing in getAbbreviatedName, firstName={} ", firstName, e);
            return lastName;
        }
    } else {
        return lastName;
    }
}
Checked in a fix for this, which also fixes bug #13.
If statement should have been checking if NOT null or empty. The bad logic was also causing the method to only return the last name. A unit test was also created.
Thank you for pointing this out, Brian
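For illustration, the corrected check described above (abbreviate only when the first name is NOT null or empty) could look roughly like the following. This is a minimal stand-alone sketch, not the code that was checked in: the class name AbbreviatedNameSketch and the static helper abbreviatedName are hypothetical, and treating a whitespace-only first name as missing is an added assumption based on the blank first names seen in the log.

```java
public class AbbreviatedNameSketch {

    // Hypothetical stand-alone helper; the real method lives on NamePerson
    // and reads its firstName/lastName fields instead of taking parameters.
    static String abbreviatedName(String firstName, String lastName) {
        // Assumption: null, empty, and whitespace-only first names all count as missing.
        if (firstName == null || firstName.trim().isEmpty()) {
            return lastName;
        }
        // Safe here: the trimmed first name has at least one character.
        return lastName + ", " + firstName.trim().substring(0, 1) + '.';
    }

    public static void main(String[] args) {
        System.out.println(abbreviatedName("John", "Doe")); // prints "Doe, J."
        System.out.println(abbreviatedName(" ", "Doe"));    // prints "Doe"
        System.out.println(abbreviatedName(null, "Doe"));   // prints "Doe"
    }
}
```

With the check in this direction, no try/catch is needed around substring, and first-name information is kept whenever it exists.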
With latest pull from GitHub, this crash is still happening (console output at end):
Year 2000 File: pftaps20000104_wk01.zip
Failing patent (appears to be; could be the following item): US6010912
The DescriptionFigures class in the first comment (second post) of this thread allows this file to be processed, but does discard the information of the figure that cannot be parsed.
I don't seem to be able to re-open the issue. I presume you are notified that I commented, but just to be sure... let me know you see this. Otherwise, I will open a new issue after a bit.
2016-11-11 10:50:43,283 INFO [ main] US6010912 TransformerCli - Record: 'US6010912' from /Users/patrick/usptoData/grants/2000/pftaps20000104_wk01.zip:1755
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1931)
    at gov.uspto.patent.doc.greenbook.items.DescriptionFigures.read(DescriptionFigures.java:38)
    at gov.uspto.patent.doc.greenbook.fragments.DescriptionNode.read(DescriptionNode.java:42)
    at gov.uspto.patent.doc.greenbook.Greenbook.parse(Greenbook.java:113)
    at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:37)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:67)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:167)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:277)
I fixed a StringIndexOutOfBoundsException that occurs when a paragraph consists only of the matched text, such as a paragraph containing only "FIG. 1".
I also moved the code to its own function so I could easily unit test the code.
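To illustrate the failure mode: for a paragraph that is exactly "FIG. 1", the match ends at the end of the string, so substring(matcher.end() + 1) asks for an index one past the string's length. Below is a minimal sketch of a guarded, separately testable helper. It is not the checked-in code: the class FigureTextSketch and the method parseFigure are hypothetical names, and it returns a String[] {id, text} rather than the project's Figure type so that it compiles on its own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FigureTextSketch {

    // Same single-figure pattern as in the DescriptionFigures class posted earlier in this thread.
    private static final Pattern PATENT_FIG =
            Pattern.compile("^(FIG\\.? \\(?\\d{1,3}[A-Za-z]?\\)?)\\b");

    // Hypothetical helper: returns {id, text}, or null if the paragraph
    // does not start with a single-figure reference.
    static String[] parseFigure(String ptext) {
        Matcher m = PATENT_FIG.matcher(ptext);
        if (!m.lookingAt()) {
            return null;
        }
        String id = m.group(1);
        // Skip the separator character after the ID, but only if it exists.
        int start = m.end() + 1;
        String text = start <= ptext.length() ? ptext.substring(start) : "";
        return new String[] { id, text };
    }

    public static void main(String[] args) {
        String[] only = parseFigure("FIG. 1");
        System.out.println(only[0] + " | '" + only[1] + "'");

        String[] full = parseFigure("FIG. 1 is a side view.");
        System.out.println(full[0] + " | '" + full[1] + "'");
    }
}
```

A figure with an ID but no description is kept with empty text here, rather than being dropped as in the try/catch workaround.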
Did you commit this change? It doesn't seem to be in my latest 'git checkout'...
Sorry, I seem to have gotten it with a pull.
Still getting the error on Y2000 file pftaps20000104_wk01.zip. Looking into it now.
Fixed and code has been checked in.
Splitting out earlier post with multiple error types:
Running Y2000 file pftaps20000104_wk01.zip, to limit. Error follows: