USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

Year 2000 grant: getAbbreviatedName index out of bounds error #12

Closed patricknee closed 7 years ago

patricknee commented 7 years ago

Splitting out earlier post with multiple error types:

Running the Y2000 file pftaps20000104_wk01.zip by itself, to limit the scope. The error follows:

patrick$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip" --outBulk=false --outdir="BulkDownloader/download/grants/2000/expanded/"
2016-11-08 21:50:34,345 INFO  [main] TransformerCli - --- Start ---
2016-11-08 21:50:34,379 INFO  [main] TransformerCli - Dump File[1]: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip
2016-11-08 21:50:34,382 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: Greenbook
2016-11-08 21:50:34,386 INFO  [main] ZipReader - Reading zip file: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip
2016-11-08 21:50:34,418 INFO  [main] ZipReader - Found 1 file[FileFilter [matchRules=[]]]: pftaps20000104_wk01.txt
2016-11-08 21:50:34,420 INFO  [main] PatentDocFormatDetect - PatentType fromContent: Greenbook
2016-11-08 21:50:34,599 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0205' from : <CLAS><OCL>D 2602</OCL><XCL>D2608</XCL><XCL>D2627</XCL><EDF>6</EDF><ICL>0205</ICL><FSC>D 2</FSC><FSS>600;602;608;609;624;627;639;859;891</FSS><FSC>2</FSC><FSS>236;237;232;269;338;124;125;115</FSS></CLAS>
2016-11-08 21:50:34,608 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,709 INFO  [main] TransformerCli - Record: 'US418273' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1
2016-11-08 21:50:34,779 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0207' from : <CLAS><OCL>D 2639</OCL><EDF>6</EDF><ICL>0207</ICL><FSC>D11</FSC><FSS>200-201;261;55-66</FSS><FSC>D 2</FSC><FSS>639</FSS><FSC>297</FSC><FSS>482</FSS><FSC>24</FSC><FSS>633</FSS><FSC>D21</FSC><FSS>604</FSS><FSC>428</FSC><FSS>100</FSS></CLAS>
2016-11-08 21:50:34,782 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,784 INFO  [main] TransformerCli - Record: 'US418275' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:2
2016-11-08 21:50:34,810 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0202' from : <CLAS><OCL>D 2743</OCL><EDF>6</EDF><ICL>0202</ICL><FSC>D 2</FSC><FSS>728;743;746;753;830;838</FSS><FSC>D 5</FSC><FSS>62</FSS><FSC>2</FSC><FSS>69;94;1;40;227</FSS></CLAS>
2016-11-08 21:50:34,813 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,818 INFO  [main] TransformerCli - Record: 'US418277' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:3
2016-11-08 21:50:34,841 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0203' from : <CLAS><OCL>D 2882</OCL><EDF>6</EDF><ICL>0203</ICL><FSC>D 2</FSC><FSS>865;866;872;876;879;882;883;884;886;887</FSS><FSC>D29</FSC><FSS>102;104</FSS><FSC>2</FSC><FSS>171;181;195.1;412;419;209.13</FSS></CLAS>
2016-11-08 21:50:34,850 WARN  [main] AbstractTextNode - Patent does not have an Abstract
(Trimmed)
2016-11-08 21:50:58,508 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G02F  11335' from : <CLAS><OCL>349113</OCL><XCL>349139</XCL><EDF>6</EDF><ICL>G02F  11335</ICL><ICL>G02F  11343</ICL><FSC>349</FSC><FSS>113;149;151;152;139</FSS></CLAS>
2016-11-08 21:50:58,509 INFO  [main] TransformerCli - Record: 'US6011605' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1224
2016-11-08 21:50:58,521 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G02F  11339' from : <CLAS><OCL>349153</OCL><XCL>349151</XCL><EDF>6</EDF><ICL>G02F  11339</ICL><FSC>349</FSC><FSS>149;151;153</FSS></CLAS>
2016-11-08 21:50:58,522 INFO  [main] TransformerCli - Record: 'US6011607' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1225
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 1
    at java.lang.String.substring(String.java:1963)
    at gov.uspto.patent.model.entity.NamePerson.getAbbreviatedName(NamePerson.java:75)
    at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:370)
    at gov.uspto.patent.serialize.JsonMapper.mapAssignees(JsonMapper.java:319)
    at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
    at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:68)
    at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:56)
    at gov.uspto.patent.TransformerCli.write(TransformerCli.java:203)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:188)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:276)
patricknee commented 7 years ago

Putting try/catch inside DescriptionFigures.read() allows processing to proceed when Matcher.group seems to have an index error. This does not resolve the error; it results in a figure being skipped in the figures collection.

package gov.uspto.patent.doc.greenbook.items;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.dom4j.Node;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import gov.uspto.parser.dom4j.ItemReader;
import gov.uspto.patent.model.Figure;

public class DescriptionFigures extends ItemReader<List<Figure>>{
    private static final Logger LOGGER = LoggerFactory.getLogger(DescriptionFigures.class);

    private static final Pattern PATENT_FIG = Pattern.compile("^(FIG\\.? \\(?\\d{1,3}[A-Za-z]?\\)?)\\b");
    private static final Pattern PATENT_FIGS = Pattern.compile("^(FIGS\\.? \\d{1,3}\\s?\\(?[A-Za-z]?\\)?(?:(?:\\s?\\-\\s?|, | and | to | through )\\d{0,3}\\(?[A-Za-z]?\\)?)+)\\b");

    public DescriptionFigures(Node itemNode) {
        super(itemNode);
    }

    @Override
    public List<Figure> read() {
        List<Figure> figures = new ArrayList<Figure>();

        @SuppressWarnings("unchecked")
        List<Node> paragraphNodes = itemNode.selectNodes("PAR");

        for(Node paragraphN : paragraphNodes){
            String ptext = paragraphN.getText();

            Matcher matchFig = PATENT_FIG.matcher(ptext);
            if (matchFig.lookingAt()){
                try {
                    String id = matchFig.group(1);
                    String text = paragraphN.getText().substring(matchFig.end()+1);
                    Figure fig = new Figure(id, text);
                    figures.add(fig);
                }
                catch(Exception e) {
                    LOGGER.warn("Unable to Parse Patent Figure ID: '" + ptext, e);
                }
            } else {
                Matcher matchFigs = PATENT_FIGS.matcher(ptext);
                if (matchFigs.lookingAt()){
                    String id = matchFigs.group(1);
                    String text = paragraphN.getText().substring(matchFigs.end()+1);
                    Figure fig = new Figure(id, text);
                    figures.add(fig);
                } else {
                    if (ptext.matches("^FIG")){
                        LOGGER.warn("Unable to Parse Patent Figure ID: '" + paragraphN.getText());
                    }
                }
            }
        }

        return figures;
    }
}
patricknee commented 7 years ago

I also put a try/catch in NamePerson.getAbbreviatedName, and added LOGGER references so the error can be reported.

This loses any first-name information, but allows processing to proceed.

The examples showing up in the log appear to be blank first names (firstName=" "), but I haven't looked further. The modified method:

    public String getAbbreviatedName() {
        if (Strings.isNullOrEmpty(firstName)) {
            try
            {
                return lastName + ", " + firstName.substring(0, 1) + '.';
            }
            catch (Exception e)
            {
                LOGGER.warn("Error parsing in getAbbreviatedName, firstName={} ", firstName, e);
                return lastName;
            }
        } else {
            return lastName;
        }
    }
bgfeldm commented 7 years ago

Checked in a fix for this, which also fixes bug #13.

The if statement should have been checking that the first name is NOT null or empty. The bad logic was also causing the method to return only the last name. A unit test was also created.

Thank you for pointing this out.

Brian
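A minimal sketch of the corrected check described above. This is not the committed code: the class name `AbbreviatedNameSketch` and the static `abbreviate` helper are hypothetical stand-ins for `NamePerson.getAbbreviatedName()`, and treating whitespace-only first names as missing is an added assumption based on the `firstName=" "` values reported earlier in the thread.

```java
// Hypothetical stand-in for NamePerson.getAbbreviatedName(); not the project's actual code.
final class AbbreviatedNameSketch {

    static String abbreviate(String lastName, String firstName) {
        // Assumption: null, empty, and whitespace-only first names all count as missing.
        if (firstName == null || firstName.trim().isEmpty()) {
            return lastName;
        }
        // Only reached when at least one non-blank character exists,
        // so substring(0, 1) cannot go out of range.
        return lastName + ", " + firstName.trim().substring(0, 1) + '.';
    }

    public static void main(String[] args) {
        System.out.println(abbreviate("Doe", "Jane")); // prints "Doe, J."
        System.out.println(abbreviate("Doe", " "));    // prints "Doe"
        System.out.println(abbreviate("Doe", null));   // prints "Doe"
    }
}
```

With the inverted condition from the original method, the `substring(0, 1)` call was reached only for null or empty first names, which is why it either threw or fell through to returning the last name alone.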

patricknee commented 7 years ago

With latest pull from GitHub, this crash is still happening (console output at end):

Year 2000 file: pftaps20000104_wk01.zip
Failing patent (appears to be; could be the following item): US6010912

The DescriptionFigures class in the first comment (second post) of this thread allows this file to be processed, but does discard the information of the figure that cannot be parsed.

I don't seem to be able to re-open the issue. I presume you are notified that I commented, but just to be sure... let me know you see this. Otherwise, I will open a new issue after a bit.

2016-11-11 10:50:43,283 INFO  [       main] US6010912            TransformerCli - Record: 'US6010912' from /Users/patrick/usptoData/grants/2000/pftaps20000104_wk01.zip:1755
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1931)
    at gov.uspto.patent.doc.greenbook.items.DescriptionFigures.read(DescriptionFigures.java:38)
    at gov.uspto.patent.doc.greenbook.fragments.DescriptionNode.read(DescriptionNode.java:42)
    at gov.uspto.patent.doc.greenbook.Greenbook.parse(Greenbook.java:113)
    at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:37)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:67)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:167)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:277)
bgfeldm commented 7 years ago

I fixed a StringIndexOutOfBoundsException that occurs when a paragraph consists only of the matched text, such as a paragraph containing only "FIG. 1".

I also moved the code to its own function so I could easily unit test the code.
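The committed change is not shown in this thread, so the following is only a sketch of one way to handle the case described above. It reuses the `PATENT_FIG` pattern from the DescriptionFigures class posted earlier; the class name `FigureTextSketch` and the `parse` helper are hypothetical. The idea is that `substring(m.end())` is always in range, whereas `substring(m.end() + 1)` throws when the match ends at the last character of the paragraph.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating a bounds-safe split of figure ID and description.
final class FigureTextSketch {

    // Same single-figure pattern as in the DescriptionFigures class posted in this thread.
    private static final Pattern PATENT_FIG =
            Pattern.compile("^(FIG\\.? \\(?\\d{1,3}[A-Za-z]?\\)?)\\b");

    /** Returns {id, description}, or null when the paragraph does not start with a figure ID. */
    static String[] parse(String ptext) {
        Matcher m = PATENT_FIG.matcher(ptext);
        if (!m.lookingAt()) {
            return null;
        }
        String id = m.group(1);
        // m.end() is at most ptext.length(), so this yields "" for a bare "FIG. 1".
        // trim() then drops the separator that the original "+ 1" was skipping.
        String text = ptext.substring(m.end()).trim();
        return new String[] { id, text };
    }

    public static void main(String[] args) {
        String[] bare = parse("FIG. 1");
        System.out.println("[" + bare[0] + "] [" + bare[1] + "]"); // prints "[FIG. 1] []"
    }
}
```

Keeping the parsing in a small static method like this also makes it straightforward to unit test against individual paragraphs, without building a dom4j node first.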

patricknee commented 7 years ago

Did you commit this change? It doesn't seem to be in my latest 'git checkout'...

patricknee commented 7 years ago

Sorry, I seem to have gotten it with a pull.

patricknee commented 7 years ago

Still getting the error on the Y2000 file pftaps20000104_wk01.zip. Looking into it now.

bgfeldm commented 7 years ago

Fixed and code has been checked in.