Pdf Stream validation broken

Sarenor commented 3 weeks ago

This commit breaks PDF-Stream validation since the ZugferdValidator calls the setFilename method of the PdfValidator

        @Override
    public void setFilename(String filename) throws IrrecoverableValidationError {
        this.pdfFilename = filename;
        try {
                        // A Streamed File will never be found like this.
            fileContents=Files.readAllBytes(Paths.get(pdfFilename));
        } catch (IOException ex) {
            throw new IrrecoverableValidationError("Could not read file");
        }
    }

Since the final results depends on string results being written, with no results being written yet when the exception is thrown, any (pdf) file will be considered valid.

jstaerk commented 3 weeks ago

I don't understand, we do need setFilename to actually read the file (where else?) and the validator can and should should catch IrrecoverableValidationError and add it to the validation XML (and sometimes even does so, e.g. validator l 330). And I guess whenever a PDF-stream would be validated setFilename() would never be called because there is no filename at all, so changes in this method should not affect stream validation?

Sarenor commented 3 weeks ago

It's called here in the ZUGFeRDValidator:

public String validate(InputStream inputStream, String fileNameOfInputStream) {
        boolean xmlValidity;
        context.clear();
        StringBuffer finalStringResult = new StringBuffer();
        SimpleDateFormat isoDF = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date = new Date();
        startTime = Calendar.getInstance().getTimeInMillis();
        context.setFilename(fileNameOfInputStream);// set filename without path
        finalStringResult.append("<validation filename='").append(context.getFilename()).append("' datetime='").append(isoDF.format(date)).append("'>");

        boolean isPDF = false;
        byte[] content = new byte[0];
        try {

            if (fileNameOfInputStream == null) {
                optionsRecognized = false;
                context.addResultItem(new ValidationResultItem(ESeverity.fatal, "Filename not specified").setSection(10)
                    .setPart(EPart.pdf));
            }

            PDFValidator pdfv = new PDFValidator(context);
            if (inputStream == null) {
                context.addResultItem(
                    new ValidationResultItem(ESeverity.fatal, "File not found").setSection(1).setPart(EPart.pdf));
            } else if (inputStream.available() < 32) {
                // with less then 32 bytes it can not even be a proper XML file
                context.addResultItem(
                    new ValidationResultItem(ESeverity.fatal, "File too small").setSection(5).setPart(EPart.pdf));
            } else {
                content = IOUtils.toByteArray(inputStream);
                isPDF = ByteArraySearcher.contains(content, new byte[]{'%', 'P', 'D', 'F'});
                XMLValidator xv = new XMLValidator(context);
                if (isPDF) {
                    pdfv.setFilename(fileNameOfInputStream);
                    pdfv.setFileContents(content);

                    optionsRecognized = true;
                    finalStringResult.append("<pdf>");

~~Removing it there would probably work, since the file name is only used for Validation Messages and errors.~~

Edit & Update: Removing it there wouldn't work, because VeraPDF ItemDetails expect the name to not be null.

Either passing in the Filecontents via PdfValidator.setFileContents or a dedicated load file method might be an option?

At least in the validate(InputStream contents, String filename) method, the only contents of the final Result StringBuffer is going to be

finalStringResult.append("<validation filename='").append(context.getFilename()).append("' datetime='").append(isoDF.format(date)).append("'>");

then the exception from the setFilename triggers the try-catch and the final results are processed.

jstaerk commented 3 weeks ago

can you set this.pdfFilename = filename; directly or can we have a non-loading method e.g. with additional params (defaulting to load)? And can we maybe introdcue have one or two tests of the streaming validator? @phax had also requested more streaming and I would not likt anybody to accidentally destroy this functionality and going unnoticed...

Sarenor commented 3 weeks ago

We had some tests for the streaming validator, I'm surprised it didn't catch it actually. Having it with a loadFile boolean sounds like a good compromise, yeah. I'll see what I can do on Monday, including additional test cases.

phax commented 2 weeks ago

Inside the #406 PR I took the liberty to harmonize the "validate" methods and add another overload for byte[] - this also reduces the amount of IO needed, as the file to be validated is read no more then once. However, the general Validator interface is no longer 100% accurate, but I hope that doesn't do any harm. Also I optimized the ByteArraySearcher so that it is more efficient, especially for large non-matching files.

ZUGFeRD / mustangproject

Pdf Stream validation broken #403