OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

PDF driver adds trailer/EOF instructions that make files PDF/X invalid #10436

Open troopa81 opened 1 month ago

troopa81 commented 1 month ago

What is the bug?

I work on CMYK support for QGIS and try to generate PDF/X-4 (ready for print PDF specification) valid files.

QGIS uses GDAL to update PDF metadata (projection, author, creation date..) but when it does, it modifies the file in a way that the generated PDF is no longer PDF/X-4 valid.

Steps to reproduce the issue

Trailer of the original file:

trailer.<<./Size 21 ./Info 1 0 R./Root 6 0 R./ID [
<61323935376139312d653735322d346338632d616361352d663435303037633833333465> <61323935376139312d653735322d346338632d616361352d663435303037633833333465> ].
>>.
startxref.
2676923 .
%%EOF.

Script used to update the metadata:

from osgeo import gdal

ds = gdal.Open("/tmp/test_cmyk_no_gdal.pdf", gdal.GA_Update)
ds.SetMetadataItem("AUTHOR", "Julien")
ds = None  # close the dataset so the update is flushed to disk

End of the file after the update:

.trailer.<<./Size 21 ./Info 1 0 R./Root 6 0 R./ID [
<61323935376139312d653735322d346338632d616361352d663435303037633833333465> <61323935376139312d653735322d346338632d616361352d663435303037633833333465> ].
>>.
startxref.
2676923 .
%%EOF.
1 0 obj.
<< /Author (Julien) /CreationDate (D:20240718093827+02'00') /Producer (Qt 6.8.0) /Title (/home/julien/Nextcloud/Temp/test_cmyk.pdf) >>.
endobj.
xref.
0 1.
0000000000 65535 f .
1 1.
0002677584 00000 n .
trailer.
<< /Info 1 0 R /Prev 2676923 /Root 6 0 R /Size 21 >>.
startxref.
2677734.
%%EOF.

The /Prev 2676923 entry seems to reference the previous trailer, so it might be fine (though I don't know much about the PDF specification), but we have 2 EOF instructions and I don't think it's OK.

I used the Preflight tool ("Prépresse" in French) from Adobe Acrobat Reader pro to check whether the generated files are valid PDF/X-4, and the GDAL-modified one gets an extra error: "Absence d'ID du document" (Missing document ID).

Before GDAL: [screenshot: sans_gdal]

After GDAL: [screenshot: avec_gdal]

Versions and provenance

Additional context

Just a side note: GDAL doesn't manage XMP metadata consistency. If I change the metadata item CREATION_DATE, the related XMP metadata entry is not updated accordingly, and so the PDF is not PDF/X-4 valid.

It looks like GDAL doesn't intend to ensure this consistency, and I plan to handle it in QGIS, so no extra issue here. But please correct me if I'm wrong and you think that it should be fixed in GDAL.
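
To make the consistency problem concrete, here is a minimal sketch of the kind of check a caller (QGIS, for instance) could run. It converts a PDF Info date such as the `/CreationDate` value shown above into the ISO 8601 form that XMP uses for `xmp:CreateDate`, then compares the two strings. The function names are hypothetical, the sketch assumes a fully specified PDF date (the PDF format also allows truncated dates), and how the XMP packet is read or rewritten is deliberately left out.

```python
import re

# Fully specified PDF date, e.g. D:20240718093827+02'00' or D:20240718073827Z
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)(?:00'00'?)?|([+-])(\d{2})'(\d{2})'?)?$"
)


def pdf_date_to_xmp(pdf_date: str) -> str:
    """Convert a PDF date string to the ISO 8601 form used in XMP (hypothetical helper)."""
    match = _PDF_DATE.match(pdf_date)
    if match is None:
        raise ValueError(f"unsupported PDF date: {pdf_date!r}")
    year, month, day, hour, minute, second, zulu, sign, tz_h, tz_m = match.groups()
    stamp = f"{year}-{month}-{day}T{hour}:{minute}:{second}"
    if zulu:
        return stamp + "Z"
    if sign:
        return f"{stamp}{sign}{tz_h}:{tz_m}"
    return stamp  # no timezone information in the PDF date


def dates_consistent(info_creation_date: str, xmp_create_date: str) -> bool:
    """True when the Info date and the XMP date describe the same timestamp string."""
    return pdf_date_to_xmp(info_creation_date) == xmp_create_date
```

The comparison is purely textual, so two dates that denote the same instant in different timezones would be reported as inconsistent; a stricter check would parse both into timezone-aware datetimes first.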

rouault commented 1 month ago

> but we have 2 EOF instructions and I don't think it's OK.

At least for regular PDFs, that's fine. The PDF spec (version 1.7) mentions on page 99: "a file that has been updated several times contains several trailers; each trailer is terminated by its own end-of-file (%%EOF) marker".

I suspect that PDF/X-4 has stronger requirements than the base PDF spec. PDF is a super complicated format, and GDAL mostly does it "by hand" (at least on the writing side). I have no idea what supporting PDF/X-4 would involve. Perhaps the standard update procedure doesn't work for PDF/X-4, and you need to generate a new file, actually updating the original objects instead of appending the updates?
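
The incremental-update layout described in the spec quote can be inspected with a few lines of Python. The sketch below starts from the last `startxref` value, reads the trailer that follows that cross-reference section, and then follows `/Prev` back to older sections. The function names are hypothetical, and the sketch assumes classic `xref` tables with flat trailer dictionaries (no cross-reference streams, no nested dictionaries inside the trailer).

```python
import re

_STARTXREF = re.compile(rb"startxref\s+(\d+)")
_TRAILER = re.compile(rb"trailer\s*<<(.*?)>>", re.DOTALL)
_PREV = re.compile(rb"/Prev\s+(\d+)")


def walk_trailers(data: bytes) -> list:
    """Return the raw trailer dictionaries, newest first, following /Prev."""
    offsets = _STARTXREF.findall(data)
    if not offsets:
        raise ValueError("no startxref keyword found")
    offset = int(offsets[-1])  # the last startxref points at the newest xref section
    trailers = []
    visited = set()  # guards against a /Prev loop in a damaged file
    while offset is not None and offset not in visited:
        visited.add(offset)
        found = _TRAILER.search(data, offset)  # first trailer after this xref section
        if found is None:
            break
        body = found.group(1)
        trailers.append(body)
        prev = _PREV.search(body)
        offset = int(prev.group(1)) if prev else None
    return trailers


def summarize(data: bytes) -> dict:
    """Count %%EOF markers and trailers, and report whether the newest trailer has /ID."""
    trailers = walk_trailers(data)
    return {
        "eof_markers": data.count(b"%%EOF"),
        "trailers": len(trailers),
        "newest_has_id": b"/ID" in trailers[0] if trailers else False,
    }
```

Run on a file updated once, such a summary would be expected to show two `%%EOF` markers and two trailers, which is the normal shape of an incrementally updated PDF.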

> GDAL doesn't manage XMP metadata consistency.

"obviously" not :-)

nyalldawson commented 1 month ago

@troopa81 you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one

rouault commented 1 month ago

> you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one

indeed. But that will just generate a new regular PDF file, not a PDF/X-4.

troopa81 commented 1 month ago

> At least for regular PDFs, that's fine. The PDF spec (version 1.7) mentions on page 99: "a file that has been updated several times contains several trailers; each trailer is terminated by its own end-of-file (%%EOF) marker".

OK, so "maybe" this is because the second trailer lacks an ID entry? I'll try to paste the first one into the second trailer to check whether it complies.

> you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one

> indeed. But that will just generate a new regular PDF file, not a PDF/X-4.

Yes, I don't know whether it's feasible to have a GeoPDF that also complies with the PDF/X-4 format. That would require adding the embedded ICC profile to the XML composition file (but I know little about the way GeoPDFs are exported).

troopa81 commented 2 weeks ago

> OK, so "maybe" this is because the second trailer lacks an ID entry? I'll try to paste the first one into the second trailer to check whether it complies.

I confirm that the issue comes from the missing ID in the second trailer. If I just copy/paste the ID from the previous trailer, it complies.

From the PDF/X-4 specification, I only read:

> The ID key in the file trailer shall be present.
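
The manual experiment described above (copying the ID from the previous trailer) can be sketched as a small post-processing step. This is not a GDAL fix, only an illustration under stated assumptions: the function names are hypothetical, the file uses classic `trailer` dictionaries without nested dictionaries, and the newest trailer is the last one physically present in the file, as is the case after an appended update. Inserting bytes inside the last trailer dictionary does not move the cross-reference section that `startxref` points to, because that section sits before the trailer.

```python
import re

_TRAILER = re.compile(rb"trailer\s*<<(.*?)>>", re.DOTALL)
_ID = re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]")


def newest_trailer_has_id(data: bytes) -> bool:
    """True when the last trailer dictionary in the file carries an /ID array."""
    trailers = list(_TRAILER.finditer(data))
    if not trailers:
        raise ValueError("no classic trailer dictionary found")
    return _ID.search(trailers[-1].group(1)) is not None


def copy_id_to_newest_trailer(data: bytes) -> bytes:
    """Copy the /ID array of the closest earlier trailer into the last trailer."""
    trailers = list(_TRAILER.finditer(data))
    if not trailers:
        raise ValueError("no classic trailer dictionary found")
    newest = trailers[-1]
    if _ID.search(newest.group(1)):
        return data  # nothing to do
    for older in reversed(trailers[:-1]):
        found = _ID.search(older.group(1))
        if found:
            insert_at = newest.end(1)  # position just before the closing >>
            return data[:insert_at] + b" " + found.group(0) + b" " + data[insert_at:]
    raise ValueError("no /ID found in any earlier trailer")
```

Copying the array verbatim mirrors the copy/paste test; as far as I understand the base PDF spec, the second identifier in the array is meant to change when a file is updated, so a proper fix would probably regenerate it instead of reusing it.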

I tried to look for a fix in GDAL. I understand that the Info trailer is set to be updated when we modify the metadata, which led me to this comment. IIRC PoDoFo is in charge of updating/fixing the file on write and so would be the culprit here. But I'm unsure, because the GDAL documentation states that no dependency is used on write.

rouault commented 1 week ago

> I tried to look for a fix in GDAL. I understand that the Info trailer is set to be updated when we modify the metadata, which led me to this comment. IIRC PoDoFo is in charge of updating/fixing the file on write and so would be the culprit here. But I'm unsure, because the GDAL documentation states that no dependency is used on write.

@troopa81 Update is a bit of a mix. Poppler or PoDoFo are used to build the existing PDF object hierarchy, but the update/writing is done "by hand" in GDALPDFBaseWriter::WriteXRefTableAndTrailer() in frmts/pdf/pdfcreatecopy.cpp