edemaine / svgtiler

Tool for drawing diagrams on a grid, combining grids of SVGs into a big SVG figure
MIT License

Compiled PDFs are not reproducibly built #24

Closed by jbosboom 6 years ago

jbosboom commented 6 years ago

Repeatedly invoking svgtiler -p results in PDFs with different hashes. Given that we like to commit these build products in version control for the benefit of those without svgtiler installed, this results in committing files that didn't actually (visibly) change. It also makes it hard for humans to tell which sheet(s) of a workbook changed (so the changes can be reviewed).

This is probably Inkscape's fault for storing a creation timestamp or similar in the compiled PDF, but it would be great to find a workaround.

edemaine commented 6 years ago

Since v1.4.2, I haven't encountered this. Have you? (I also looked for timestamps or similar in the PDF, but couldn't find any.)

jbosboom commented 6 years ago

I am still seeing this behavior with svgtiler 1.5.0 and Inkscape 0.92.2 5c3e80d, 2017-08-06.

I'm busy with the deadline right now but I'll try to reduce a test case for you after.

edemaine commented 6 years ago

Or at least let me know an example on our repo where this occurs. (I've tried some xlsx's but not all.) I assume you're running on Linux?

jbosboom commented 6 years ago

I generated vertical_dominoes_Literal Unset.pdf twice and diffed them. They differ in only four bytes. I used qpdf's "qdf mode" to get a text representation of the PDFs and diffed those. They differ only in the timestamp shown below.

--- old-text.pdf    2018-02-22 20:53:12.399383687 -0500
+++ new-text.pdf    2018-02-22 20:53:57.618647915 -0500
@@ -13,7 +13,7 @@
 %% Original object ID: 6 0
 2 0 obj
 <<
-  /CreationDate (D:20180222204225-05'00)
+  /CreationDate (D:20180222204251-05'00)
   /Producer (cairo 1.15.10 \(http://cairographics.org\))
 >>
 endobj

(They also differ in /ID, but this seems to be automatically generated by qpdf because it changes if I add --deterministic-id to the qpdf command line. The /ID is a 16-byte value, but there were only four bytes of difference in the PDF files.)
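A quick way to locate the differing bytes directly, without going through qpdf, is to compare the two files offset by offset. The following is only a sketch: the helper name `differing_offsets` is hypothetical, and it assumes both builds have the same length (true when only a fixed-width timestamp changes).

```python
def differing_offsets(a: bytes, b: bytes) -> list[int]:
    """Return the byte offsets at which two equal-length inputs differ."""
    if len(a) != len(b):
        # A length change shifts every later byte, so offsets stop being useful.
        raise ValueError("inputs differ in length; diff a text form instead")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

Reading each PDF with `pathlib.Path(...).read_bytes()` and passing the results to this function gives the offsets to inspect in a text editor or hex viewer.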

Yes, this is on Arch Linux. Maybe Inkscape uses a different backend (not cairo) on other platforms.

Due to the impending deadline I'm going to just commit the differing files anyway, but now you've something to go on.

edemaine commented 6 years ago

I'm using Inkscape 0.91 r13725 on Ubuntu which uses Cairo 1.14.6.

You seem to be using a different (later?) version of Inkscape which uses Cairo 1.15.10. So I'm guessing that newer Inkscapes inject the creation date like this. Now that I have an example, I should be able to blank out any such commands.

Of course, this won't help when you and I are recompiling with different Inkscape versions, so we generate different metadata (e.g. different /Producers). But it's better than nothing...

There are no /IDs in the files, so far as I can tell. You should just look at the PDF itself in a text editor or less (go near the bottom), not at qpdf's output.

edemaine commented 6 years ago

I tried installing Inkscape 0.92.2 via https://launchpad.net/~inkscape.dev/+archive/ubuntu/stable and it made no difference (but still used Cairo 1.14.6). So probably it's the difference in Cairo versions...

Anyway, I should have fixed this in 1.5.1 by blanking out /CreationDate if detected in the PDF. Can you test?
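For illustration, blanking the field could look roughly like the sketch below. This is not necessarily how svgtiler 1.5.1 does it; the function name and regular expression are assumptions. It overwrites the date with spaces of the same length, because a PDF's cross-reference table stores byte offsets and changing the file length would invalidate them. It also assumes the Info dictionary is stored uncompressed and the date is a plain literal string, as in the qdf output quoted earlier.

```python
import re

# Matches a literal-string /CreationDate entry such as
#   /CreationDate (D:20180222204225-05'00)
# Assumption: the date string contains no nested or escaped parentheses.
CREATION_DATE = re.compile(rb"(/CreationDate\s*\()([^()]*)(\))")

def blank_creation_date(pdf: bytes) -> bytes:
    """Overwrite every /CreationDate value with spaces, preserving length."""
    return CREATION_DATE.sub(
        lambda m: m.group(1) + b" " * len(m.group(2)) + m.group(3),
        pdf,
    )
```

Because the output has the same length as the input, the rest of the file, including any other metadata such as /Producer, is left untouched.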

jbosboom commented 6 years ago

With this change, I do get the same hashes after rebuilding PDFs I've built.
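A check like this can be scripted by hashing two builds of the same sheet. A minimal sketch, with hypothetical helper names and SHA-256 chosen arbitrarily as the hash:

```python
import hashlib
from pathlib import Path

def file_sha256(path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def is_reproducible(first, second) -> bool:
    """True when two build outputs are byte-for-byte identical."""
    return file_sha256(first) == file_sha256(second)
```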

I suspect someone's already dealt with the problem of putting PDFs in a canonical form for digital signature purposes, but as you note, this is better than nothing.

edemaine commented 6 years ago

https://github.com/matplotlib/matplotlib/pull/6597 seems to be one example of dealing with this, by injecting a CreationDate of SOURCE_DATE_EPOCH if set instead of the current date. I'm guessing Inkscape doesn't support this feature, though.
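The general SOURCE_DATE_EPOCH idea can be sketched as follows. This is not matplotlib's actual code; the function name and the choice to format the date in UTC are assumptions made for illustration.

```python
import os
import time

def pdf_creation_date(environ=os.environ) -> str:
    """Format a PDF date string, honoring SOURCE_DATE_EPOCH when it is set."""
    epoch = environ.get("SOURCE_DATE_EPOCH")
    # Fall back to the current time when the variable is absent.
    seconds = int(epoch) if epoch is not None else time.time()
    # Use UTC so the result does not depend on the builder's time zone.
    return time.strftime("D:%Y%m%d%H%M%SZ", time.gmtime(seconds))
```

With the variable set to a fixed value, every rebuild produces the same /CreationDate string, so no post-processing of the PDF is needed.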