Generating PDF/A conforming PDFs

sheppie123 commented 6 years ago

Is it possible to generate PDFs that conform to PDF/A using WeasyPrint? From Wikipedia:

Other key elements to PDF/A compatibility include:

Audio and video content are forbidden.

JavaScript and executable file launches are forbidden.

All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering. This also applies to the so-called PostScript standard fonts such as Times or Helvetica.

Colorspaces specified in a device-independent manner.

Encryption is disallowed.

Use of standards-based metadata is mandated.
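Some of the items in this list can be spotted with a very rough scan of the raw file. The sketch below is only a heuristic and not a conformance check: the function name and the marker list are illustrative assumptions, names stored inside compressed object streams will be missed, and fonts, color spaces and metadata are not examined at all.

```python
import re

# Raw-byte markers for features that the list above says PDF/A forbids.
# Assumption: this marker list is illustrative, not exhaustive.
FORBIDDEN_MARKERS = {
    "encryption": re.compile(rb"/Encrypt\b"),
    "JavaScript": re.compile(rb"/(?:JavaScript|JS)\b"),
    "file launch action": re.compile(rb"/Launch\b"),
    "audio or video": re.compile(rb"/(?:Sound|Movie|RichMedia)\b"),
}


def find_forbidden_features(pdf_bytes):
    """Return the names of forbidden features whose marker appears in the bytes."""
    return sorted(
        name
        for name, pattern in FORBIDDEN_MARKERS.items()
        if pattern.search(pdf_bytes)
    )
```

An empty result proves nothing about conformance; a non-empty one only points at something worth checking with a real validator.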

Many Thanks

LukasKlement commented 6 years ago

I opened a ticket on PDF/X-3 compliance: https://github.com/Kozea/WeasyPrint/issues/640

Perhaps to start the discussion on what direction WeasyPrint should take, it may be worthwhile to collect the purpose of the different standards:

PDF/A -> a standard used predominantly for document archiving
PDF/X -> a standard used predominantly for professional print (e.g. offset print)

For detailed differences on the two standards, see page 17 of this document: https://www.impressed.de/DOWNLOADS/pdfToolbox_Server/callas_pdfEngine_Reference.pdf

I attach great importance to PDF X, as I believe achieving full print-compliance is an absolute necessity for a mature PDF creation/conversion tool.

liZe commented 6 years ago

I've tried to give Acrobat various PDF files generated by WeasyPrint… It's awful, there are many, many, many things to fix before reaching PDF/A or PDF/X conformance.

I attach great importance to PDF X, as I believe achieving full print-compliance is an absolute necessity for a mature PDF creation/conversion tool.

I agree, but we have a long way to go.

hejsan commented 4 years ago

Hi - opening this can of worms - can we list the things needed to conform to PDF/A? @liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss? I thought it was mostly about not referencing any outside files and embedding all fonts - is weasyprint not complying with that already?

liZe commented 4 years ago

opening this can of worms

🐛🐛🐛🐛🐛🐛🐛🐛

can we list the things needed to conform to PDF/A?

That would be really useful.

@liZe You mention giving Acrobat various PDF files generated by WeasyPrint, how does it tell you what's amiss?

I don’t really remember, but I think that there’s a PDF validator in Acrobat (not in Reader, it’s not free :cry:).

Does anyone know an open source (or at least free) tool to check PDF/A and PDF/X conformance?

I thought it was mostly about not referencing any outside files and embedding all fonts - is weasyprint not complying with that already?

As far as I can remember, there were lots of errors, and most of them were just impossible to fix with Cairo. I think that we need a dedicated PDF generator for that (see #841).

hejsan commented 4 years ago

I seem to recall Apache PDFBox having some features, I'll have to check better though.

I think that we need a dedicated PDF generator for that

Maybe this is another use for a post-processor that would parse through the pdf and do what is needed. Seems like a massive undertaking though if it is supposed to support changing everything to be pdf/a compliant. Might be smart to start by being able to convert simple pdf's that don't include edge cases like embedded video etc.

edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (Maybe Cairo is?). They're literally what make big opensource projects viable and not just a massive liability to the developers.

liZe commented 4 years ago

Might be smart to start by being able to convert simple pdf's that don't include edge cases like embedded video etc.

The current post-processor only knows how to parse PDF files generated by Cairo. It removes a lot of edge cases.

edit: I was looking through #841 and must say I somewhat disagree about getting rid of external dependencies unless they're proving to be severe limiting factors (Maybe Cairo is?). They're literally what make big opensource projects viable and not just a massive liability to the developers.

Of course, removing all external dependencies is not a goal per se. But there are some reasons why it would be interesting to consider getting rid of some of them:

Having non-Python dependencies is the source of many, many, many installation problems, at least on Windows and macOS.
We’ve had many problems with Cairo. More than 20% of the reported issues have the "Cairo" word in their comments.
Cairo releases are … sometimes late. #278 is a good example of why it’s been really frustrating to work with its dev team.
Cairo does a lot of things WeasyPrint’s not interested in. Generating PNG is useful for WeasyPrint, but it could be done with a PDF-to-PNG converter. Cairo is complex, and it will probably not get new PDF-only features anytime soon (the latest stable version is the first one providing metadata and links, for example).
Pango should be useless for us. We use it to break lines, but HTML has requirements that are really different from "normal" use cases. That’s why we have a lot of workarounds for texts. We should use Harfbuzz instead, and break lines using a custom algorithm, just as other browsers do. See #301, for example.

So. Here’s what I think.

Using a "real" PDF generator would be hard but not impossible. I don’t really like ReportLab for many reasons, but something like that would be really useful.
Having a real line-breaking algorithm would make Pango useless.
FontConfig is really convenient for Pango, but it should be used only on Linux where it’s the standard library. We could probably rely on macOS and Windows APIs to find fonts (what do other browsers do?).
We have to keep HarfBuzz.

hejsan commented 4 years ago

Ok, I understand and agree with your points.

I don’t really like ReportLab for many reasons

I agree to steer away from any freemium solutions as they tend to become a liability down the road when they refuse to push features to their "community" versions.

Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

liZe commented 4 years ago

I agree to steer away from any freemium solutions as they tend to become a liability down the road when they refuse to push features to their "community" versions.

:+1:

Do you see this new PDF generator as a separate project or would it be part of WeasyPrint?

It can be a separate project, with a quite low-level API. The hard part is probably to handle fonts, by creating a PangoCairo equivalent.

(If anyone knows how to convert PDF to PNG in pure Python, that would be useful too :unamused:.)

hejsan commented 4 years ago

I found an open-source PDF/A conformance checker that is pretty cool: https://verapdf.org/ (download here: https://verapdf.org/software/). There's both a simple GUI for checking individual files and a command-line tool that can be used for automated testing. It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules).
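For automated testing, the command-line tool could be wrapped roughly like this. It is a sketch under assumptions: the executable is assumed to be on the PATH as `verapdf`, the `--flavour` option is assumed to select the PDF/A profile (check `verapdf --help` for your version), and both function names are made up for illustration.

```python
import subprocess


def build_verapdf_command(pdf_path, flavour="1b", executable="verapdf"):
    """Build the argument list for one veraPDF validation run.

    Assumption: "--flavour" selects the profile, e.g. "1b" for PDF/A-1b.
    """
    return [executable, "--flavour", flavour, str(pdf_path)]


def run_verapdf(pdf_path, flavour="1b", executable="verapdf"):
    """Run veraPDF and return whatever report it prints to stdout."""
    command = build_verapdf_command(pdf_path, flavour, executable)
    # No check=True here: a validator may exit with a nonzero code for a
    # non-compliant file, and the report is still wanted in that case.
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.stdout
```

The report format depends on the veraPDF version and options, so parsing it is left out on purpose.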

liZe commented 4 years ago

I found an opensource PDF/A conformance checker that is pretty cool: https://verapdf.org/

That’s really cool, thanks!

It gives you a simple breakdown report that links to details about each error (they're all hosted in this list: https://github.com/veraPDF/veraPDF-validation-profiles/wiki/PDFA-Part-1-rules)

That’s really impressive.

Having PDF/A conformance is probably one of the best features we can get once we have a new PDF generator. I’m currently working on that :wink:. (That = the generator, not the PDF/A conformance yet)

hejsan commented 4 years ago

I’m currently working on that

Cool, do you have an open repo for it yet? I had been pondering the same. Thinking out loud: would PDF/A conformance have to be an option, as it would impact speed and available features?

malnajdi commented 4 years ago

@liZe is teasing a lot about this new generator. If you need help let me know 😄

oleg-medovikov commented 3 years ago

How is it going?

liZe commented 3 years ago

How is it going?

Pretty well! The new PDF generator (called pydyf) is now used in master, almost everything works fine (missing SVG images support is the major point we have to fix).

An online PDF validator thinks that many PDF files we generate are already PDF/A compliant, but I suppose that we still have a lot of work (for tags at least, I think). We have to check with veraPDF too.

As explained in #1232, the next step is to have a master branch with the same features as the current stable versions. When it’s done, we’ll have to implement features that people want, but we can’t do everything at the same time :wink:. And before that, we’re currently implementing sponsored features (#247, #1057) that may also be useful for you!

guidocioni commented 3 years ago

If I get the latest version from Conda, is pydyf already included? I've been trying to produce quite simple PDF/A-compliant files (no images or weird components), and from the file info I can see that the PDF version is only 1.5 and they're not PDF/A compliant. :( So maybe the version that I'm using (52.4) still does not include pydyf support?

grewn0uille commented 3 years ago

Hello @guidocioni!

The latest version on Conda (52.5) doesn’t include pydyf. All 52.x versions are using (and will use) Cairo.

Currently there is no release working with pydyf, but the current master branch uses it so you can give it a try if you want 😀

guidocioni commented 3 years ago

That would be good, but the problem is that where I'm deploying this I can only use conda to install anything :D Is there a way to install the master branch with conda? As you can imagine, converting a PDF to PDF/A using solely a conda/Python installation is also kind of a nightmare :D

grewn0uille commented 3 years ago

I don’t think there is an easy way to install the master branch directly with Conda, but you can use pip in a Conda environment and so install the master branch with pip.

guidocioni commented 3 years ago

I don’t think there is an easy way to install the master branch directly with Conda, but you can use pip in a Conda environment and so install the master branch with pip.

eh eh I wish it would be so easy. Unfortunately I can only give a list of dependencies to install through conda forge and access a Python environment running with Spark. No access to pip or the underlying unix system. Thanks for the help anyway! I hope someday this will make its way in the stable release

guidocioni commented 3 years ago

@grewn0uille I managed to install the latest 53.0b1 version (which uses pydyf) in our system and produce a PDF. When looking in the file info I can see it was generated according to the 1.7 standard but when checking in the online validator unfortunately I get these errors:

The value of the key Flags is 8 but must be either symbolic or non-symbolic.
The value of the key Flags is 10 but must be either symbolic or non-symbolic.
The value of the key Flags is 8 but must be either symbolic or non-symbolic.
The document does not conform to the requested standard.
The document contains fonts without embedded font programs or encoding information (CMAPs).
The document doesnot conform to the PDF 1.7 standard.

any idea where are those coming from?

liZe commented 3 years ago

any idea where are those coming from?

They come from a bug that’s just been fixed by f804d59. Thanks a lot for the report!
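For readers wondering what the validator meant: in a PDF font descriptor, /Flags is a bit field, and exactly one of the Symbolic (value 4) and Nonsymbolic (value 32) bits is expected to be set. The values 8 and 10 reported above set neither. The sketch below only decodes the values; what f804d59 actually changed is not shown in this thread, and the function names are illustrative.

```python
# Bit values of the /Flags entry of a PDF font descriptor (PDF 1.7).
FONT_FLAGS = {
    1: "FixedPitch",
    2: "Serif",
    4: "Symbolic",
    8: "Script",
    32: "Nonsymbolic",
    64: "Italic",
    65536: "AllCap",
    131072: "SmallCap",
    262144: "ForceBold",
}
SYMBOLIC = 4
NONSYMBOLIC = 32


def decode_font_flags(flags):
    """Return the names of the flag bits set in a /Flags value."""
    return [name for bit, name in FONT_FLAGS.items() if flags & bit]


def has_valid_symbolic_flag(flags):
    """True when exactly one of Symbolic and Nonsymbolic is set."""
    return bool(flags & SYMBOLIC) != bool(flags & NONSYMBOLIC)


print(decode_font_flags(8), has_valid_symbolic_flag(8))    # ['Script'] False
print(decode_font_flags(10), has_valid_symbolic_flag(10))  # ['Serif', 'Script'] False
```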

guidocioni commented 3 years ago

They come from a bug that’s just been fixed by f804d59. Thanks a lot for the report!

Nice, when implementing f804d59 the PDF is now 1.7 conform but still not PDF/A compliant :(

liZe commented 3 years ago

Nice, when implementing f804d59 the PDF is now 1.7 conform but still not PDF/A compliant :(

And there’s a long way ahead… But at least now we can generate the PDF we want.

grewn0uille commented 3 years ago

Hello!

(The survey is now closed. Thanks for all your answers! We’ll share the results soon 😉)

If you’re interested in PDF/A compliance, we created a short survey where you can give a boost to this feature and help us to improve WeasyPrint 😉

Vote for it!

guidocioni commented 3 years ago

And there’s a long way ahead… But at least now we can generate the PDF we want.

So is there no way to force pydyf to produce a PDF/A compliant file? Like changing default fonts or other things... Or is it just something that can be controlled right now.

liZe commented 3 years ago

So is there no way to force pydyf to produce a PDF/A compliant file? Like changing default fonts or other things... Or is it just something that can be controlled right now.

It can’t be controlled right now, at least without code being added to WeasyPrint. pydyf is theoretically able to generate PDF/A-compliant PDFs, but the current code of WeasyPrint doesn’t follow the PDF/A rules. pydyf is only the first step.
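To give an idea of the kind of code that would have to be added: one of the PDF/A rules quoted at the top of this thread is standards-based metadata, which in practice means an XMP packet declaring the PDF/A part and conformance level. The sketch below only builds such a packet as text. It is not WeasyPrint or pydyf code, the function name is hypothetical, it assumes parts 1 to 3 only, and a real document needs much more (embedded fonts, an output intent, and so on).

```python
def build_pdfa_xmp_packet(part=3, conformance="B"):
    """Return a minimal XMP packet carrying the PDF/A identification fields."""
    if part not in (1, 2, 3):
        raise ValueError("this sketch only handles PDF/A parts 1, 2 and 3")
    # PDF/A-1 defines levels A and B; parts 2 and 3 also define level U.
    if conformance not in ("A", "B", "U") or (part == 1 and conformance == "U"):
        raise ValueError("unsupported conformance level for this part")
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
        ' <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
        '  <rdf:Description rdf:about=""\n'
        '    xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">\n'
        f'   <pdfaid:part>{part}</pdfaid:part>\n'
        f'   <pdfaid:conformance>{conformance}</pdfaid:conformance>\n'
        '  </rdf:Description>\n'
        ' </rdf:RDF>\n'
        '</x:xmpmeta>\n'
        '<?xpacket end="w"?>'
    )
```

The packet would still have to be stored in the PDF as the catalog's metadata stream, which is the part that needs support in the generator itself.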

guidocioni commented 3 years ago

It can’t be controlled right now, at least without code being added to WeasyPrint. pydyf is theoretically able to generate PDF/A-compliant PDFs, but the current code of WeasyPrint doesn’t follow the PDF/A rules. pydyf is only the first step.

Ok thanks. For the moment I'm using Ghostscript to directly convert what comes out of WeasyPrint to PDF/A: WeasyPrint writes to a temporary file that serves as the input, and the output goes to the filesystem. Of course it would be amazing to have such a feature built into the tool. Anyway, keep up the good work!

winklemint commented 7 months ago

Ok thanks. For the moment I'm using ghostscript piping input and output, where the input is a temporary file where weasyprint writes and the output is in the filesystem, to directly convert what's coming out of weasyprint to PDF/A but of course it would be amazing to have such a feature built-in the tool. Anyway keep up the good work!

Hi, can you share the code showing how you convert an existing PDF to PDF/A using Ghostscript? I am trying, but it is not working for me.

guidocioni commented 7 months ago

Hi, can you share the code showing how you convert an existing PDF to PDF/A using Ghostscript? I am trying, but it is not working for me.

This is something I used in the past, but I'm not sure it still works now:

import os
import subprocess

def convert_to_pdfa(sourceFile, targetFile):
    # Ghostscript options: write a PDF/A file with device-independent colors
    ghostScriptExec = ['gs', '-dPDFA', '-dBATCH', '-dNOPAUSE',
                       '-sColorConversionStrategy=UseDeviceIndependentColor',
                       '-sDEVICE=pdfwrite', '-dPDFACompatibilityPolicy=2']
    # Because of a Ghostscript bug that does not allow parameters longer than
    # 255 characters, run gs from the target directory and pass only the output
    # file name. Using cwd= leaves the working directory of this process
    # untouched, even when gs fails.
    targetDir = os.path.dirname(os.path.abspath(targetFile))
    try:
        subprocess.check_output(
            ghostScriptExec + ['-sOutputFile=' + os.path.basename(targetFile),
                               # absolute path, since gs runs in another directory
                               os.path.abspath(sourceFile)],
            cwd=targetDir, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        raise RuntimeError("command '{}' returned with error (code {}): {}".format(
            e.cmd, e.returncode, e.output))

winklemint commented 7 months ago

Hi, thanks for this solution. I tried a different policy and multiple changes to make the file PDF/A-3B compliant, and veraPDF validated it. I am now looking for a way to embed an XML file in it to comply with the Factur-X standard. Any suggestion or help is highly appreciated. Thanks

FelixSchwarz commented 3 months ago

I am now looking for a way to embed an XML file in it to comply with the Factur-X standard. Any suggestion or help is highly appreciated.

@winklemint WeasyPrint does not use GitHub discussions, but maybe you can open an issue about Factur-X support. My idea is to gather snippets and advice on how to generate Factur-X PDFs using WeasyPrint.

Kozea / WeasyPrint

Generating PDF/A conforming PDFs #630