daitss / core

DAITSS: Dark Archive In The Sunshine State
GNU General Public License v3.0

Error while processing PDF #781

Open szanati opened 8 years ago

szanati commented 8 years ago

I received the following error on a package with a PDF file:

error while processing 1(sip-files/09-06-2013.pdf): bad status http://transform.fda.fcla.edu/transform/pdf_norm?location=file:/var/daitss/data/work/ETAL9VQ5Q_V6OA41/files/original/1/data: 500

/opt/pdfapilot/pdfaPilot /var/daitss/data/work/ETAL9VQ5Q_V6OA41/files/original/1/data --fontfolder=/usr/share/fonts/msttcorefonts/ --onlypdfa --substitute --outputfile=/var/daitss/tmp/d20160317-22104-1k0gniu/data/transformed.pdf --report=XML,IFNOPDFA,PATH=/var/daitss/tmp/d20160317-22104-1k0gniu/pdfapilot_report.xml failed, output:

Input /var/daitss/data/work/ETAL9VQ5Q_V6OA41/files/original/1/data
Pages 32
PDFA Regular
Progress 100 %
Summary Corrections 0
Summary Errors 0
Summary Warnings 0
Summary Infos 0
Duration 00:05
Error 1010 The PDF file may be corrupt (unable to open PDF file).
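
For triage, output like this can be scanned for the summary counters and the final fatal line. The following is only a minimal Python sketch: it assumes each field is on its own line (as in the pdfaPilot run quoted later in this thread), and the function and pattern names are hypothetical, not part of DAITSS or pdfaPilot.

```python
import re

# Sample text copied from the pdfaPilot output in this issue,
# assuming one field per line.
SAMPLE_OUTPUT = """\
Input /var/daitss/data/work/ETAL9VQ5Q_V6OA41/files/original/1/data
Pages 32
PDFA Regular
Progress 100 %
Summary Corrections 0
Summary Errors 0
Summary Warnings 0
Summary Infos 0
Duration 00:05
Error 1010 The PDF file may be corrupt (unable to open PDF file).
"""

# Fatal line, e.g. "Error 1010 The PDF file may be corrupt ...".
# Per-check lines that start with "Errors <count>" do not match this pattern.
FATAL_LINE = re.compile(r"^Error\s+(\d+)\s+(.+)$")
# Summary counter, e.g. "Summary Errors 0".
SUMMARY_LINE = re.compile(r"^Summary\s+(\w+)\s+(\d+)$")


def parse_pdfapilot_output(text):
    """Return (summary_counts, fatal_errors) found in pdfaPilot CLI output."""
    summary = {}
    fatal = []
    for raw in text.splitlines():
        line = raw.strip()
        match = SUMMARY_LINE.match(line)
        if match:
            summary[match.group(1)] = int(match.group(2))
            continue
        match = FATAL_LINE.match(line)
        if match:
            fatal.append((int(match.group(1)), match.group(2)))
    return summary, fatal


if __name__ == "__main__":
    counts, errors = parse_pdfapilot_output(SAMPLE_OUTPUT)
    print(counts)
    print(errors)
```

Here the summary reports zero errors while the fatal line still carries code 1010, so a check that looks only at the "Summary Errors" counter would miss this failure.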

szanati commented 8 years ago

I ran the PDF through the GUI description service, which reported it as well-formed and valid, with a successful event outcome. I tried the package on Ripple and got the same error. On Ripple I also edited the daitss-config.yml file: under transform_service I changed "skip_undefined" from false to true, and the package then went through the PDF steps. It did not archive because of a separate issue on Ripple involving squid, which will be handled next week.

szanati commented 8 years ago

The package on production is in the stashspace named Github_781, in the directory /var/daitss/data/stash/Github_781/ETAL9VQ5Q_V6OA41. On Ripple it is in the workspace /var/daitss/data/work/ENF28E4YI_X7LTMP, and the original package is in /var/daitss/ops/stephen/AA00038892_00002.

cchou commented 8 years ago

This package fails during PDF to PDF/A conversion with pdfaPilot. We would need to submit an issue ticket to the pdfaPilot vendor.

Alternatively, you can try to get this package ingested by turning off PDF/A normalization.

cchou commented 8 years ago

Here are the instructions: https://github.com/daitss/core/wiki/Turn-off-PDF-to-PDFA-normalization

lydiam commented 7 years ago

Email from Carol:

Response from callas. It looks like you can fix those PDFs with pdfaPilot, though I am not sure how you want to pursue this, since it seems to mean the SIPs will be changed.

-Carol

---------- Forwarded message ----------
From: callas software support <3rdlevelsupport@callassoftware.com>
Date: Fri, Apr 21, 2017 at 8:21 AM
Subject: Re: Problems with many PDF files using PDFaPilot
To: "cchoufl@gmail.com" <cchoufl@gmail.com>

Hello Carol,

As David has already mentioned, the two cases have different underlying issues; in both cases, however, the PDF structure seems to be corrupt. Acrobat is still able to display the file, but the more thorough analysis by the PDF/A validator/converter fails. We will investigate further to make sure that this assumption is correct.

There is, however, already a known workaround for that problem: both files can actually be converted when they are first converted to PostScript and back to PDF. You can do so by using ./pdfaPilot --redistill on the command line.

Would that work for you as a - at least temporary - solution?

Best regards, Dietrich

--------------- Original Message ---------------
From: callas software support team [support@callassoftware.com]
Sent: 19.04.2017 21:15
To: cchoufl@gmail.com; d.seggern@callassoftware.com
Subject: Re: Problems with many PDF files using PDFaPilot

Hi Carol,

I've reproduced the problem for both files. The underlying cause appears to be different for both files, they will be looked at by development to determine what is causing this and whether anything can be done about it.

I'll keep you posted! David.

--------------- Original Message ---------------
From: carol chou [cchoufl@gmail.com]
Sent: 19/04/2017 7:50
To: d.seggern@callassoftware.com
Subject: Re: Problems with many PDF files using PDFaPilot

Hi Dietrich,

Our sys admin has installed the new version of pdfaPilot. Some of the problem files can now be converted, but the following two still produce errors during conversion:

http://www.fcla.edu/daitss-test/files/00004-04-2009.pdf

Progress 100 %

Errors 16660 Device process color used but no PDF/A OutputIntent

Errors 114 Font not embedded (and text rendering mode not 3)

Errors 24 Annotation has no Flags entry

Errors 24 Annotation not set to print

Errors 6280 CharSet missing for Type 1 font

Summary Corrections 72

Summary Errors 23102

Summary Warnings 0

Summary Infos 0

Duration 00:54

Error 1000 Unknown error (unknown exception)

http://www.fcla.edu/daitss-test/files/09-06-2013.pdf

[cchou@ripple GH_781]$ /opt/pdfapilot-6.2.256/pdfaPilot 09-06-2013.pdf --fontfolder=/usr/share/fonts/msttcorefonts/ --onlypdfa --substitute --outputfile=09-06-2013-o.pdf --report=XML,IFNOPDFA,PATH=report.xml

Serialization This pdfaPilot instance is running with a Coldspare or Developer license and may only be used in production as a temporary replacement for a full license on another computer.

Input /home/cchou/pdfaError/GH_781/09-06-2013.pdf

Pages 32

PDFA Regular

Progress 100 %

Summary Corrections 0

Summary Errors 0

Summary Warnings 0

Summary Infos 0

Duration 00:01

Error 1010 The PDF file may be corrupt (unable to open PDF file).

Here is the pdfaPilot version the sys admin has installed for us:

callas pdfaPilot CLI 6.2.256 (x64)
2000-2016 callas software gmbh

Can you take another look and suggest some solutions?

Thanks,

-Carol

On Mon, Oct 10, 2016 at 5:09 AM, Dietrich von Seggern <d.seggern@callassoftware.com> wrote:

Hi Carol,

what version of pdfaPilot are you using?

I was not able to reproduce any issues with the current release (callas pdfaPilot CLI 6.0.245 (x64)) on a Mac. The reason may be either the font situation or the version.

Best regards, Dietrich

--
Dietrich von Seggern | Managing Director
callas software GmbH | Schönhauser Allee 6/7 | 10119 Berlin | Germany
Tel +49.30.44390310 | Fax +49.30.4416402 | www.callassoftware.com
Amtsgericht Charlottenburg, HRB 59615 | Geschäftsführung: Olaf Drümmer, Ulrich Frotscher, Dietrich von Seggern

Meet us at:

callas VIP Event, Berlin: November 7 - 8 (+ 9) https://en.xing-events.com/vip2016.html

PDF Day Australia, Sydney: November 25 https://en.xing-events.com/PDFday-Australia.html

On 9 Oct 2016, at 03:35, carol chou <cchoufl@gmail.com> wrote:

Hi Mr. Seggern,

I am working with Florida Virtual Campus, which has been using pdfaPilot to convert the PDFs in its archive to PDF/A. Recently, we have run into pdfaPilot errors with some of the PDFs in the archive. Can you please see if this is something that pdfaPilot can fix? The PDFs can be downloaded at

http://www.fcla.edu/daitss-test/files/SCV20100314.pdf

http://www.fcla.edu/daitss-test/files/00004-04-2009.pdf

http://www.fcla.edu/daitss-test/files/09-06-2013.pdf

FYI, I am enclosing the pdfaPilot error at the end of my email too.

Thanks,

Carol

--
David van Driessche
Mail: david.van.driessche@fourpees.com
Cell: +32 473 89 44 46
Skype: david-van-driessche

Four Pees
Nijverheidskaai 14
9040 Sint-Amandsberg, Belgium

www.fourpees.com
ref:_00D201c3C._500w01bNASQ:ref

lydiam commented 7 years ago

Do we still have the original SIPs? We may need to fix the PDFs in the original SIPs (in consultation with their owners), resubmit them, and abort the stashed SIPs that contain corrupt files. We'll need to discuss this.

lydiam commented 7 years ago

This is worth emailing UF about, since they seem to have made multiple submissions under 3 different package names. They may need to authorize us to 'abort' some of the duplicates, which would leave fewer problem packages to deal with. Next, determine whether we still have the SIPs. If we do, we should experiment with correcting one of the problem PDFs with pdfaPilot by converting it to PostScript and back to PDF, as callas suggested. Based on the results of that experiment, we can decide how to proceed.
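
One way to run that experiment is to script the two steps: the --redistill round trip callas suggested in the forwarded email, then the same PDF/A conversion that failed. The Python sketch below is untested against pdfaPilot itself and makes labeled assumptions: callas only named the --redistill flag, so combining it with an input file and --outputfile is a guess, and the helper names and output file names are hypothetical. The binary path, font folder, and remaining flags are copied from the failing command at the top of this issue.

```python
import subprocess
from pathlib import Path

# Paths copied from the failing command in this issue; adjust per host.
PDFAPILOT = "/opt/pdfapilot/pdfaPilot"
FONT_FOLDER = "/usr/share/fonts/msttcorefonts/"


def redistill_command(src, dst):
    """Step 1: round-trip the PDF through PostScript.

    ASSUMPTION: --redistill takes an input file plus --outputfile.
    callas only named the flag, not its full syntax.
    """
    return [PDFAPILOT, str(src), "--redistill", f"--outputfile={dst}"]


def convert_command(src, dst, report):
    """Step 2: the PDF/A conversion flags used by the failing command."""
    return [
        PDFAPILOT, str(src),
        f"--fontfolder={FONT_FOLDER}",
        "--onlypdfa", "--substitute",
        f"--outputfile={dst}",
        f"--report=XML,IFNOPDFA,PATH={report}",
    ]


def plan(original, workdir):
    """Return both commands; every output is written inside workdir."""
    workdir = Path(workdir)
    redistilled = workdir / "redistilled.pdf"
    return [
        redistill_command(original, redistilled),
        convert_command(redistilled, workdir / "transformed.pdf",
                        workdir / "pdfapilot_report.xml"),
    ]


def run(commands):
    """Run commands in order and stop at the first failure.

    ASSUMPTION: a non-zero exit status signals failure.
    Returns (failed_command, output), or (None, "") if all succeeded.
    """
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return cmd, result.stdout + result.stderr
    return None, ""
```

Calling plan("09-06-2013.pdf", "/tmp/gh781") only builds the two command lines, so they can be reviewed before anything is run. Because all output goes to a scratch directory, the file in the original SIP stays untouched until its owners agree to a change.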

lydiam commented 7 years ago

I did some validation of the PDFs remaining in the DAITSS Github_781 stashspace using description.fcla.edu. The results:

So it appears that the valid and well-formed PDFs may archive if PDF/A normalization with pdfaPilot is turned off. UF may need to recreate the other two.

Carol - can you confirm my conclusions?

lydiam commented 7 years ago

I attempted to obtain details about the validity of the 4 remaining PDFs from Adobe Acrobat 9's Preflight feature but didn't have much success.

szanati commented 7 years ago

The original packages for this issue (AA00038892_00002, AA00047064_00008, and UF00098620_00421) are in /var/daitss/ops/exceptions/tickets/GitHub_781 on darchive.