bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

pdf-extract in warc2htmlwarc uses >1 processor #17

Closed wwaites closed 4 years ago

wwaites commented 5 years ago

Watching with top shows it using up to 200% of a processor. In the envisaged production pipeline this is not ideal, as it might lead to oversubscription of CPU resources. On the other hand, the OS scheduler ought to ration the CPU to provide sensible concurrency, effectively limiting it to 100% on a loaded machine.

lpla commented 5 years ago

We reproduced this, and it is an issue in the Java code of https://github.com/bitextor/pdf-extract, as it happens even when calling PDFextract.jar directly from the command line.

Transferring the issue.

dionwiggins commented 5 years ago

We are tracing it now. We don't have the means to reproduce the issue, but we suspect that Apache PDFBox may be spinning up a thread.
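
One way to check the extra-thread suspicion from inside the JVM is to list every live Java thread with the CPU time it has consumed. The following is a minimal sketch using only the standard `java.lang.management` API. The class name `ThreadCpuReport`, its `Row` type and the `snapshot()` method are hypothetical names chosen for illustration; they are not part of pdf-extract or PDFBox.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical diagnostic helper, not part of pdf-extract.
public class ThreadCpuReport {

    /** One report row: thread name and CPU time consumed so far, in milliseconds. */
    public static final class Row {
        public final String name;
        public final long cpuMillis;

        Row(String name, long cpuMillis) {
            this.name = name;
            this.cpuMillis = cpuMillis;
        }
    }

    /** Returns the live Java threads of this JVM, busiest first. */
    public static List<Row> snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Row> rows = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return rows; // this JVM cannot measure per-thread CPU time
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);  // null if the thread has already exited
            long nanos = mx.getThreadCpuTime(id);    // -1 if the thread is no longer alive
            if (info == null || nanos < 0) {
                continue;
            }
            rows.add(new Row(info.getThreadName(), nanos / 1_000_000));
        }
        rows.sort(Comparator.comparingLong((Row r) -> r.cpuMillis).reversed());
        return rows;
    }

    public static void main(String[] args) {
        for (Row r : snapshot()) {
            System.out.printf("%8d ms  %s%n", r.cpuMillis, r.name);
        }
    }
}
```

The snapshot only covers the JVM it runs in, so it would have to be called from within the extraction process, for example just before exit. It also only covers Java-level threads. If those do not account for the usage seen in top, the remainder may come from VM-internal threads such as garbage collection or JIT compilation, which this thread has not confirmed.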

lpla commented 4 years ago

Any progress on this?

dionwiggins commented 4 years ago

Hi Leo,

We are moving to Poppler. Mui is aiming to finish the full rewrite for Poppler, including language ID, sentence joining and the relevant document adjustments, within this week. As such, we don’t plan to fix this issue, since we will stop using PDFBox.

Regards,

Dion Wiggins Founder and CTO Omniscien Technologies

dionwiggins commented 4 years ago

The new version using Poppler makes the PDFBox version no longer relevant. Closing this one.