Closed wwaites closed 4 years ago
We reproduced this issue, and it is a problem in the https://github.com/bitextor/pdf-extract Java code: it happens even when calling PDFextract.jar directly from the command line.
Transferring the issue.
We are tracing it now. We don't have the means to reproduce the issue ourselves, but we suspect that Apache PDFBox may be spinning up an extra thread.
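One way to check the extra-thread suspicion from inside the JVM is to list the live Java threads with their accumulated CPU time. The following is only a minimal sketch using the standard `java.lang.management` API; the class name `ThreadCpuReport` is hypothetical and not part of pdf-extract. Note that it only sees Java-level threads: JVM-internal threads such as garbage-collection or JIT workers do not show up here, so extra CPU use with no matching Java thread would point in that direction instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of pdf-extract or PDFBox.
public class ThreadCpuReport {

    /** Returns one line per live Java thread: name, state, CPU time in ms (-1 if unavailable). */
    public static List<String> snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Check "supported" first: isThreadCpuTimeEnabled() throws if CPU timing is unsupported.
        boolean cpuTiming = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        List<String> lines = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            if (info == null) {
                continue; // thread exited between the two calls
            }
            long cpuNanos = cpuTiming ? mx.getThreadCpuTime(id) : -1;
            long cpuMillis = cpuNanos < 0 ? -1 : cpuNanos / 1_000_000;
            lines.add(String.format("%-40s %-15s cpu=%d ms",
                    info.getThreadName(), info.getThreadState(), cpuMillis));
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

Calling `ThreadCpuReport.snapshot()` at the end of an extraction run (or periodically during one) would show whether any thread other than `main` has accumulated significant CPU time.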
Any progress on this?
Hi Leo,
We are moving to Poppler. Mui is aiming to finish the full rewrite for Poppler this week, including language ID, sentence joining and the relevant document adjustments. As such, we don’t plan to fix this issue, since we will stop using PDFBox.
Regards,
Dion Wiggins Founder and CTO Omniscien Technologies
In reply to: Leopoldo Pla, Monday, January 13, 2020 7:08 PM. Subject: Re: [bitextor/pdf-extract] pdf-extract in warc2htmlwarc uses >1 processor (#17)
The new version using Poppler makes the PDFBox version no longer relevant. Closing this one.
Watching with top shows pdf-extract using up to 200% of a processor. In the envisaged production pipeline this is not ideal, as it might lead to oversubscription of CPU resources. On the other hand, the OS scheduler ought to ration the CPU to provide sensible concurrency, effectively limiting the process to 100% on a loaded machine.
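For planning around oversubscription, it can help to see how many processors the JVM believes it has, since it sizes some internal thread pools (for example garbage-collection workers) from that number. The sketch below only reads the value; the class name `CpuBudget` is hypothetical. On HotSpot JVMs that support it (JDK 10 and later), the `-XX:ActiveProcessorCount=1` option overrides this count, which may reduce the extra CPU use, though that has not been verified for pdf-extract.

```java
// Hypothetical helper, not part of pdf-extract.
public class CpuBudget {

    /** Number of processors the JVM reports as available to it (always at least 1). */
    public static int visibleProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("JVM sees " + visibleProcessors() + " processor(s)");
    }
}
```

Running it once normally and once as `java -XX:ActiveProcessorCount=1 CpuBudget` shows whether the override is honoured by the JVM in use.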