wwaites closed 4 years ago
Manually confirmed with
/data/internet_archive/wide00006/WIDE-20121023153717-crawl339/WIDE-20121023164013-03400.warc.gz
the last few lines of the output are:
File: Input Stream, Start extract
Oct 29, 2019 12:27:08 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
Oct 29, 2019 12:27:08 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Using fallback font 'LiberationSerif-Bold' for 'TimesNewRomanPS-BoldMT'
File: Input Stream, Extract success.
File: Input Stream, Start extract
Oct 29, 2019 12:27:12 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
File: Input Stream, Extract success.
File: Input Stream, Start extract
and there it rests, having started extraction but never finishing.
Are you using the latest Bitextor master version and Python requirements? We fixed a bug last week with the stdout messages in the python-pdfextract wrapper.
Maybe not, I'll try updating. The testing loop is necessarily quite long here because of trying to run over a large volume of data.
Updated, and it looks to me like it is still stuck, though this time without the helpful messages suggesting that pdf-extract is the place to look.
Hi William, my team does not have access to the warc files or the pipeline. Are you able to determine which file or files it is getting stuck on when calling PDFExtract? If you can provide the file, we can trace it.
A better solution would be to give them access. That way they can reproduce this problem as well as #19.
Please send me (by email, not in the ticket) the desired usernames and corresponding ssh keys.
Hi. Was this further tested?
New version using Poppler makes the PDFBox version no longer relevant. Closing this one.
These are the warc files corresponding to the processes that haven't finished (24 out of 2000). There is a possibility that the reason is not a bug in the software but issues on valhalla, though I would have expected any I/O problems to have manifested with the processes exiting with an error rather than hanging around indefinitely.
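For anyone trying to narrow down which inputs hang rather than exit with an error, one option is to run each extraction under a timeout and record the inputs that exceed it. The following is only a minimal sketch, not part of Bitextor or PDFExtract: the command name `pdf-extract-command` and the 600-second limit are placeholders that would need to be replaced with whatever the pipeline actually invokes.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'timeout'."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except OSError:
        # For example, the executable does not exist.
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


def find_stuck(paths, make_cmd, timeout_s):
    """Return the paths whose command did not finish within timeout_s seconds."""
    return [p for p in paths if run_with_timeout(make_cmd(p), timeout_s) == "timeout"]


if __name__ == "__main__":
    # Hypothetical command line: substitute the real extraction call here.
    def make_cmd(path):
        return ["pdf-extract-command", path]

    # Input files are taken from the command line; 600 s is an arbitrary limit.
    for stuck in find_stuck(sys.argv[1:], make_cmd, 600):
        print(stuck)
```

Inputs reported this way are the ones that neither succeeded nor failed within the limit, which separates hangs from ordinary errors and gives a short list of files to pass on for tracing.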