bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

stuck warcs #18

Closed wwaites closed 4 years ago

wwaites commented 5 years ago

These are the warc files corresponding to the processes that haven't finished (24 out of 2000). There is a possibility that the reason is not a bug in the software but issues on valhalla, though I would have expected any I/O problems to have manifested with the processes exiting with an error rather than hanging around indefinitely.

/data/internet_archive/wide00006/WIDE-20121028153429-crawl336/WIDE-20121028153450-05571.warc.gz
/data/internet_archive/wide00006/WIDE-20121017085808-crawl425/WIDE-20121017104118-02484.warc.gz
/data/internet_archive/wide00006/WIDE-20121102011838-crawl338/WIDE-20121102020243-05888.warc.gz
/data/internet_archive/wide00006/WIDE-20121023153717-crawl339/WIDE-20121023164013-03400.warc.gz
/data/internet_archive/wide00006/WIDE-20120920092931-crawl422/WIDE-20120920094504-00082.warc.gz
/data/internet_archive/wide00006/WIDE-20120924054320-crawl418/WIDE-20120924065833-00437.warc.gz
/data/internet_archive/wide00006/WIDE-20121001083328-crawl413/WIDE-20121001083328-00448.warc.gz
/data/internet_archive/wide00006/WIDE-20121006165216-crawl417/WIDE-20121006170546-00517.warc.gz
/data/internet_archive/wide00006/WIDE-20121023153717-crawl339/WIDE-20121023164011-03399.warc.gz
/data/internet_archive/wide00006/WIDE-20120928193306-crawl411/WIDE-20120929002956-00592.warc.gz
/data/internet_archive/wide00006/WIDE-20121019204102-crawl335/WIDE-20121019211212-03260.warc.gz
/data/internet_archive/wide00006/WIDE-20121017183734-crawl413/WIDE-20121017183734-02121.warc.gz
/data/internet_archive/wide00006/WIDE-20121029021638-crawl417/WIDE-20121029025635-03685.warc.gz
/data/internet_archive/wide00006/WIDE-20120929110813-crawl422/WIDE-20120929111739-00829.warc.gz
/data/internet_archive/wide00006/WIDE-20121101095730-crawl420/WIDE-20121101111421-04481.warc.gz
/data/internet_archive/wide00006/WIDE-20120919235735-crawl411/WIDE-20120920020039-00057.warc.gz
/data/internet_archive/wide00006/WIDE-20121116063602-crawl420/WIDE-20121120201438-00002.warc.gz
/data/internet_archive/wide00006/WIDE-20121008154025-crawl427/WIDE-20121008161034-01215.warc.gz
/data/internet_archive/wide00006/WIDE-20120919203346-crawl338/WIDE-20120919231120-00025.warc.gz
/data/internet_archive/wide00006/WIDE-20121017151105-crawl412/WIDE-20121017153711-02301.warc.gz
/data/internet_archive/wide00006/WIDE-20120924092521-crawl420/WIDE-20120924094718-00457.warc.gz
/data/internet_archive/wide00006/WIDE-20121008000636-crawl425/WIDE-20121008005714-01299.warc.gz
/data/internet_archive/wide00006/WIDE-20121014090919-crawl414/WIDE-20121014100203-01882.warc.gz
/data/internet_archive/wide00006/WIDE-20121013192915-crawl425/WIDE-20121013201235-02002.warc.gz
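
One way to turn an indefinite hang into an ordinary failure is to run each per-WARC job under a wall-clock limit. The following is only a minimal sketch using the Python standard library; `run_with_timeout` and the stand-in child command are hypothetical, and the real pipeline invocation is not shown in this thread.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome.

    Returns ("ok", 0), ("failed", <nonzero exit code>) or ("timeout", None).
    subprocess.run kills the child itself when the timeout expires.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return ("timeout", None)
    status = "ok" if proc.returncode == 0 else "failed"
    return (status, proc.returncode)


if __name__ == "__main__":
    # Stand-in for a stuck job: a child that sleeps far longer than the limit.
    stuck = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_timeout(stuck, timeout_s=1))
```

With a limit like this, the 24 stuck files would show up as timeouts in the job log instead of as processes that never exit. The right value for the limit depends on how long a healthy WARC takes, which is not stated here.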

wwaites commented 5 years ago

Manually confirmed with

/data/internet_archive/wide00006/WIDE-20121023153717-crawl339/WIDE-20121023164013-03400.warc.gz

the last few lines of the output are:

File: Input Stream, Start extract
Oct 29, 2019 12:27:08 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
Oct 29, 2019 12:27:08 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Using fallback font 'LiberationSerif-Bold' for 'TimesNewRomanPS-BoldMT'
File: Input Stream, Extract success.
File: Input Stream, Start extract
Oct 29, 2019 12:27:12 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
File: Input Stream, Extract success.
File: Input Stream, Start extract

and there it rests, having started extraction but never finishing.
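
To narrow down which PDF inside the WARC is responsible, the log itself can be scanned for a "Start extract" line that has no result line after it. This is a rough sketch under one assumption: every extraction that finishes prints some other line beginning with "File:" (only the "Extract success." form appears in the output above, so the failure form is a guess). The function name `open_extraction` is hypothetical.

```python
def open_extraction(log_lines):
    """Return the 1-based ordinal of the extraction that started but never
    reported a result, or None if every started extraction was closed."""
    count = 0
    is_open = False
    for line in log_lines:
        line = line.strip()
        if not line.startswith("File:"):
            continue  # skip other output, e.g. PDFBox font warnings
        if line.endswith("Start extract"):
            count += 1
            is_open = True
        else:
            # Assumption: any other "File:" line closes the current extraction.
            is_open = False
    return count if is_open else None


if __name__ == "__main__":
    excerpt = [
        "File: Input Stream, Start extract",
        "File: Input Stream, Extract success.",
        "File: Input Stream, Start extract",
        "File: Input Stream, Extract success.",
        "File: Input Stream, Start extract",
    ]
    print(open_extraction(excerpt))  # prints 3
```

Run over the complete log rather than the last few lines, the ordinal would indicate which PDF handed to the extractor (counting from the start of the WARC) never came back.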

lpla commented 5 years ago

Are you using the latest Bitextor master and up-to-date Python requirements? We fixed a bug last week with the stdout messages in the python-pdfextract wrapper.

wwaites commented 5 years ago

Maybe not, I'll try updating. The testing loop is necessarily quite long here because of trying to run over a large volume of data.

wwaites commented 5 years ago

Updated and looks to me like it is still stuck, though this time without the helpful messages suggesting that pdf-extract is the place to look.

dionwiggins commented 5 years ago

Hi William, my team does not have access to the warc files or the pipeline. Are you able to determine which file or files it is getting stuck on when calling PDFExtract? If you can provide the file, we can trace it.

wwaites commented 5 years ago

A better solution would be to give them access. That way they can reproduce this problem as well as #19.

Please send me (by email, not in the ticket) the desired usernames and corresponding SSH keys.

lpla commented 4 years ago

Hi. Was this further tested?

dionwiggins commented 4 years ago

The new version uses Poppler, so the PDFBox-based version is no longer relevant. Closing this one.