AriZoneVibes / ServLibScrapper

Download manuals from ServLib locally and convert them to PDF

Scrapper should suck-in PDF text as well as background images #1

Open powerbroker opened 2 years ago

powerbroker commented 2 years ago

Feed the scrapper with https://servlib.com/panasonic/telephone/kx-ts2351rub-kx-ts2351ruw.html and compare the result with the source preview.

There are tons of text labels missing in the resulting PDF.

ServLib renders the PDF text inside the <div class="pagedoc"/> tag of their pages. Roughly 3 PDF pages are rendered when the URL has no "?start=${page_number}" parameter, and a single PDF page otherwise.

It would be great if the Scrapper started from the manual web page and pulled in all the required HTML pages (the site renders the total number of PDF pages on its first page in the <dt>Pages</dt><dd>51</dd> construct inside <div class="file_desc">), extracted <div class="pagedoc"/> from each of them, and joined the extracted HTML with the backgrounds.

The resulting document may be (zipped) HTML instead of PDF
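
A rough sketch of that flow, assuming requests and BeautifulSoup, and assuming each remaining page can be fetched with the ?start= parameter (the exact indexing is a guess and may need adjusting); the class names are taken straight from the description above:

```python
# Sketch (untested): read the total page count from the first page, fetch each
# page via ?start=, and collect the <div class="pagedoc"> blocks into one HTML file.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://servlib.com/panasonic/telephone/kx-ts2351rub-kx-ts2351ruw.html"

def fetch(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

first = fetch(BASE_URL)

# Total page count is rendered as <dt>Pages</dt><dd>51</dd> inside <div class="file_desc">.
desc = first.find("div", class_="file_desc")
pages_dt = desc.find("dt", string="Pages")
total_pages = int(pages_dt.find_next_sibling("dd").get_text(strip=True))

page_divs = []
for page in range(1, total_pages + 1):
    # Assumption: ?start= counts pages 1..total_pages; the real indexing may differ.
    soup = fetch(f"{BASE_URL}?start={page}")
    page_divs.extend(str(div) for div in soup.find_all("div", class_="pagedoc"))

# Join everything into a single (optionally zipped) HTML document; the background
# images would still need to be downloaded and referenced from a local folder.
with open("manual.html", "w", encoding="utf-8") as fh:
    fh.write("<html><body>\n" + "\n".join(page_divs) + "\n</body></html>")
```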

AriZoneVibes commented 1 year ago

Hi ^^ Glad you are finding the script useful. I now understand what the issue is. At the time, I didn't realize ServLib stores manuals in different ways, in this case as PDF text instead of PNG files. I'll work on figuring out how to also grab that information from the website. Good idea on exporting it as HTML, I'll look into adding that function as well.

AriZoneVibes commented 1 year ago

Seems like everything under the pdfbg element builds the manual page. We could probably grab that, place it inside some HTML so all the pages can be stacked together, and, based on the user's options, export it as HTML (with the background images downloaded to a local folder) or PDF.

A way to know automatically which method to use would be to make a request and look for an element with the class name pdfbg. If it exists, scrape the PDF-style pages; if not, look for the PNG images. This might come at a later point, to get the new download method out sooner.

Edit: Sorry if I'm basically repeating what you said, I'm just wrapping my head around it.
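
A minimal sketch of that detection idea, again assuming requests/BeautifulSoup; the two scrape_* helpers are hypothetical placeholders for the existing PNG path and the new pagedoc path:

```python
import requests
from bs4 import BeautifulSoup

def scrape_pdf_style(soup):
    """Hypothetical: extract <div class="pagedoc"> text plus background images."""
    ...

def scrape_png_style(soup):
    """Hypothetical: existing path that downloads the pre-rendered PNG pages."""
    ...

def detect_and_scrape(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    # If any element carries the "pdfbg" class, the manual is stored PDF-style.
    if soup.find(class_="pdfbg") is not None:
        return scrape_pdf_style(soup)
    return scrape_png_style(soup)
```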

robertschulze commented 1 year ago

Maybe it would be easier to automate the retrieval with ChromeDriver, using the integrated screenshot functionality on the <div> tag(?)
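
A hedged sketch of that approach, assuming Selenium with ChromeDriver available and reusing the pdfbg class name mentioned above:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://servlib.com/panasonic/telephone/kx-ts2351rub-kx-ts2351ruw.html?start=1")
    page_div = driver.find_element(By.CLASS_NAME, "pdfbg")
    # WebElement.screenshot writes a PNG of just that element.
    page_div.screenshot("page_1.png")
finally:
    driver.quit()
```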

AriZoneVibes commented 1 year ago

A disadvantage would be that it loses the text, so the resulting PDF cannot be searched.

Right now, I'm leaning more towards the pagedoc-extraction approach described above.

I'm still open to this other solution though. For example, what are the benefits of using Selenium over just requesting the website? Would it be better to run it through OCR than to try to extract the text elements from the web page?

robertschulze commented 1 year ago

Yes, that's true, but the text can be recovered afterwards with a free online PDF OCR service or, if it has to stay in Python, with Tesseract. I fear that trying to reconstruct the page from the HTML and the background will be a lot of effort (getting the size, positioning, etc. correct). An alternative would be printing the website to PDF, which should preserve the text, and then cropping to the actual manual page.
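
A small sketch of the Tesseract route, assuming pytesseract and Pillow are installed alongside a local Tesseract binary; it turns a screenshotted page (e.g. from the element-screenshot approach above) back into a searchable PDF:

```python
import pytesseract
from PIL import Image

# OCR a rendered page image and write a searchable, text-layer PDF.
page = Image.open("page_1.png")
pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, extension="pdf")
with open("page_1.pdf", "wb") as fh:
    fh.write(pdf_bytes)
```

For the print-to-PDF alternative, headless Chrome's --print-to-pdf flag can produce the PDF directly from the URL, though cropping it down to the actual manual page would still be a separate step.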