danny0838 / webscrapbook

A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0

WebScrapBook only downloads the preview of a picture #306

Closed. Golddouble closed this issue 1 year ago

Golddouble commented 1 year ago

Hello,

When I try to capture the following page: https://forum.mxlinux.org/viewtopic.php?t=72747

The result is quite good. I am especially interested in the picture in post #8, and it is included in the capture, as you can see when you look into the ZIP.

But this picture is only a preview of the full-size picture. When I archive forum threads locally, I also need the full picture, and it looks like it is not included. When you click on the picture in the captured page inside the ZIP file, it does not open the picture from the local archive; it opens the picture on the web.

Question: Is it possible to configure WebScrapBook so that the full picture is also included in the local archive?

I would appreciate an answer. Thank you.

danny0838 commented 1 year ago

Configure "Download linked files:" and "Included file types for downloading linked files:" to download linked resource files.

Golddouble commented 1 year ago

Thank you.

This is what my "Download linked files:" setting looks like: (screenshot: k20221207-213733)

This is the link from the forum post #8 to the full picture: https://forum.mxlinux.org/download/file.php?id=27250&mode=view

When I go to that link while logged in, the full picture appears in the browser. (I was actually expecting the link to be something like https://forum.mxlinux.org/*.png or *.jpg, but it isn't. Strange.)

What should the line look like that I would have to add to the field "Included file types for downloading linked files"?

Thank you.

danny0838 commented 1 year ago

Set "Download linked files:" to "Match HTTP header and URL file extension" and add a rule with the extension of the filename given in the Content-Disposition HTTP header, which is usually the filename offered when saving the image with the browser. In this case it's probably jpg.

For the rare case that the filename is not available from the HTTP headers, you can add a rule with the MIME type given in the Content-Type HTTP header, which can be inspected through the browser's DevTools. In this case it's probably mime:image/jpeg.
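
To illustrate why the URL alone is not enough here, the following is a minimal Python sketch of the matching idea described above. It is not WebScrapBook's actual implementation: the function names, the `use_headers` flag, the rule set as a plain Python set, and the example header values (including the filename `screenshot.jpg`) are all assumptions made for illustration. Only the rule forms `jpg` and `mime:image/jpeg` come from the reply above.

```python
from email.message import Message
from pathlib import PurePosixPath
from typing import Mapping, Optional, Set
from urllib.parse import urlsplit


def extension_from_url(url: str) -> str:
    """Return the lowercase extension of the URL path, without the dot."""
    return PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()


def filename_from_content_disposition(value: str) -> Optional[str]:
    """Extract the filename parameter from a Content-Disposition value."""
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


def should_download(url: str, headers: Mapping[str, str],
                    rules: Set[str], use_headers: bool) -> bool:
    """Hypothetical check: does a linked file match any rule?"""
    rules = {rule.lower() for rule in rules}

    # 1. Extension taken from the URL path (file.php -> "php").
    if extension_from_url(url) in rules:
        return True
    if not use_headers:
        return False

    # 2. Extension of the filename announced in Content-Disposition.
    disposition = headers.get("Content-Disposition")
    if disposition:
        filename = filename_from_content_disposition(disposition)
        if filename:
            ext = PurePosixPath(filename).suffix.lstrip(".").lower()
            if ext in rules:
                return True

    # 3. MIME type from Content-Type, matched against "mime:..." rules.
    mime = headers.get("Content-Type", "").split(";", 1)[0].strip().lower()
    return bool(mime) and f"mime:{mime}" in rules


if __name__ == "__main__":
    url = "https://forum.mxlinux.org/download/file.php?id=27250&mode=view"
    # Assumed header values; check the real ones in the browser's DevTools.
    headers = {
        "Content-Disposition": 'inline; filename="screenshot.jpg"',
        "Content-Type": "image/jpeg",
    }
    print(should_download(url, headers, {"jpg"}, use_headers=False))  # False
    print(should_download(url, headers, {"jpg"}, use_headers=True))   # True
```

With URL matching only, the link looks like a `php` file, so a `jpg` rule never applies. Once the headers are consulted, either the announced filename or the MIME type can match.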

Golddouble commented 1 year ago

Cool. Thank you very much for your kind help.

This was successful. I have it now in my local storage.

The only thing I had to do was change "Download linked files:" from "Match URL file extension" to "Match HTTP header and URL file extension". "Included file types for downloading linked files:" was already configured for all pictures. (Yes, it was jpg.)

Just a question out of interest:

Here is the webscrapbook output:

Capturing linked page (1) https://forum.mxlinux.org/app.php/help/faq ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/contactadmin ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&folder=inbox ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=ucp_notifications ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=ucp_notifications&mode=notification_options ...
Capturing linked page (1) https://bugs.mxlinux.org/ ...
Capturing linked page (1) https://github.com/MX-Linux ...
Error downloading file (): URL is empty.
Error downloading file (): URL is empty.
Error downloading file (): URL is empty.
Error downloading file (): URL is empty.
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703659 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703659/report ...
Capturing linked page (1) https://screenrec.com/share/ivSTLkY9QG ...
Error downloading file (): URL is empty.
Capturing linked page (1) https://wiki.archlinux.org/title/fan_speed_control ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703662 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703662/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703663 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703663/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703666 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703666/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703670 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703670/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703671 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703671/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703733 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703733/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703737 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703737/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=703738 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/703738/report ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?i=pm&mode=compose&action=quotepost&p=704428 ...
Capturing linked page (1) https://forum.mxlinux.org/app.php/post/704428/report ...
Capturing linked page (1) https://forum.mxlinux.org/viewonline.php ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?mode=delete_cookies ...
Capturing linked page (1) https://zumaclub.ru/ ...
Capturing linked page (1) https://www.phpbb.com/ ...
Error downloading file (https://www.phpbb.com/community/styles/prosilver/theme/images/bg_button.gif): 404 Not Found
Error downloading file (https://www.phpbb.com/assets/css/gradient2b.gif): 404 Not Found
Error downloading file (https://www.phpbb.com/assets/images/headers/header_changelog.jpg): 404 Not Found
Error downloading file (https://www.phpbb.com/assets/images/headers/header_modx.jpg): 404 Not Found
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?mode=privacy ...
Capturing linked page (1) https://forum.mxlinux.org/ucp.php?mode=terms ...
Rebuilding links...
Saving data...
Saved to "~/Downloads/WebScrapBook/data/20221208095450747/index.html"
Done.

The original source of this image was https://forum.mxlinux.org/download/file.php?id=27250&mode=view. I was expecting to find this address in the output above, because I thought it lists every address the extension tried to download. Is that not the case? Why do I not find it there?

danny0838 commented 1 year ago

Only linked pages are shown in the log. Resource files are not.
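
Since resource files do not appear in the log, one way to confirm that an image was saved is to look at the capture folder itself. The following is a minimal Python sketch, not part of WebScrapBook. The function name `list_saved_images` and the extension set are made up for illustration, and it assumes the capture was saved as a folder (as in the log above) rather than as a ZIP archive.

```python
from pathlib import Path

# Extensions treated as images here; adjust to your own rules.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def list_saved_images(capture_dir):
    """Return relative paths of image files found under capture_dir."""
    root = Path(capture_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


if __name__ == "__main__":
    # Folder taken from the "Saved to" line of the log above.
    for name in list_saved_images("~/Downloads/WebScrapBook/data/20221208095450747"):
        print(name)
```

If the full-size picture was downloaded, it should show up in this listing next to the preview image.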

Golddouble commented 1 year ago

OK. Thank you.