danny0838 / webscrapbook

A browser extension that captures web pages to a local device or backend server for future retrieval, organization, annotation, and editing. This project inherits from the legacy Firefox add-on ScrapBook X.

First time user questions and issues #216

Closed sinthome closed 2 years ago

sinthome commented 3 years ago

Hi, I'm trying to use the WebScrapBook with Firefox to capture a siterip of a largish website. I have the link depth set to 3 and tried to limit it to only the domains I wanted, but I must have neglected to do so properly and now it seems to have captured everything I wanted but is still running on some random links that are too far afield. Additionally, there is still nothing saved to the download folder. Is there a way to cancel the current operation and still recover what was already captured? Is it in some temp file location? The save format is set to HTZ but is there a way to just recover the uncompressed folder version and manually prune out the extraneous captures? I think it is probably 20gb+ of material already.

danny0838 commented 3 years ago

Is there a way to cancel the current operation and still recover what was already captured? The save format is set to HTZ but is there a way to just recover the uncompressed folder version and manually prune out the extraneous captures? I think it is probably 20gb+ of material already.

No and no.

Since everything is messed up due to the incorrect options, the only safe way is to fix the options and perform the capture again.

Say there's a page at http://example.com/foo.html with the following content:

<a href="path/bar.html">link</a>

If bar.html is excluded from in-depth capture, the page should be rewritten as:

<a href="http://example.com/path/bar.html">link</a>

If bar.html is instead included in the in-depth capture, the page should be rewritten as:

<a href="bar.html">link</a>

Since links in all pages need to be rewritten, you probably can't correct everything easily by merely deleting bar.html after you find it was incorrectly included.
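
As a rough illustration only (not WebScrapBook's actual code), the rewriting decision can be sketched like this, where `captured` is an assumed map from original URLs to their in-archive filenames:

// Illustrative sketch of in-depth link rewriting (not WebScrapBook's actual code).
// `captured` maps original URLs to the filenames saved inside the archive.
function rewriteLink(href: string, pageUrl: string, captured: Map<string, string>): string {
  const absolute = new URL(href, pageUrl).href;  // resolve the link against the page it appears on
  const local = captured.get(absolute);
  if (local !== undefined) return local;         // included: point to the saved copy
  return absolute;                               // excluded: keep an absolute link to the live site
}

const captured = new Map([["http://example.com/path/bar.html", "bar.html"]]);
rewriteLink("path/bar.html", "http://example.com/foo.html", captured);
// -> "bar.html" when bar.html is captured; "http://example.com/path/bar.html" otherwise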

Is it in some temp file location?

The captured data is stored in the browser's internal storage, but it may be cleared after the capture completes or when a new browser session starts. In any case, for the reason above it's not recommended to hand-craft a result from it. Performing the capture again is the only way to get everything working right.


If you have trouble setting the filter and need some assistance, export your options and include them here.

sinthome commented 3 years ago

Thanks for the quick reply, I will try again. I honestly am not at all knowledgeable in html code or the standard expressions that you use, so if you can help with some completely novice explanations that would be much appreciated.

If I want to capture everything at a domain, what is the best approach? Should I use the "Include these URLs" box? Do I need to specify anything in the URL filter? The website I am trying to archive has several different categories and numerous articles and a message forum that are indexed in a "blog" type format with up to a thousand pages with "page/1" to "/page/1000" addresses. Does that mean I need to set the link depth to 1000?

danny0838 commented 3 years ago

Thanks for the quick reply, I will try again. I honestly am not at all knowledgeable in html code or the standard expressions that you use, so if you can help with some completely novice explanations that would be much appreciated.

If I want to capture everything at a domain, what is the best approach? Should I use the "Include these URLs" box?

Yes, you probably need to set it to filter for the pages you want.

You may need to search for a good regular expression syntax tutorial (e.g. https://refrf.shreyasminocha.me/ may be a good start) and learn from it. Regular expressions are quite "standardized", powerful, and widely used in many programming languages and software, but they do have a learning curve for beginners.

For a simple "include all URLs under the example.com domain" case you can use something as simple as /^http://example\.com//, as the tooltip says. But you still need some basic knowledge of regular expressions to avoid getting an unexpected result.
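
As a rough sketch of what such an include pattern does (the escaped slashes below are only a requirement of JavaScript regex literals, not necessarily of the option field), it is essentially a regular expression tested against every discovered URL:

// Illustration only: how an "include all URLs under example.com" pattern
// behaves when tested against discovered links (not WebScrapBook's actual code).
const include = /^http:\/\/example\.com\//;

const urls = [
  "http://example.com/path/bar.html",  // matches: followed by the in-depth capture
  "http://example.com/page/2",         // matches: followed by the in-depth capture
  "http://other.example.org/foo.html", // no match: not followed
];

for (const url of urls) {
  console.log(url, include.test(url) ? "included" : "excluded");
}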

Do I need to specify anything in the URL filter?

It depends on your use case. This is to filter out some unwanted URLs that would normally be included by the "Include these URLs".
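
For example (hypothetical patterns, not taken from your site), you could include everything under the domain while excluding user profile pages; conceptually the two filters combine like this:

// Hypothetical include/exclude combination (illustration only; the /user/ path
// is a made-up example of an unwanted section).
const include = /^http:\/\/example\.com\//;
const exclude = /^http:\/\/example\.com\/user\//;

function shouldFollow(url: string): boolean {
  // A URL is followed only if it matches the include pattern
  // and does not match the exclude pattern.
  return include.test(url) && !exclude.test(url);
}

shouldFollow("http://example.com/page/2");     // true
shouldFollow("http://example.com/user/alice"); // false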

The website I am trying to archive has several different categories and numerous articles and a message forum that are indexed in a "blog" type format with up to a thousand pages with "page/1" to "/page/1000" addresses. Does that mean I need to set the link depth to 1000?

Strictly speaking, it depends on how many pages an index page links to (for example, if an index page links to pages 1, 2, ..., 20, you probably need a depth of 1000/20 = 50). But you can simply set it to 1000 if you don't want to do the math.
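
In other words, each hop through the pagination only gets you as far as the pages the current index links to, so a rough estimate (assuming each index page links to about 20 of its neighbors) is:

// Rough depth estimate for a paginated index (assumption: each index page
// links to about `linksPerIndex` other index pages).
function estimateDepth(totalPages: number, linksPerIndex: number): number {
  return Math.ceil(totalPages / linksPerIndex);
}

estimateDepth(1000, 20); // 50, as in the example above
estimateDepth(1000, 1);  // 1000: worst case, only "next page" links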

However, the in-depth capture of WebScrapBook is designed to capture only several related pages together. It's not designed to mirror a whole site, and there may be potential issues if you really do that. You should also save as a folder rather than HTZ for a large capture, as the browser probably can't compress 20 GB of data due to limited memory. All in all, a specialized web site mirroring tool is generally more recommended for such a case.

sinthome commented 3 years ago

Great, thanks. That's enough info for me to tinker and see what I can accomplish. The mirroring software I've tried doesn't seem to produce a very accurate result. When I use WebScrapBook on a smaller sample it performs excellently, so I am hopeful.

sinthome commented 3 years ago

An update on my experiment: I adjusted the settings and tried to do the full site, but it ran for many hours and then crashed. I tried again with a more limited depth and filtered out some unimportant pages (particularly all the user profile links). It seems to have finished capturing, but it has been "rebuilding links" for at least the past few hours and Firefox is using a huge amount of memory. We will see if it maxes out on RAM and crashes again.

danny0838 commented 3 years ago

This is possible. A browser may have restrictions on memory or other resources, making it incapable of capturing a very large number of pages. Most specialized site crawlers use an advanced DBMS during the capture, which is unfortunately not available in browsers.

However, a text-based site is very unlikely to go beyond a few GB of data. It's possible that you have specified a bad pattern, causing too many unwanted pages to be included and reach 20 GB. Improving your regular expression may solve the problem.