danny0838 / webscrapbook

A browser extension that captures web pages to a local device or backend server for future retrieval, organization, annotation, and editing. This project inherits from the legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0

Performance issue (eating RAM) #383

Open mbnoimi opened 5 months ago

mbnoimi commented 5 months ago

Hi,

I'm downloading a website to a depth of 3 within the same domain. My laptop has 16 GB of RAM. Within less than 3 hours, the extension pushed RAM usage to 90%, which forced me to hard-restart my laptop. This issue occurs with big websites only (my website is about 1.5 GB, mostly pure HTML).

Is there any workaround to improve the performance?

danny0838 commented 5 months ago

There's probably not too much you can do besides upgrading the hardware. Saving to the backend server may perform better in some cases, though.

mbnoimi commented 5 months ago

> There's probably not too much you can do besides upgrading the hardware. It may be more performant by saving to the backend server in some cases, though.

I use WebHTTrack and it works pretty well, but for some reason my cookies don't work properly with it. That's why I use webscrapbook: it deals with cookies behind the scenes.

mbnoimi commented 5 months ago

> There's probably not too much you can do besides upgrading the hardware

BTW, why does webscrapbook store all the scraped data in memory and then save it in the last step? Why doesn't it save the pages one by one, just like wget and httrack?

danny0838 commented 5 months ago

> BTW, Why webscrapbook stores all the scraped data in the memory then save them in the last step? Why it doesn't save them one by one just like wget and httrack?

This is not true. Intermediate data is mostly saved to the browser storage, which ultimately lives on disk in some form.

The browser extension API is so limited that it cannot load files that have already been downloaded to the local filesystem. When capturing multiple web pages, each saved page needs to be loaded again so that its links to other downloaded pages can be rewritten, and that is not possible until all pages have been downloaded. As a result, we have to keep all downloaded pages in the browser storage, rewrite them, and only then save them to the local filesystem.
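
To illustrate why the rewrite has to wait, here is a minimal Python sketch of the two-phase idea described above. It is not WebScrapBook's actual code: the dictionary standing in for the browser storage, the `local_name` naming scheme, and the function names are all hypothetical, and the regex only handles simple double-quoted `href` attributes.

```python
import re
from typing import Dict

# Hypothetical stand-in for the browser storage holding intermediate pages.
IntermediateStore = Dict[str, str]

# Simplified link matcher: double-quoted href attributes only.
HREF_RE = re.compile(r'href="([^"]+)"')


def local_name(url: str, index: int) -> str:
    """Hypothetical naming scheme for the saved file of a captured page."""
    return f"page{index}.html"


def capture_all(pages: Dict[str, str]) -> IntermediateStore:
    """Phase 1: keep every downloaded page in the store, unmodified."""
    store: IntermediateStore = {}
    for url, html in pages.items():
        store[url] = html
    return store


def rewrite_and_export(store: IntermediateStore) -> Dict[str, str]:
    """Phase 2: with the full page set known, rewrite links and export."""
    names = {url: local_name(url, i) for i, url in enumerate(store)}

    def rewrite(match):
        target = match.group(1)
        # Links to captured pages become local; all others stay untouched.
        return f'href="{names.get(target, target)}"'

    return {names[url]: HREF_RE.sub(rewrite, html) for url, html in store.items()}


# Page "a" links to page "b", which is only downloaded after "a".
pages = {
    "https://example.com/a": '<a href="https://example.com/b">B</a>',
    "https://example.com/b": '<a href="https://example.com/a">A</a> '
                             '<a href="https://other.example/x">X</a>',
}
exported = rewrite_and_export(capture_all(pages))
```

In this sketch, page `a` cannot be finalized when it is first downloaded, because the local name of page `b` is not known until `b` has been captured as well. That is the reason every page stays in the intermediate store until the end.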