danny0838 / webscrapbook

A browser extension that captures web pages to local device or backend server for future retrieval, organization, annotation, and edit. This project inherits from legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0

Editing after capture does not remove files referenced only by deleted elements #351

Open iiv3 opened 11 months ago

iiv3 commented 11 months ago

Capture a page with images. Open the captured page from the scrapbook sidebar. Use "DOM Eraser" from the WSB toolbar to remove (some of) the images. Use "Save" from the WSB toolbar.

Even though "index.html" no longer references the images, the image files remain in place. This happens with both "folder" and "htz" with the backend server.
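To make the symptom easy to check, here is a minimal sketch that lists files in an item folder that "index.html" no longer points to. It assumes a "folder" item with "index.html" at the top level and the resources stored beside it. The function name `find_unreferenced_files` is hypothetical and is not part of WebScrapBook. It only looks at `src`, `href`, and `poster` attributes, so anything referenced from CSS, `srcset`, or scripts would be reported as unreferenced by mistake.

```python
import posixpath
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class _RefCollector(HTMLParser):
    """Collect the values of src/href/poster attributes in an HTML document."""

    URL_ATTRS = {"src", "href", "poster"}

    def __init__(self):
        super().__init__()
        self.refs = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                self.refs.add(value)


def find_unreferenced_files(item_dir, index_name="index.html"):
    """Return files under item_dir that index_name does not reference.

    Hypothetical helper: only plain HTML attributes are inspected.
    """
    item_dir = Path(item_dir)
    collector = _RefCollector()
    collector.feed((item_dir / index_name).read_text(encoding="utf-8"))

    referenced = {index_name}
    for ref in collector.refs:
        parts = urlsplit(ref)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # skip absolute URLs, data: URIs, and pure fragments
        referenced.add(posixpath.normpath(unquote(parts.path)))

    on_disk = {
        p.relative_to(item_dir).as_posix()
        for p in item_dir.rglob("*")
        if p.is_file()
    }
    return sorted(on_disk - referenced)
```

Running it on an item before and after "DOM Eraser" plus "Save" should show the erased images appearing in the returned list.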

If editing is done before capture, the erased image files are never stored.

However, it is not always possible to remove all unwanted elements on a live website whose JS code constantly loads more and more content.


The original Scrapbook used to remove files that were no longer linked/used, so I kind of expected the same behavior.

You might consider this a feature request.

If my observations are correct, on "Save" the changes are made in place, so leaving the old files behind is just a side effect.

I would consider it an improvement if, on "Save", a new archive were created and the old one moved to the "recycle bin". (It might be a good idea to do that on recapture too.)
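A rough sketch of that idea, assuming a "folder" item on disk and a list of the files that should survive. The names `rebuild_item`, `keep`, and `recycle_dir` are made up for illustration, and this is not how WebScrapBook or its backend actually saves items.

```python
import shutil
from pathlib import Path


def rebuild_item(item_dir, keep, recycle_dir):
    """Copy only the files in `keep` to a fresh folder, then swap it in.

    `keep` is an iterable of paths relative to `item_dir`. The previous
    folder is moved under `recycle_dir` instead of being deleted.
    Returns the new location of the old folder.
    """
    item_dir = Path(item_dir)
    recycle_dir = Path(recycle_dir)
    old = recycle_dir / item_dir.name
    if old.exists():
        # Assumption: no earlier version with the same name is in the bin.
        raise FileExistsError(old)

    staging = item_dir.with_name(item_dir.name + ".new")
    staging.mkdir()

    for rel in keep:
        src = item_dir / rel
        if not src.is_file():
            continue  # a reference may point at a file that was never saved
        dst = staging / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also copies timestamps

    recycle_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(item_dir), str(old))  # old version goes to the bin
    staging.rename(item_dir)              # new version takes its place
    return old
```

The hard part is not shown here: deciding what goes into `keep`, and carrying over item metadata such as the "Source" URL.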

danny0838 commented 11 months ago

No, legacy ScrapBook doesn't support this feature. Actually there has been a similar request.

It's not easy to scan for all unused resources in an item, especially when taking into account JavaScript-related content, CSS images, shadow roots, erased content (which is revertible unless explicitly deleted), etc. We probably cannot implement it in the near future.
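As an illustration of just one case from that list, CSS images: a file referenced only through `url(...)` in a stylesheet or a `style` attribute is invisible to a scan of `src` and `href` attributes, so a separate pass is needed. The sketch below is a simplified assumption, not WebScrapBook code. The helper name `css_url_refs` is hypothetical, and the regular expression ignores CSS escapes, comments, and `@import "file.css"` without `url()`.

```python
import re

# Matches url(foo), url('foo') and url("foo"); deliberately simplistic.
_CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def css_url_refs(css_text):
    """Return the raw targets of url(...) tokens found in css_text."""
    return [m.group(2) for m in _CSS_URL.finditer(css_text)]
```

Every other source in the list (scripts, shadow roots, erased-but-revertible content) would need its own handling of this kind, which is where the effort adds up.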

iiv3 commented 11 months ago

You don't need to scan for removed content.

You just make a new capture and add only the content that is still referenced. You skip the conversion scripts and preserve the original metadata.

It's possible to do a new "capture page as..." of the captured and edited page, but you lose e.g. the "Source" URL field. No idea what happens with the html, but it also contains some metadata.


BTW, I wasn't talking about "Scrapbook X". I was talking about the original "Scrapbook". It definitely removed unused files.

danny0838 commented 11 months ago

> You don't need to scan for removed content.
>
> You just make a new capture and add only the content that is still referenced. You skip the conversion scripts and preserve the original metadata.

This won't be any easier. Even "preserve metadata" and "replace all original item files" would require a large amount of code work.

You can try implementing it to see if it's true.

> It's possible to do a new "capture page as..." of the captured and edited page, but you lose e.g. the "Source" URL field. No idea what happens with the html, but it also contains some metadata.
>
> BTW, I wasn't talking about "Scrapbook X". I was talking about the original "Scrapbook". It definitely removed unused files.

ScrapBook X is derived from ScrapBook, and we've been porting new features from ScrapBook. We are quite sure that ScrapBook doesn't have such a feature. If you think I'm wrong, provide the exact ScrapBook version for further investigation.