dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

Add Option to Settings to Suppress Hidden Elements #1268

Open Nitrousoxide opened 7 months ago

Nitrousoxide commented 7 months ago

Is your feature request related to a problem? Please describe. More sites are using hidden text watermarks in an attempt to poison LLM harvesting. While some site parsers have been updated, a general setting to suppress hidden elements like these would be helpful in two cases: sites whose parser is 4 years old (example which suffers from a hidden watermark), and sites where a user has to make their own parser because the plugin can't identify how to handle the page (like this one; it doesn't use a watermark, but a site that does, or that uses hidden elements that hurt readability, would be applicable).

Describe the solution you'd like A checkbox in the settings to suppress hidden elements from being rendered into the EPUB.

Describe alternatives you've considered Updating all the parsers as sites implement watermarks is a potential option, and it probably should still be done even if this option is implemented. But it wouldn't protect sites like azaleaellis's above example, which have no parser at all and require a user-defined one.

dteviot commented 7 months ago

Unfortunately, it's not that simple. The way the watermarking is done differs from site to site. Conceptually, the usual way is that a watermark is marked with some sort of tag, and there is JavaScript that hides or removes the marked element(s) when the page is viewed normally. However, as WebToEpub does not run the JavaScript, the watermark elements remain.

And the tagging differs from site to site. For example, https://re-library.com/ seems to tag the watermarks with a number of different classes, although it looks like "code-block" is the key element.

Anyway, I've provided EpubEditor https://github.com/dteviot/EpubEditor/issues/4 to make it pretty easy to fix up the epubs after they are collected.

For example, to fix this for re:library above, the following script can be used:

    // Remove every watermark block tagged with the "code-block" class.
    for (const p of [...dom.querySelectorAll("div.code-block")]) {
        p.remove();
    }
    return true;

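Since a site may tag its watermarks with more than one class, the same idea can be generalised to a list of selectors. This is only a minimal sketch: `removeBySelectors` is a hypothetical helper, not something EpubEditor provides, and `div.code-block` is the only selector taken from the script above.

```javascript
// Hypothetical helper (an assumption, not an EpubEditor API):
// remove every element that matches any of the given CSS selectors.
// `dom` is any object that supports querySelectorAll(), such as a Document.
function removeBySelectors(dom, selectors) {
    let removed = 0;
    for (const selector of selectors) {
        // Copy the result into an array so removal cannot disturb iteration.
        for (const element of [...dom.querySelectorAll(selector)]) {
            element.remove();
            ++removed;
        }
    }
    // Number of remove() calls made, useful for checking the selectors hit anything.
    return removed;
}
```

Assuming the script environment allows a nested function, the script body would then call `removeBySelectors(dom, ["div.code-block"])` and still end with `return true;`. Any additional selectors would have to be found by inspecting the site's pages.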
Nitrousoxide commented 7 months ago

Oh for sure, there are a multitude of ways one could try to hide watermarking, and I do think updating the parsers or editing the finished product would give the best tailored response. But a checkbox to block or remove commonly known watermarking techniques might be a good addition to WebToEpub, with the note that it's a generic solution and may not work for every hidden element.
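As an illustration of what such a generic option could cover, here is a minimal sketch that drops elements hidden by their own markup: the `hidden` attribute, or an inline `display: none` or `visibility: hidden` style. The function names are hypothetical and none of this is WebToEpub's actual code. It would not catch watermarks that a site hides through scripts or external stylesheets, such as the `div.code-block` case above, so it could only ever be a partial fix.

```javascript
// Hypothetical helper (an assumption, not WebToEpub code).
// Returns true when an element is hidden by its own markup:
// the `hidden` attribute, or an inline display:none / visibility:hidden style.
function isHiddenByMarkup(element) {
    if (element.hasAttribute("hidden")) {
        return true;
    }
    const style = (element.getAttribute("style") || "").toLowerCase();
    return /(^|;)\s*display\s*:\s*none\b/.test(style)
        || /(^|;)\s*visibility\s*:\s*hidden\b/.test(style);
}

// Removes every element under `dom` that isHiddenByMarkup() flags.
// Returns the number of elements removed.
function removeMarkupHiddenElements(dom) {
    let removed = 0;
    // Only elements carrying a `hidden` or `style` attribute can qualify.
    for (const element of [...dom.querySelectorAll("[hidden], [style]")]) {
        if (isHiddenByMarkup(element)) {
            element.remove();
            ++removed;
        }
    }
    return removed;
}
```

A settings checkbox could gate a call like `removeMarkupHiddenElements(dom)` before the chapter content is packed, with site-specific cases still left to the parsers or to EpubEditor.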

Totally understand if you think this is out of scope though, so if so please feel free to close.

kevin01523 commented 7 months ago

There's an option to remove tags on an unknown site. I forgot how to access it, but I did use it to remove ads, lol, or the comment section, etc.

It's probably easy to add for the most common watermarks or hidden elements, leaving the unusual ones for EpubEditor to work on.