Open Nitrousoxide opened 7 months ago
Unfortunately, it's not that simple. The way the watermarking is done differs from site to site. Conceptually, the usual approach is that each watermark element is marked with some sort of tag, and JavaScript on the page hides or removes the marked element(s) during normal viewing. However, because WebToEpub does not run that JavaScript, the watermark elements remain.
The tagging also differs from site to site. For example, https://re-library.com/ seems to tag the watermarks with a number of different classes, although it looks like "code-block" is the key one.
Anyway, I've provided EpubEditor https://github.com/dteviot/EpubEditor/issues/4 to make it pretty easy to fix up the epubs after they are collected.
For example, to fix this for the re-library site above, the following script can be used:
    // Remove every watermark container (re-library tags them with "code-block").
    for (let p of [...dom.querySelectorAll("div.code-block")]) {
        p.remove();
    }
    return true;
Oh for sure, there are a multitude of ways one could try to hide watermarking, and I do think updating the parsers or editing the finished product would give the best tailored response. But a checkbox to block or remove commonly known watermarking techniques might be a good addition to WebToEpub, with a note that it's a generic solution and may not work for every hidden element.
Totally understand if you think this is out of scope though, so if so please feel free to close.
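To make the "generic checkbox" idea a bit more concrete, here is a minimal sketch of what such a filter might look like. It is only an assumption about one possible approach, not anything WebToEpub actually does: the names isStaticallyHidden and removeStaticallyHidden are hypothetical, and the sketch only catches elements hidden by the hidden attribute or by an inline display:none / visibility:hidden style. Anything hidden through a stylesheet or by script (as appears to be the case on re-library) would not be caught.

```javascript
// Hypothetical helper (not part of WebToEpub): reports whether an element
// is hidden by its own markup, without evaluating scripts or stylesheets.
function isStaticallyHidden(element) {
    if (element.hasAttribute("hidden")) {
        return true;
    }
    // Only the inline style attribute is inspected (an assumption of this sketch).
    const style = (element.getAttribute("style") || "").toLowerCase();
    return /(^|;)\s*display\s*:\s*none\b/.test(style)
        || /(^|;)\s*visibility\s*:\s*hidden\b/.test(style);
}

// Hypothetical helper: removes every statically hidden element under `dom`
// and returns how many elements were removed (nested ones are counted too).
function removeStaticallyHidden(dom) {
    let removed = 0;
    for (const element of [...dom.querySelectorAll("[hidden], [style]")]) {
        if (isStaticallyHidden(element)) {
            element.remove();
            ++removed;
        }
    }
    return removed;
}
```

In an EpubEditor-style script this could presumably be called as removeStaticallyHidden(dom) before returning true, though that usage is likewise an assumption.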
There's an option to remove tags on an unknown site. I forgot how to access it, but I did use it to remove ads, comment sections, etc. lol
It's probably easy to add this for the most common watermarks or hidden elements, and leave the unusual ones for EpubEditor to work on.
Is your feature request related to a problem? Please describe. More sites are using hidden text watermarks in an attempt to poison LLM harvesting. While some site parsers have been updated, a general setting to suppress hidden elements like these would be helpful for sites with a parser that's 4 years old (example which suffers from a hidden watermark), or when a user makes their own parser because the plugin can't identify how to handle the page (like this one; it doesn't use a watermark, but a site that does, or that makes use of hidden elements that hurt readability, would be applicable).
Describe the solution you'd like A checkbox in the settings to suppress hidden elements from being rendered into the epub.
Describe alternatives you've considered Updating all the parsers as sites implement watermarks is a potential option, and it probably should still be done even if this option is implemented. But it wouldn't help with sites like the azaleaellis example above, which have no parser at all and require a user-defined one.