dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

here is about novelbin.com that puts hidden tags in their texts #1446

Closed hesoyamma785 closed 1 month ago

hesoyamma785 commented 2 months ago

Describe the bug: This is about novelbin.com, which puts hidden tags in its text. Look at these pictures: 333 111 222

To Reproduce Steps to reproduce the behavior:

  1. Go to '...' https://novelbin.com/b/civil-servant-in-romance-fantasy#tab-chapters-title

  2. https://lightnovel.novelupdates.net/book/civil-servant-in-romance-fantasy/cchapter-5-i-was-dispatched-2

  3. Click on '....'

  4. Scroll down to '....'

  5. See error

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

Additional context Add any other context about the problem here.

Kiradien commented 2 months ago

So, starting off with a slight correction: The placement of that text on sites like NovelBin is randomized. This is taken from Chrome Devtools for the chapter you were looking at above.

image image

So it's always hidden somewhere, but it's different on each generation of the page.

There are ways to clean the code of stuff like this, but these sites are constantly changing their own design to get around these work-arounds. I've personally always just edited out the BS and watermarks after generation is complete. Heck, it's currently as easy as theoretically running $("span#span").remove(); before generation.

You'll notice that even in the string they use special characters to avoid general discovery: DiisCoover 𝒖pdated novels on n(o)v./e/lbin(.)co𝒎 - 𝒎 instead of m, etc. I believe Calibre can handle this kind of batch editing, but I'm not 100% sure; I've never really used it. I generally use my C# project to whip up fixes when I need them, but I don't currently have it tuned for novelbin.
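The obfuscation described above can be partly undone with Unicode normalization. This is a minimal sketch, assuming the look-alike letters are compatibility characters (which NFKD folds back to plain letters) and that the inserted punctuation can simply be dropped. `deobfuscate` is a hypothetical helper, not part of WebToEpub or Calibre.

```javascript
// Hypothetical helper: fold look-alike letters to plain ones and drop the
// punctuation that was inserted to defeat plain-text search.
function deobfuscate(text) {
    return text
        .normalize("NFKD")                  // e.g. U+1D48E becomes "m"
        .replace(/[^\p{L}\p{N}\s]/gu, "")   // keep letters, digits, whitespace
        .toLowerCase();
}

// U+1D48E is MATHEMATICAL BOLD ITALIC SMALL M, used here in place of "m".
const sample = "n(o)v./e/lbin(.)co\u{1D48E}";
console.log(deobfuscate(sample).includes("novelbin"));
```

This is only suitable for detecting a watermark. Stripping punctuation and lowercasing would damage real prose, so it should not be applied to the chapter text itself.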

I'm hoping someone else can provide better details on how to remove it with existing tools.

dteviot commented 2 months ago

@hesoyamma785 @Kiradien I'm thinking of adding code to the "EpubMerge" tool to clean this up. Basic idea:

dteviot commented 2 months ago

Notes:

dteviot commented 2 months ago

Looking at the actual HTML from site, the watermark is embedded in the content. However there's also a script to remove it. Something like

const original11Content = $(this).html();
const updated11Content = original11Content.replace("Visitt nov𝒆lbin(.)c𝒐/m for the l𝒂test updates", `<span id="span">Visitt nov𝒆lbin(.)c𝒐/m for the l𝒂test updates</span>`);

hesoyamma785 commented 2 months ago

I have a question: when I got the novel, these tags come as text, as you can see in the pictures I sent, not as on the site, where there is something like <span id="span">. Is this for this site only, or for all the sites that don't have this <span id="span"> thing?

dteviot commented 2 months ago

@hesoyamma785

I'm having trouble understanding what you've written. So, I'll try and answer based on what I think you're asking.

  1. I'm referring to the novelbin site, not any other.
  2. The raw HTML for a page does NOT have the <span id="span"> element. Just the "naked" watermark text.
  3. However, there is a <script> element in the HTML that converts the "raw" watermark text into a <span id="span"> element when the page is viewed.
  4. WebToEpub doesn't view the page, so the watermark text remains embedded in the content that WebToEpub packages into an epub.
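
The four points above can be illustrated with a small simulation. This is a sketch only: the watermark string is a placeholder, and `simulateSiteScript` is a hypothetical stand-in for the site's inline script, not its actual code.

```javascript
// Placeholder watermark; the real site uses obfuscated text.
const WATERMARK = "Visit the site for the latest updates";

// What WebToEpub fetches: the watermark is plain text inside the content.
const rawHtml = `<p>He opened the door. ${WATERMARK}</p>`;

// Rough stand-in for the site's inline script, which only runs in a browser.
function simulateSiteScript(html, watermark) {
    // A replacer function avoids "$" in the watermark being read as a pattern.
    return html.replace(watermark, () => `<span id="span">${watermark}</span>`);
}

// True when the markup contains the span that the site's stylesheet hides.
const hasWatermarkSpan = (html) => html.includes('<span id="span">');

const viewedHtml = simulateSiteScript(rawHtml, WATERMARK);
console.log(hasWatermarkSpan(rawHtml), hasWatermarkSpan(viewedHtml));
```

Because the raw HTML never contains the span, a selector such as `span#span` finds nothing unless the script has run first.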

hesoyamma785 commented 2 months ago

Yes, it's as you said.

hesoyamma785 commented 2 months ago

I have another question: if I block the script using "uBlock Origin", will the watermark also be removed from the epub file, or is that pointless?

dteviot commented 2 months ago

@hesoyamma785

It's pointless, because WebToEpub does not run the script in the first place. That's why you see the "watermark" in the epub.

FWIW, I don't think you can block the script with uBlock Origin, because the script is within the HTML page itself. You'd need something like NoScript. In that case, you'd see the watermark in the text if you viewed the site's chapters with a browser, assuming the site works with scripts disabled.

dteviot commented 2 months ago

Results so far. It's not hard to find the line of text with the NovelBin "watermark". The following seems to find nearly all of them:

    // Flags a text node whose text is changed by NFKD normalization and contains "(".
    const text = node.data;
    return (text.normalize("NFKD") !== text) && text.includes("(");

But finding where the wanted text ends and the watermark begins is proving to be much more difficult. I'm thinking I might need to have WebToEpub scan the <script> elements for the watermark text, and then use that to know what to remove. Which might be getting a bit too close to Google's rules.
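
The script-scanning idea could look roughly like the sketch below. It assumes the inline script has the shape quoted earlier in this thread (`.replace("<watermark>", ` followed by a template literal starting with `<span id="span">`). Both function names are hypothetical, and this is not WebToEpub's actual code.

```javascript
// Hypothetical helper: read the watermark literal out of an inline script
// shaped like:  x.replace("<watermark>", `<span id="span">...</span>`)
function findWatermarkInScript(scriptText) {
    const match = scriptText.match(/\.replace\("([^"]+)",\s*`<span id="span">/);
    return match === null ? null : match[1];
}

// Remove every occurrence of the watermark from a chunk of chapter text.
function stripWatermark(content, watermark) {
    if (!watermark) {
        return content; // nothing found in the script, leave content alone
    }
    return content.split(watermark).join("");
}

// Placeholder script and chapter text, for illustration only.
const script = 'const u = o.replace("Read more at example", `<span id="span">Read more at example</span>`);';
const chapter = "She nodded. Read more at example";
console.log(stripWatermark(chapter, findWatermarkInScript(script)));
```

Taking the exact string from the script sidesteps the problem of guessing where the wanted text ends, at the cost of depending on the script keeping this shape.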

Time spent: 130 minutes (so far).

dteviot commented 1 month ago

@hesoyamma785

The watermarking seems to have stopped. The "original11Content.replace(" JavaScript is still there, as is the stylesheet element that hides the resulting <span>, but the watermark text was empty.

I just tried:

Going to put this on hold. Please let me know if you see it again.

Time spent: 166 minutes (so far).

hesoyamma785 commented 1 month ago

I think they changed their method or something like that, because if you look at the picture below (image), the tag is there: https://novelbin.com/b/cultivation-online-novel https://novelbjn.phieuvu.com/book/cultivation-online-novel/chapter-1596-primal-expanse

In the inspector, I set the span's visibility to 1 so that it would be shown, and unlike the previous situation there isn't a <span tag in the inspector.

dteviot commented 1 month ago

@hesoyamma785

OK, I've just made a change. WebToEpub should now push the watermark into a <span> element with an id of "span", just like the site's javascript does when viewing in a browser. WebToEpub also marks the <span> as hidden, so MOST epub viewers should not show the element. (There's a few that don't know the hidden attribute.)

If that's a problem, you can use EpubEditor to remove these <span> elements. Script not supplied, but https://github.com/dteviot/EpubEditor/issues/4 should provide enough information to do it yourself.
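
As a rough illustration of the two steps described above (wrapping the watermark in a hidden span, then optionally removing such spans afterwards), here is a string-based sketch. It is not the code in WebToEpub or EpubEditor. The function names are hypothetical, and the `hidden="hidden"` spelling is an assumption made because EPUB content is XHTML.

```javascript
// Wrap the first occurrence of the watermark in a hidden span, roughly
// mirroring the site's script plus the "hidden" attribute.
function hideWatermark(html, watermark) {
    return html.replace(
        watermark,
        () => `<span id="span" hidden="hidden">${watermark}</span>`
    );
}

// For viewers that ignore "hidden": drop the wrapped spans entirely.
// Regex-based, so it assumes the span contains no nested <span>.
function removeWatermarkSpans(xhtml) {
    return xhtml.replace(/<span id="span"[^>]*>[\s\S]*?<\/span>/g, "");
}

// Placeholder content and watermark, for illustration only.
const hidden = hideWatermark("<p>He left. Read more at example</p>", "Read more at example");
console.log(hidden);
console.log(removeWatermarkSpans(hidden));
```

A real cleanup tool would more likely parse the XHTML and remove matching elements from the DOM. The regex version is used here only to keep the sketch self-contained.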

Note, WebToEpub probably won't handle the case where there is more than one different watermark on a page. (2nd and later watermarks will not be removed.) But since that's a rare case, I don't have an example of it to examine and figure out how to handle.

Test versions for Firefox and Chrome have been uploaded to https://github.com/dteviot/WebToEpub/releases/tag/developer-build. Pick the one suitable for you, follow the "How to install from Source (for people who are not developers)" instructions at https://github.com/dteviot/WebToEpub/tree/ExperimentalTabMode#user-content-how-to-install-from-source-for-people-who-are-not-developers and let me know how it goes. Tested with:

Notes

Time taken: 296 minutes (running total)

hesoyamma785 commented 1 month ago

Thanks for your help and hard work ☺️

dteviot commented 1 month ago

Reopening, so I know to notify you when the Chrome and Firefox stores are updated.

dteviot commented 1 month ago

@hesoyamma785 Updated version (1.0.0.0) has been submitted to Firefox and Chrome stores. Firefox version is available now. Chrome might be available in a few hours to 21 days.