dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

Freewebnovel Fancy Text #1015

Open SirGryphin opened 1 year ago

SirGryphin commented 1 year ago

freewebnovel.com is adding random "Fancy Text Fonts" advertisements to the EPUB. They seem to appear in a different place each time you scrape the book to EPUB. Is there any way to prevent this?

dteviot commented 1 year ago

@SirGryphin Can you provide URL and example?

SirGryphin commented 1 year ago

This is url: https://freewebnovel.com/the-genius-doctor-my-wife-is-valiant.html

It doesn't show until the book is scraped to EPUB; it looks fine on the site.

dteviot commented 1 year ago

@SirGryphin

I tried chapters 1 and 2 and don't see anything.
To save me time, can you please send me an EPUB with the issue and tell me which chapters to look at? Or, if you have the skill, unzip the EPUB and just send me the .xhtml files for a couple of chapters with the problem.

For my notes: 7 minutes work

SirGryphin commented 1 year ago

@dteviot

I did a re-scrape and it changed again. Before, it was Fancy Text Fonts in p tags with freewebnovel text; now it is Fancy Text Fonts in a subtxt tag, like this: <subtxt>𝒊𝒏𝙣𝒓𝙚𝒂𝒅.𝙘𝒐𝙢</subtxt>

Here is an XHTML file with the problem. If you open it in any editor (Notepad or Sigil), you will see the weird font. 1256_Chapter1257-_1257_1257_Facing_His_Nightmares.zip

BaconBits321 commented 1 year ago

Did you ever figure out how to fix this?

SirGryphin commented 1 year ago

I also use another Python script; I'm not sure if it's okay to post its name here. I ended up rewriting their script for the site using unicodedata. It ended up looking something like this (it's only part of the code); I'm not sure if it will work for WebToEpub.

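import unicodedata

from bs4 import BeautifulSoup, Tag

# Excerpt: both functions take `self`, so they are methods; the enclosing class is not shown.
def normalize_text(self, text: str) -> str:
    # NFKC folds styled "fancy" Unicode letters to their plain equivalents.
    return unicodedata.normalize("NFKC", text)

def select_chapter_body(self, soup: BeautifulSoup) -> Tag:
    body_tag = soup.select_one(".m-read .txt")
    if body_tag:
        # Normalize the serialized HTML, then re-parse it.
        normalized_body = self.normalize_text(str(body_tag))
        normalized_soup = BeautifulSoup(normalized_body, "html.parser")
        return normalized_soup
    return body_tag

dteviot commented 1 year ago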

@SirGryphin I'm curious how this solves the problem. It looks to me like this just normalizes the text. So, the innread stuff is still there.
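To illustrate the point, here is a minimal sketch in JavaScript (chosen only because WebToEpub is a browser extension). It assumes the first three glyphs of the sample watermark are the code points given in the comments. NFKC turns them into ordinary letters but does not remove them, so the advertisement text would still be present, just readable.

```javascript
// First three glyphs of the sample watermark, written as escapes (assumed code points):
// U+1D48A and U+1D48F are Mathematical Bold Italic "i" and "n",
// U+1D663 is Mathematical Sans-Serif Bold Italic "n".
const fancy = "\u{1D48A}\u{1D48F}\u{1D663}";

// NFKC replaces each styled letter with its plain ASCII equivalent.
const plain = fancy.normalize("NFKC");

console.log(plain);         // the same three letters, now in ASCII
console.log(plain.length);  // one UTF-16 code unit per letter after folding
```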

Note, I haven't given up on fixing this.

dteviot commented 1 year ago

Note to self: see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize for the JavaScript version of unicodedata.normalize("NFKC", text).

So, could walk all text nodes and apply normalization.
Or perhaps discard them if "bednovel" is found in the string after normalization?
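A rough sketch of that idea, not WebToEpub's actual code. The function names and the marker list are hypothetical, and a plain array of strings stands in for the text nodes so the logic can be run outside a browser.

```javascript
// Hypothetical marker list; both strings are taken from this thread.
const WATERMARK_MARKERS = ["innread", "bednovel"];

// Fold styled Unicode letters to plain ones so markers can be matched.
function foldFancyText(text) {
    return text.normalize("NFKC");
}

// True when the folded, lower-cased text contains any marker.
function isWatermark(text, markers = WATERMARK_MARKERS) {
    const folded = foldFancyText(text).toLowerCase();
    return markers.some((marker) => folded.includes(marker));
}

// Stand-in for walking text nodes: keep only strings that are not watermarks.
function dropWatermarks(texts, markers = WATERMARK_MARKERS) {
    return texts.filter((text) => !isWatermark(text, markers));
}

const sample = [
    "She opened the door.",
    "\u{1D48A}\u{1D48F}\u{1D663}read.com",  // styled "inn" followed by plain "read.com"
];
console.log(dropWatermarks(sample));
```

In a page or extension, the strings would presumably come from a text-node walk such as document.createTreeWalker(root, NodeFilter.SHOW_TEXT). Note that discarding a whole text node would also drop any real prose that shares the node with the watermark.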

swanknight commented 9 months ago

This is what I used in EpubEditor to normalize that nonsense, later I could replace them.

var p = dom.querySelectorAll('p'), i;

// NFKD-decompose each paragraph, then strip everything that is not ASCII or a curly quote.
for (i = 0; i < p.length; ++i) {
    p[i].textContent = p[i].textContent.normalize('NFKD').replace(/[^\x00-\x7F“”’]/g, '');
}

BaconBits321 commented 8 months ago

How do I do this? I have sigil.

swanknight commented 8 months ago

@BaconBits321 I use @dteviot's EpubEditor, which I got from his Google Drive.

https://drive.google.com/drive/folders/1B_X2WcsaI_eg9yA-5bHJb8VeTZGKExl8