dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

Parser issue: nepustation #89

Closed typhoon71 closed 7 years ago

typhoon71 commented 7 years ago

The parser for "http://www.nepustation.com/" grabs... garbage; I think the content is "protected" in some way or another.

If you try to grab from "http://www.nepustation.com/cheat-majutsu-de-unmei-wo-nejifuseru/" you get garbage text in the chapter; you get the same if you don't enable JS for nepustation.com while browsing.

A direct link to the page "http://www.nepustation.com/cheat-majutsu-de-unmei-wo-nejifuseru/first-episode-hero/" (the links parsed by the addon are not the right ones...)

Oh, did I forget to say there's hidden text (NEPU) too? This is mental despite being MTL...

I hope it's fixable...

dteviot commented 7 years ago

@typhoon71

The parser for "http://www.nepustation.com/" grabs... garbage; I think the content is "protected" in some way or another.

I agree, a quick look at the raw HTML received by the web browser shows the content text has been encrypted. Almost certainly JS code is used to convert it back into readable text.

A direct link to the page "http://www.nepustation.com/cheat-majutsu-de-unmei-wo-nejifuseru/first-episode-hero/" (the links parsed by the addon are not the right ones...)

That’s not quite correct. On the main page, e.g. http://www.nepustation.com/cheat-majutsu-de-unmei-wo-nejifuseru/, the links don’t point directly to the chapters; they instead point to posts by “Nepu”, and those posts point to the chapters. E.g. the link on the main page for “Chapter One: Hero” points to http://www.nepustation.com/cheat-majutsu-ch-1-eps-1/, and the link on that page points to http://www.nepustation.com/cheat-majutsu-de-unmei-wo-nejifuseru/first-episode-hero/

WebToEpub is not doing “link chasing” for the Nepustation site, as I haven’t added a parser for Nepustation. WebToEpub is selecting the WordpressBaseParser, based on analysis of the HTML’s DOM structure.
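For reference, "link chasing" here would mean something like this hypothetical helper (not WebToEpub's actual code; the .entry-content selector is a guess at the WordPress post body):

async function chaseLink(nepuPostUrl) {
    // Fetch the intermediate "Nepu" post and pull the real chapter link out of it.
    let html = await (await fetch(nepuPostUrl)).text();
    let doc = new DOMParser().parseFromString(html, "text/html");
    let link = doc.querySelector(".entry-content a");   // selector is a guess
    return (link === null) ? nepuPostUrl : link.href;
}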

FWIW, as of version 0.0.0.24, if the URL is not known, WebToEpub examines the HTML and, based on that, picks either the Wordpress, Blogspot, or Default parser.
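As a rough illustration only (not WebToEpub's actual selection code, and the Blogspot/Default class names are guesses):

function pickParserByHtml(dom) {
    // Sniff the page's generator fingerprint, the way a selection heuristic might.
    let meta = dom.querySelector("meta[name='generator']");
    let generator = (meta === null) ? "" : (meta.getAttribute("content") || "");
    if (generator.startsWith("WordPress")) {
        return new WordpressBaseParser();
    }
    if (generator.startsWith("blogger")) {
        return new BlogspotParser();   // hypothetical class name
    }
    return new DefaultParser();        // hypothetical class name
}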

Oh, did I forget to say there's hidden text (NEPU) too? This is mental despite being MTL...

Based on the anti-adblock configuration I found, I’m guessing the purpose is to force people to view the content via the site, with ads; i.e. Nepu does not want people scraping the site.

I hope it's fixable...

I’m reasonably sure I could decrypt the text, given sufficient time. Unfortunately, I’ve got better uses for my time. Also, as the site has been encrypted to prevent things like WebToEpub, doing that is:

dteviot commented 7 years ago

@typhoon71 On further investigation, the “protection” appears to be a simple substitution cypher, where each letter in the alphabet “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz” has been replaced with the matching value from “ḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳ”.

So, you can get the unscrambled text by adding the following code to the end of the WordpressBaseParser.js file:

parserFactory.register("nepustation.com", function() { return new NepustationParser() });

class CryptEngine {
    constructor() {
        // Build a table mapping each substituted Unicode letter back to plain ASCII.
        const ALPHABET     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
        const NEPUALPHABET = "ḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲ"+
                             "ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳ";
        this.decryptTable = new Map();
        for(let i = 0; i < ALPHABET.length; ++i) {
            this.decryptTable.set(NEPUALPHABET.charAt(i), ALPHABET.charAt(i));
        }
    }

    decrypt(cypherText) {
        // Characters not in the table (spaces, punctuation, etc.) pass through unchanged.
        let decryptChar = (c) => {
            let t = this.decryptTable.get(c);
            return (t === undefined) ? c : t;
        };
        return cypherText.split("").map(decryptChar).join("");
    }
}

class NepustationParser extends WordpressBaseParser {
    constructor() {
        super();
        this.cryptEngine = new CryptEngine();
    }

    customRawDomToContentStep(chapter, content) {
        // Walk every text node in the chapter content and decrypt it in place.
        let walker = document.createTreeWalker(content, NodeFilter.SHOW_TEXT);
        let node = null;
        while ((node = walker.nextNode())) {
            node.textContent = this.cryptEngine.decrypt(node.textContent);
        }
    }
}
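For example (hypothetical usage; "Ḏḉḗḗḝ" is "Hello" run through the table above):

let engine = new CryptEngine();
engine.decrypt("Ḏḉḗḗḝ");   // returns "Hello"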

Notes

typhoon71 commented 7 years ago

Nice, I'll try that JS; I suppose I can make my own nepuparser then XD.

My issue was with the scrambled content; I can remove the rest of the unwanted stuff with the Calibre editor (search and replace, easy).

On the "ethical" side of the things, I think nepu wanted ppl to read the stuff on his/her site too, and did all that for sites like Readlightnovel that parse everything... to generate traffic. That's how I see it. I did see the contact info, but it's not like I can ask an epub for every release right?

But I can make one myself ;)

typhoon71 commented 7 years ago

I noticed the latest ExperimentalTabMode branch has a Nepustation parser, but it doesn't seem to work. Am I missing something?

dteviot commented 7 years ago

@typhoon71

Am I missing something?

It's not connected up. You'd need to add a <script> entry to the popup.html file after the WordpressBaseParser entry to make it work.
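I.e. something along these lines (the src paths here are illustrative; match them to popup.html's actual layout):

<script src="js/parsers/WordpressBaseParser.js"></script>
<script src="js/parsers/NepustationParser.js"></script>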

typhoon71 commented 7 years ago

Perfect, it works now. Btw, will this be merged into the main release? Or will you keep it a "hidden" ability?

dteviot commented 7 years ago

As I explained earlier, if that parser were enabled in the plug-in by default, the plug-in would be removed from the Chrome and Firefox stores. Note that the plug-in in the stores does not contain the Nepustation parser at all. I think it's OK to have the code on GitHub, but individual users must enable it themselves.

typhoon71 commented 7 years ago

It's not that I forgot what you said before; I was just making sure, since I didn't think keeping it on GitHub would be OK. Thanks for making my life easier. ;)

fake-name commented 7 years ago

It's not a JavaScript cypher per se; it's a custom font with the codepoints reordered, so the content only displays correctly with that specific font.

KobatoChan added something similar just recently: https://kobatochan.com/everyone-else-is-a-returnee-chapter-91/ (The comments are full of some salt gold, incidentally)

I'm putting together a tool to pull the encoding out of the custom font. My concern is that if they scripted the font generation, you could wind up with a different replacement cipher per page.

fake-name commented 7 years ago

Ok, it's about 90% done (I'm temporarily distracted by DB issues - I ran out of disk space!).

https://github.com/fake-name/ReadableWebProxy/blob/master/WebMirror/processor/KobatoChanDaiSukiPreprocessor.py

Roughly, this is Python code which:

I still have to walk the HTML page and apply the symbol mapping to any marked span/div/etc., but that's easy.
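That step might look something like this (a sketch assuming BeautifulSoup, with 'mapping' being the symbol table pulled from the font; not the actual ReadableWebProxy code):

from bs4 import BeautifulSoup

def apply_font_mapping(html, mapping):
    # Replace each obfuscated character in every text node; characters
    # without an entry in the mapping pass through unchanged.
    soup = BeautifulSoup(html, "lxml")
    for text_node in soup.find_all(string=True):
        text_node.replace_with("".join(mapping.get(ch, ch) for ch in text_node))
    return str(soup)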

Things to do: Tesseract for OCR?

dteviot commented 7 years ago

@fake-name Thanks for https://github.com/fake-name/ReadableWebProxy/blob/master/WebMirror/processor/KobatoChanDaiSukiPreprocessor.py. Unfortunately, I'm not familiar with the TTF file format, so I can't comment on the details of the defont() function. (Side note: I'm going to have to learn the TTF format and how to handle it in JavaScript; http://stevehanov.ca/blog/index.php?id=143 looks like an excellent place to start.)

I did note that you're doing this sort of URL resolution in two places.

if "http://" in item['href'] or "https://" in item['href']:
    return item['href']
else:
    return urllib.parse.urljoin(self.pageUrl, item['href'])

You might want to push this into a function in the HtmlPageProcessor class.
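E.g. something along these lines (the method name is hypothetical; note that urljoin() already returns absolute hrefs unchanged, so the scheme check can be dropped entirely):

import urllib.parse

class HtmlPageProcessor:    # existing class; only the suggested helper shown
    def resolveUrl(self, item):
        # urljoin() leaves absolute ("http://...", "https://...") hrefs
        # untouched, so one call covers both branches above.
        return urllib.parse.urljoin(self.pageUrl, item['href'])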

fake-name commented 7 years ago

Yeah, it probably took me ~4 hours to write that, and ~3 hours of that were head-scratching and poking around the font library.

fake-name commented 7 years ago

Ok, the thing is complete and seems to work.

One note is that it doesn't properly decode "걣" as "A", because the process they used to generate the font file seems to have changed the glyph ordering or something (not sure why).

All the other symbols extract properly.

typhoon71 commented 7 years ago

How does this differ from the nepuparser? How does one use it? I'm not interested in KobatoChan for now, but I guess this content "lol-protecting" will become common.

fake-name commented 7 years ago

How does this differ from the nepuparser?

It's considerably more generic, ergo more robust. In particular, it derives the codepoint mappings from the served site, so if the site changes the "protection", it'll still unprotect correctly.

How does one use it?

You probably don't, but @dteviot might use it as a basis for a more robust solution to what I assume is a manually generated replacement-cipher key.

I'm not interested in KobatoChan for now, but I guess this content "lol-protecting" will become common.

The reason I posted anything in this particular bug report is because, afaict, both "protection" systems are the same: they both use the useanyfont WordPress plugin and function on a similar idea (a custom font and codepoint substitution). I think the Nepu guy provided his work to KobatoChan.

Mind you, it's still poorly done. In breaking it the way I did, I can think of a bunch of ways to make it much harder to circumvent, but I'm not going to air those publicly.

dteviot commented 7 years ago

@fake-name Firstly, let me say thank you for your time and information. It is appreciated.

Things to do: Tesseract for OCR?

I'm going to suggest that's overkill, as a large chunk of OCR is trying to convert the image into a set of strokes, and then go from the strokes to the character. The font file's g_l_y_f table already has the set of strokes, so look at the strokes to see what the letter is. E.g. a large vertical stroke and three horizontal strokes is an 'E'; a large vertical and two horizontal is an 'F'. Note, you also don't need to do an analysis of every glyph. Examining the HTML will give you the codepoints for the characters that are seen by the user. You can then use the cmap table to go from codepoint to glyph.
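In Python that lookup might be as simple as this (a sketch using the fontTools library, assuming a TrueType font with a glyf table):

from fontTools.ttLib import TTFont

def glyphs_for_text(font_path, text):
    # Map each codepoint that actually appears in the HTML to its glyph
    # via the font's cmap table; only those glyphs need stroke analysis.
    font = TTFont(font_path)
    cmap = font.getBestCmap()   # codepoint -> glyph name
    glyf = font["glyf"]         # glyph name -> outline/stroke data
    return {ch: glyf[cmap[ord(ch)]] for ch in set(text) if ord(ch) in cmap}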

You probably don't, but @dteviot might use it as a basis for a more robust solution to what I assume is a manually generated replacement-cipher key.

I'm going to refrain from doing anything until I see this in more common use. Also, at the moment, I'm trying to figure out what readlightnovel has done.

fake-name commented 7 years ago

I'm going to suggest that's overkill, as a large chunk of OCR is trying to convert the image into a set of strokes, and then go from the strokes to the character. The font file's g_l_y_f table already has the set of strokes, so look at the strokes to see what the letter is. E.g. a large vertical stroke and three horizontal strokes is an 'E'; a large vertical and two horizontal is an 'F'.

This is true, but it looks like at least one of the fonts here uses glyph outlines rather than strokes (I think?).

Anyways, it's easier from my perspective to plug in an easy font-rendering library and an OCR library than to figure out how to leave a bunch of the intermediate parts out, as at that point I'd have to implement my own partial font rendering and partial character recognition.

Note, you also don't need to do an analysis of every glyph. Examining the HTML will give you the codepoints for the characters that are seen by the user. You can then use the cmap table to go from codepoint to glyph.

I was trying to implement my thing in a blind manner, ergo deriving the mappings from the font without needing any other knowledge. I mostly did this just because it makes the code more modular.

I'm going to refrain from doing anything until I see this in more common use. Also, at the moment, I'm trying to figure out what readlightnovel has done.

My concern (and what I thought they did from the outset) was that they render a custom randomized font per page, at which point automation is really the only possible solution.

Apparently that's assuming too much competence from the people doing these things (I guess if they're smart enough to do that, they're also smart enough to realize "protecting" HTTP web-pages is retarded), so it's a simple single-key substitution cipher.

I think I probably scrape readlightnovel too, so I'll look at them later this evening. I think we probably do a bunch of similar things (I do large-scale archiving of LN/WN sites as a hobby). Maybe we should compare notes.


Edit: Is readlightnovel an aggregator, rather than a site with actual TLs? I don't bother scraping those.


Edit edit: The font parsing system I implemented can handle "unprotecting" the nepustation crap without any work. I literally just added a URL to the system that determines which sites to feed through the font-map processing filter, and it's now all fixed.

Hurrah for generic implementations?

dteviot commented 7 years ago

@fake-name

Also, at the moment, I'm trying to figure out what readlightnovel has done.

Well, this is embarrassing. It looks like it had nothing to do with the Captcha. The problem was they've changed the class name used to wrap the content. Provided you pass the Captcha, everything else works fine (due to changes in WebToEpub version 0.0.0.26 that make sure the fetch sends all current session cookies).