Please add site https://pagestage.kakao.com/

liony0 commented 2 years ago

I tried using the Default Parser for the site, but I don't have any basic knowledge of JavaScript or HTML, so I failed.

I tried https://pagestage.kakao.com/novels/22514217?page=5

URL of first chapter: https://pagestage.kakao.com/novels/22514217/episodes/484
CSS selector for element holding content to put into EPUB: div.sc-bdnylx.eUOjDh
CSS selector for element holding Title of Chapter: h2.sc-bdnylx.djRsjE
CSS selector for element(s) to remove

It didn't work, and I don't know what to do.

Can you add the site 'https://pagestage.kakao.com/'?

Synteresis commented 2 years ago

Experimental and Incomplete Parser

```
"use strict";

parserFactory.registerUrlRule(
    // Need to allow both pagestage.kakao.com
    // and api-pagestage.kakao.com
    url => util.extractHostName(url).includes("pagestage.kakao.com"),
    () => new KakaoParser()
);

// Reimplement this entire thing using their API
// They expose all of it via GET requests
class KakaoParser extends Parser{
    constructor() {
        super();
    }

    static getChapterId(dom) {
        let a = dom.createElement('a');
        a.href = dom.baseURI;
        console.log(dom.baseURI)
        console.log(a.pathname.split("/")[4])
        return a.pathname.split("/")[4];
    }

    static getNovelId(dom) {
        let a = dom.createElement('a');
        a.href = dom.baseURI;
        console.log(dom.baseURI)
        console.log(a.pathname.split("/")[2])
        return a.pathname.split("/")[2];
    }

    static getApiUrl(novelId, index) {
        // Didn't know Javascript had ${}
        let sortOrder = (index === 1) ? "desc" : "asc";
        return "https://api-pagestage.kakao.com/novels/" + novelId + "/episodes?size=" + index + "&sort=publishedAt,id," + sortOrder;
    }

    static getChapterUrl(novelId, index) {
        // Didn't know Javascript had ${}
        return "https://pagestage.kakao.com/novels/" + novelId + "/episodes/" + index;
    }

    static getMaximumChapters(handler) {
        return Promise.resolve(handler.json["totalPages"]);
    }

    static fetchChapterJson(novelId, fetchJson) {
        return fetchJson(KakaoParser.getApiUrl(novelId, 1)).then((handler) => {
            return Promise.resolve(KakaoParser.getMaximumChapters(handler));
        }).then((maximumChapters) => {
            return fetchJson(KakaoParser.getApiUrl(novelId, maximumChapters));
        });
    }

    static parseLinkFromJson(novelId, chapterJson) {
        let index = chapterJson["id"];
        let title = chapterJson["title"];
        let url = KakaoParser.getChapterUrl(novelId, index);
        return {
            sourceUrl: url,
            title: title,
            newArc: null
        };
    }

    static getAllLinksInJson(novelId, handler) {
        let linkArray = new Set();
        return handler.json["content"].map(chapterJson => KakaoParser.parseLinkFromJson(novelId, chapterJson));
    }

    async getChapterUrls(dom) {
        // Fetch first page, then look at index for all urls
        // Fetch all chapters and parse json
        // CURRENTLY UNKNOWN WHAT ZERO CHAPTERS WOULD RETURN.
        // The json format of the website may or may not have images.
        // I do not have known books to test.
        let novelId = KakaoParser.getNovelId(dom);
        return KakaoParser.fetchChapterJson(novelId, HttpClient.fetchJson).then((handler) => {
            return KakaoParser.getAllLinksInJson(novelId, handler);
        });
    }

    async fetchChapter(url) {
        let dom = (await HttpClient.wrapFetch(url)).responseXML;
        let novelId = KakaoParser.getNovelId(dom);
        let chapterId = KakaoParser.getChapterId(dom);
        let contentUrl = "https://api-pagestage.kakao.com/novels/" + novelId + "/episodes/" + chapterId + "/body";
        let body = (await HttpClient.fetchJson(contentUrl)).json["body"];
        let bodyArray = body.split("\n");

        let cdiv = dom.createElement("div");
        cdiv.id = "WebToEpubKakaoTitle";
        cdiv.innerText = bodyArray[0];
        dom.body.append(cdiv);

        let div = dom.createElement("div");
        div.id = "WebToEpubKakaoBody";
        for(let i = 1; i < bodyArray.length; ++i) {
            if(!util.isNullOrEmpty(bodyArray[i])){
                let p = dom.createElement("p");
                p.innerText = bodyArray[i];
                div.appendChild(p);
            }
        }
        dom.body.append(div);
        return dom;
    }

    findContent(dom) {
        console.log(dom);
        return dom.getElementById("WebToEpubKakaoBody");
    }

    extractTitleImpl(dom) {
        return dom.querySelector('[property="og:title"]').getAttribute("content");
    }

    extractAuthor(dom) {
        return dom.querySelector('[property="article:author"]').getAttribute("content");
    }

    extractLanguage(dom){
        return dom.querySelector("html").getAttribute("lang");
    }

    extractSubject(dom) {
        return dom.querySelector('[name="tiara-pageMeta-category"]').getAttribute("content");
    }

    extractDescription(dom) {
        return dom.querySelector('[name="description"]').getAttribute("content");
    }
}
```

The Default Parser doesn't work on this site because JavaScript loads the chapter content after the page itself, so fetchChapter has to call the API, create a dom, and put the chapter content inside it. I don't really understand how promises work, so it was difficult to debug, to say the least. I ran out of time to finish it.

Is there somewhere on the parser documents that says what page (chapter vs toc) each function is going to affect?

liony0 commented 2 years ago

Sorry I don't have any knowledge about this. I don't know what you are looking for.

dteviot commented 2 years ago

@eeeonoo Synteresis was talking to me, not you.

dteviot commented 2 years ago

@Synteresis

> the parser documents that says what page (chapter vs toc) each function

I assume you're referring to https://github.com/dteviot/WebToEpub/blob/master/plugin/js/parsers/Template.js. In that case, sorry, no, there isn't. Can you create an issue to update the template, and maybe give a couple of examples of what you'd like to see?

Synteresis commented 2 years ago

👌 I was working on tracing each function. I will make a pull request in a few days when I have the time.

Personal Notes

Create two templates?

One with the technical aspects of each function, specifying exactly how everything works: how it gets cleaned, where it fails, how it fails, where the dom comes from, how many times the function is called, and what page it is called on. Maybe how to resolve promises and such. Maybe how to include error handling in the code so that when it fails, it doesn't fail silently and download whatever remains. Add an option under advanced options called developer options? It would let you toggle errors during the download so it fails loudly and specifies the location of the failure instead of failing silently. Provide what manually setting each tag would do.

The other can be the basics: what page each function affects and what you can do with it. You know. Basic.

dteviot commented 2 years ago

@Synteresis Looking at your first attempt.

Things like chapterJson["id"] can be written as chapterJson.id.
To get the REST URL for chapter content, you just need to convert the chapter URL like this:

url.replace("pagestage", "api-pagestage") + "/body"
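
As a quick check of that conversion, here is a minimal sketch using the first chapter URL from the original report. `toBodyApiUrl` is a hypothetical helper name, not something that exists in WebToEpub.

```javascript
// Hypothetical helper (name is an assumption): map a chapter page URL
// to the API URL that returns the chapter body as JSON.
function toBodyApiUrl(chapterUrl) {
    // replace() with a string pattern only replaces the first match,
    // which here is the "pagestage" in the host name.
    return chapterUrl.replace("pagestage", "api-pagestage") + "/body";
}

const chapterUrl = "https://pagestage.kakao.com/novels/22514217/episodes/484";
console.log(toBodyApiUrl(chapterUrl));
// → https://api-pagestage.kakao.com/novels/22514217/episodes/484/body
```

The result has the same shape as the contentUrl built by hand in the first attempt.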

Then you create an empty doc and populate it with the content.

For example:

    async fetchChapter(url) {
        // The chapter text is only available from the REST API, not the page HTML
        let jsonUrl = url.replace("pagestage", "api-pagestage") + "/body";
        let json = (await HttpClient.fetchJson(jsonUrl)).json;
        // Build a fresh document to hold the chapter
        let newDoc = Parser.makeEmptyDocForContent(url);
        let header = newDoc.dom.createElement("h1");
        newDoc.content.appendChild(header);
        header.textContent = json.title;
        // One <p> per non-empty line of the body
        let bodyArray = json.body.split("\n").filter(s => !util.isNullOrEmpty(s));
        for(let text of bodyArray) {
            let p = newDoc.dom.createElement("p");
            p.textContent = text;
            newDoc.content.appendChild(p);
        }
        return newDoc.dom;
    }

Note: I haven't tested this.
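
On the promises side, the `.then()` chain in fetchChapterJson could also be written with `async`/`await`, which is usually easier to step through. A rough sketch only, not the merged code: `fetchChapterList` and `fakeFetchJson` are hypothetical names, the stub stands in for `HttpClient.fetchJson`, and returning an empty list when `totalPages` is 0 is an assumption, since what the API returns for a novel with no chapters is still unknown.

```javascript
// Same URL shape as getApiUrl() in the first attempt, using a template literal.
function getApiUrl(novelId, size) {
    const sortOrder = (size === 1) ? "desc" : "asc";
    return `https://api-pagestage.kakao.com/novels/${novelId}/episodes?size=${size}&sort=publishedAt,id,${sortOrder}`;
}

// fetchJson is passed in so the logic can be run without the network.
async function fetchChapterList(novelId, fetchJson) {
    // With size=1 each page holds one episode, so totalPages is the episode count.
    const first = await fetchJson(getApiUrl(novelId, 1));
    const total = first.json.totalPages;
    if (total === 0) {
        // Assumption: an empty novel reports zero pages.
        return [];
    }
    // Second request asks for every episode in a single page.
    const all = await fetchJson(getApiUrl(novelId, total));
    return all.json.content.map(c => ({
        sourceUrl: `https://pagestage.kakao.com/novels/${novelId}/episodes/${c.id}`,
        title: c.title,
        newArc: null
    }));
}

// Hypothetical stub with made-up episodes; it ignores the sort parameter.
async function fakeFetchJson(url) {
    const episodes = [{ id: 484, title: "Episode 1" }, { id: 485, title: "Episode 2" }];
    const size = Number(new URL(url).searchParams.get("size"));
    return {
        json: {
            totalPages: Math.ceil(episodes.length / size),
            content: episodes.slice(0, size)
        }
    };
}

fetchChapterList("22514217", fakeFetchJson)
    .then(list => console.log(list.map(link => link.sourceUrl)));
```

Each `await` takes the place of one `.then()` callback, so an exception thrown in either request surfaces as a rejection of the single promise returned by `fetchChapterList`. Inside the parser, `HttpClient.fetchJson` would be passed in place of the stub.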

Synteresis commented 2 years ago

Oh, I didn't know there were these URL functions. Thanks, I'll get started on a branch and try it out.

Synteresis commented 2 years ago

@eeeonoo Hey, it is completed, but it will need to be merged before you can compile from source with the instructions.

liony0 commented 2 years ago

@Synteresis Thank you so much! It works perfectly.

dteviot / WebToEpub

Please add site https://pagestage.kakao.com/ #639