dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

URL parser improvements #1452

Open · Darthagnon opened this issue 2 weeks ago

Darthagnon commented 2 weeks ago

I currently use an AI-generated janky Python script to convert a list of URLs into an HTML-formatted list for use with WebToEpub: https://github.com/Darthagnon/web2epub-tidy-script

It solves the workflow problems I have with this extension, as explained in a previous issue (quoted below).

Would it be possible to adapt the URL parser to automatically do what I currently use my external script to do?
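For reference, the conversion step amounts to roughly the following sketch (JavaScript rather than the actual Python script; urlsToChapterLinks and the slug-as-title fallback are illustrative inventions, since the real script reads each page's title):

// Hypothetical sketch: turn a plain list of URLs into the <a href> list
// that WebToEpub's "Edit chapter URLs" box expects. The real script also
// fetches each page to fill in the title; here the URL's last path
// segment stands in for it.
function urlsToChapterLinks(urls) {
    return urls.map(url => {
        let slug = new URL(url).pathname.split("/").filter(s => s).pop() ?? url;
        return `<a href="${url}">${slug}</a>`;
    }).join("\n");
}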

Darthagnon commented 2 weeks ago

Originally posted by @Darthagnon in https://github.com/dteviot/WebToEpub/issues/1300#issuecomment-2100587558:

Apologies, my explanation was rather confusing.

"Edit chapter URLs" is the view I use to collect chapter lists to convert to EPUB, because

  • the Wizards website is broken/useless/missing chapters, so no auto-parser could work (EDIT: without too much work). An auto-parser would need to process https://magic.wizards.com/en/news/archive (2024), https://web.archive.org/web/2023mmddetc/https://magic.wizards.com/en/articles/archive (an unreliable infinite scroller), https://web.archive.org/web/2020mmddetc/http://www.wizards.com/Magic/Magazine/Archive.aspx (paginated, mostly 404s), and https://web.archive.org/web/2014mmddetc/http://www.wizards.com/Magic/Magazine/Article.aspx
  • a lot of chapters are not story-related, so less useful for EPUB.

Questions

  1. Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
  2. Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.
  3. Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a> - could it be changed to just take a list of URLs? e.g. instead of
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>

we could have

https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22
https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05
https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14

... and the titles read, according to the filter template, into editable fields in the chapter list:

[Screenshot: chapter list with editable title fields]

Many thanks for any advice or help!

Darthagnon commented 2 weeks ago

Concept screenshot of an improved workflow for WebToEpub: [Screenshot: improvements to web2epub]

gamebeaker commented 2 weeks ago

I think the concept isn't a bad idea. Problem: your current solution downloads all chapters twice: one download is by the Python script to extract the titles, the second is by WebToEpub for the content. If this were implemented in WebToEpub, I think a placeholder title would be needed, as the title is only known after the chapter is downloaded.

dteviot commented 2 weeks ago

@Darthagnon

Off the top of my head

Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?

Yes. See https://github.com/dteviot/WebToEpub/blob/ExperimentalTabMode/plugin/js/parsers/NoblemtlParser.js. The basic technique is, for each function (e.g. look for cover image, look for synopsis), to perform the operation both ways, and then take the first one that works.
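For illustration, that pattern might look like this minimal sketch (the class name and selectors are made up, not taken from NoblemtlParser.js; findContent and findCoverImageUrl follow the parser method names used elsewhere in this thread):

// Sketch of the "try each layout, take the first that works" technique.
class MultiSiteParser extends Parser {
    // Cover image: try site A's markup first, then site B's.
    findCoverImageUrl(dom) {
        let img = dom.querySelector(".layout-a .cover img")
            ?? dom.querySelector(".layout-b img.cover");
        return img?.src ?? null;
    }

    // Content: same idea, the first matching container wins.
    findContent(dom) {
        return dom.querySelector("div.entry-content")
            ?? dom.querySelector("div.reading-content");
    }
}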

Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.

This seems something of an edge case. Does using the auto-parser actually take much time? I would have thought you're just opening "Edit chapter URLs" and deleting everything in it. i.e. Select all, delete.

Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a> - could it be changed to just take a list of URLs?

That's not a bad idea. I think the way it would work would go something like:

  1. You can leave the title out of the hyperlink.
  2. If there's no title, WebToEpub adds the title that it finds in the chapter.

I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.
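As a hedged sketch of how that might behave (hyperlinksToChapters is a hypothetical helper, not current WebToEpub code), the pasted hyperlinks would map to {URL, Title} items, with a placeholder wherever the link text is empty:

// Parse pasted hyperlinks into {sourceUrl, title} items; an empty title
// gets a placeholder, to be replaced later by the title found in the
// downloaded chapter.
function hyperlinksToChapters(doc) {
    return [...doc.querySelectorAll("a")].map((link, i) => ({
        sourceUrl: link.href,
        title: link.textContent.trim() || `[placeholder ${i + 1}]`,
    }));
}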

Darthagnon commented 5 days ago

Problem: your current solution downloads all chapters two times. 1 download is with the python script to extract the title the second download is from WebToEpub for the content.

My current workflow is indeed two-stage, because I haven't managed to write proper parsers for the websites I use, and WebToEpub does not (yet?) extract titles from chapters. So I must first grab the titles and URLs (note: as far as I know, no chapters are downloaded, just the page titles) and then supply those to WebToEpub.

I will try to put together a parser for the multiple sites I need in one EPUB, based on NoblemtlParser.js as suggested.


This seems something of an edge case. Does using the auto-parser actually take much time? I would have thought you're just opening "Edit chapter URLs" and deleting everything in it. i.e. Select all, delete.

This is my current workflow. The auto-parser doesn't take much time; for my purposes it just serves no purpose, a part of the ritual to appease the machine spirit before actually getting to work and downloading an EPUB.

It's just that the auto-parser, by default, doesn't work with most of the websites I give it, so it would save me a few clicks and some fiddling if it were disabled by default and only enabled by user choice (or on detecting a supported URL).

You can leave the title out of the hyperlink. If there's no title, WebToEpub adds the title that it finds in the chapter. I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.

That sounds amazing, and exactly how I wish it worked: most chapters have some sort of <h1> title that can be picked up. I'm glad the suggestion has provided some interesting ideas!
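The title pickup could be as simple as this sketch (titleFromChapterDom is a hypothetical helper, not existing WebToEpub code):

// Once a chapter is downloaded, prefer its <h1>; fall back to the page title.
function titleFromChapterDom(dom) {
    let h1 = dom.querySelector("h1");
    return h1?.textContent.trim() || dom.title;
}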

Darthagnon commented 5 days ago

I have started implementing the multi-domain Wizards MtG story scraper here: https://github.com/Darthagnon/web2epub-tidy-script/blob/master/MagicWizardsParser.js (resolves #1300)

I hate myself for using AI-generated scripts, but I only know very basic JS. Initial testing looks promising: it correctly scrapes chapters from Archive.org and the live website (though titles are duplicated and author names are excluded).

dteviot commented 5 days ago

@Darthagnon

Just giving it a quick once-over.

These lines should not be needed

parserFactory.register("web.archive.org", () => new MagicWizardsParser()); // For archived versions
parserFactory.registerRule(
    (url, dom) => MagicWizardsParser.isMagicWizardsTheme(dom) * 0.7,
    () => new MagicWizardsParser()
);

WebToEpub knows about web.archive.org, and will search the rest of the URL for the original site's hostname and apply the parser registered for that.
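For illustration, the unwrapping works roughly like this sketch (originalHostname is a hypothetical helper; WebToEpub's actual handling lives in its parser factory):

// A web.archive.org URL embeds the original URL after the timestamp, so
// the original hostname can be recovered and used to pick the parser.
function originalHostname(url) {
    let match = url.match(/web\.archive\.org\/web\/\d+.*?\/(https?:\/\/.+)/);
    return new URL(match ? match[1] : url).hostname;
}

// originalHostname("https://web.archive.org/web/20230127170159/https://magic.wizards.com/en/news/magic-story")
// returns "magic.wizards.com"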

Lines 19, 30, 57, 67, and 113,

if (window.location.hostname.includes("web.archive.org")) 

should be (I think, note, not tested)

if (dom.baseURI.includes("web.archive.org")) 

This is also not needed

    // Detect if the site matches the expected structure for magic.wizards.com or the archived version
    static isMagicWizardsTheme(dom) {
        // Check if the page is archived
        if (window.location.hostname.includes("web.archive.org")) {
            // Archived page structure typically wraps the original content in #content
            return dom.querySelector("#content article") != null || dom.querySelector("#content .article-content") != null;
        }
        // Regular magic.wizards.com structure
        return dom.querySelector("article") != null || dom.querySelector(".article-content") != null;
    }

Darthagnon commented 5 days ago

Swapping if (window.location.hostname.includes("web.archive.org")) for if (dom.baseURI.includes("web.archive.org")) breaks it on Archive.org pages (e.g. https://web.archive.org/web/20230127170159/https://magic.wizards.com/en/news/magic-story), though I have applied your other changes.

dteviot commented 5 days ago

@Darthagnon

If changing window.location breaks things, something is wrong.
window.location is the URL of the page the browser is showing, which is NOT the same as the current chapter/page that WebToEpub is processing. The dom parameter passed into the calls is the page being processed, so you want to switch based on its URL.

Darthagnon commented 5 days ago

Further updates: this is the latest, v0.4. It's not perfect, but works more or less:

"use strict";

// Register the parser for magic.wizards.com (web.archive.org versions are picked up automatically)
parserFactory.register("magic.wizards.com", () => new MagicWizardsParser());

class MagicWizardsParser extends Parser {
    constructor() {
        super();
    }

    // Extract the list of chapter URLs
    async getChapterUrls(dom) {
        let chapterLinks = [];
        if (window.location.hostname.includes("web.archive.org")) {
            // For archived versions, select the correct container within #content
            chapterLinks = [...dom.querySelectorAll("#content article a, #content .article-content a")];
        } else {
            // For live pages
            chapterLinks = [...dom.querySelectorAll("article a, .article-content a")];
        }

        // Filter out author links using their URL pattern
        chapterLinks = chapterLinks.filter(link => !this.isAuthorLink(link));

        return chapterLinks.map(this.linkToChapter);
    }

    // Helper function to detect if a link is an author link
    isAuthorLink(link) {
        // Author links match the "/archive?author=" URL pattern
        return /\/archive\?author=/.test(link.href);
    }

    // Format chapter links into a standardized structure
    linkToChapter(link) {
        let title = link.textContent.trim();
        return {
            sourceUrl: link.href,
            title: title
        };
    }

    // Extract the content of the chapter
    findContent(dom) {
        if (window.location.hostname.includes("web.archive.org")) {
            // For archived pages, the content is often inside #content
            return dom.querySelector("#content article");
        } else {
            // For live pages
            return dom.querySelector(".entry-content, article, .article-content");
        }
    }

}

Known issues:

Darthagnon commented 4 days ago

Update: fixed chapter title parsing. It might be ready for prime-time.