dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

URL parser improvements #1452

Open · Darthagnon opened this issue 2 weeks ago

Darthagnon commented 2 weeks ago

I currently use an AI-generated janky Python script to convert a list of URLs into an HTML-formatted list for use with WebToEpub: https://github.com/Darthagnon/web2epub-tidy-script

It solves the workflow problems I have with this extension, as explained in a previous issue (quoted below).

Would it be possible to adapt the URL parser to automatically do what I currently use my external script to do?
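For reference, the conversion step amounts to roughly the following sketch (JavaScript rather than the actual Python script; urlsToChapterLinks and the slug-as-title fallback are illustrative inventions, since the real script reads each page's title):

// Hypothetical sketch: turn a plain list of URLs into the <a href> list
// that WebToEpub's "Edit chapter URLs" box expects. The real script also
// fetches each page to fill in the title; here the URL's last path
// segment stands in for it.
function urlsToChapterLinks(urls) {
    return urls.map(url => {
        let slug = new URL(url).pathname.split("/").filter(s => s).pop() ?? url;
        return `<a href="${url}">${slug}</a>`;
    }).join("\n");
}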

Darthagnon commented 2 weeks ago

Originally posted by @Darthagnon in https://github.com/dteviot/WebToEpub/issues/1300#issuecomment-2100587558:

Apologies, my explanation was rather confusing.

"Edit chapter URLs" is the view I use to collect chapter lists to convert to EPUB, because

  • the Wizards website is broken/useless/missing chapters, so no auto-parser could work (EDIT: without too much work). An auto-parser would need to process https://magic.wizards.com/en/news/archive (2024), https://web.archive.org/web/2023mmddetc/https://magic.wizards.com/en/articles/archive (an unreliable infinite scroller), https://web.archive.org/web/2020mmddetc/http://www.wizards.com/Magic/Magazine/Archive.aspx (paginated, mostly 404s), and https://web.archive.org/web/2014mmddetc/http://www.wizards.com/Magic/Magazine/Article.aspx
  • a lot of chapters are not story-related, so less useful for EPUB.

Questions

  1. Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
  2. Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.
  3. Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a> - could it be changed to just take a list of URLs? e.g. instead of
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>

we could have

https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22
https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05
https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14

... and the titles read, according to the filter template, into editable fields in the chapter list:

[Screenshot: chapter list with editable title fields]

Many thanks for any advice or help!

Darthagnon commented 2 weeks ago

Concept screenshot of an improved workflow for WebToEpub: [Screenshot: improvements to web2epub]

gamebeaker commented 2 weeks ago

I think the concept isn't a bad idea. Problem: your current solution downloads all chapters twice: one download is by the Python script to extract the titles, the second is by WebToEpub for the content. If this were implemented in WebToEpub, I think a placeholder title would be needed, as the title is only known after the chapter is downloaded.

dteviot commented 2 weeks ago

@Darthagnon

Off the top of my head

Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?

Yes. See https://github.com/dteviot/WebToEpub/blob/ExperimentalTabMode/plugin/js/parsers/NoblemtlParser.js. The basic technique is, for each function (e.g. look for cover image, look for synopsis), to perform the operation both ways, and then take the first one that works.
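For illustration, that pattern might look like this minimal sketch (the class name and selectors are made up, not taken from NoblemtlParser.js; findContent and findCoverImageUrl follow the parser method names used elsewhere in this thread):

// Sketch of the "try each layout, take the first that works" technique.
class MultiSiteParser extends Parser {
    // Cover image: try site A's markup first, then site B's.
    findCoverImageUrl(dom) {
        let img = dom.querySelector(".layout-a .cover img")
            ?? dom.querySelector(".layout-b img.cover");
        return img?.src ?? null;
    }

    // Content: same idea, the first matching container wins.
    findContent(dom) {
        return dom.querySelector("div.entry-content")
            ?? dom.querySelector("div.reading-content");
    }
}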

Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.

This seems something of an edge case. Does using the auto-parser actually take much time? I would have thought you're just opening "Edit chapter URLs" and deleting everything in it. i.e. Select all, delete.

Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a> - could it be changed to just take a list of URLs?

That's not a bad idea. I think the way it would work would go something like:

  1. You can leave the title out of the hyperlink.
  2. If there's no title, WebToEpub adds the title that it finds in the chapter.

I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.
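As a hedged sketch of how that might behave (hyperlinksToChapters is a hypothetical helper, not current WebToEpub code), the pasted hyperlinks would map to {URL, Title} items, with a placeholder wherever the link text is empty:

// Parse pasted hyperlinks into {sourceUrl, title} items; an empty title
// gets a placeholder, to be replaced later by the title found in the
// downloaded chapter.
function hyperlinksToChapters(doc) {
    return [...doc.querySelectorAll("a")].map((link, i) => ({
        sourceUrl: link.href,
        title: link.textContent.trim() || `[placeholder ${i + 1}]`,
    }));
}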

Darthagnon commented 5 days ago

Problem: your current solution downloads all chapters two times. 1 download is with the python script to extract the title the second download is from WebToEpub for the content.

My current workflow is indeed two-stage, because I haven't managed to write proper parsers for the websites I use, and WebToEpub does not (yet?) extract titles from chapters. So I must first grab the titles and URLs (note: as far as I know, no chapters are downloaded, just the page titles) and then supply those to WebToEpub.

I will try to put together a parser for the multiple sites I need in one EPUB, based on NoblemtlParser.js as suggested.


This seems something of an edge case. Does using the auto-parser actually take much time? I would have thought you're just opening "Edit chapter URLs" and deleting everything in it. i.e. Select all, delete.

This is my current workflow. The auto-parser doesn't take much time; for my purposes it just serves no purpose, a part of the ritual to appease the machine spirit before actually getting to work and downloading an EPUB.

It's just that the auto-parser, by default, doesn't work with most of the websites I give it, so it would save me a few clicks and some fiddling if it were disabled by default and only enabled by user choice (or on detecting a supported URL).

You can leave the title out of the hyperlink. If there's no title, WebToEpub adds the title that it finds in the chapter. I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.

That sounds amazing, and exactly how I wish it worked: most chapters have some sort of <h1> title that can be picked up. I'm glad the suggestion has provided some interesting ideas!
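The title pickup could be as simple as this sketch (titleFromChapterDom is a hypothetical helper, not existing WebToEpub code):

// Once a chapter is downloaded, prefer its <h1>; fall back to the page title.
function titleFromChapterDom(dom) {
    let h1 = dom.querySelector("h1");
    return h1?.textContent.trim() || dom.title;
}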

Darthagnon commented 5 days ago

I have started implementing the multi-domain Wizards MtG story scraper here: https://github.com/Darthagnon/web2epub-tidy-script/blob/master/MagicWizardsParser.js (resolves #1300)

I hate myself for using AI-generated scripts, but I only know very basic JS. Initial testing looks promising: it correctly scrapes chapters from Archive.org and the live website (though titles are duplicated and author names are excluded).

dteviot commented 5 days ago

@Darthagnon

Just giving it a quick once-over.

These lines should not be needed

parserFactory.register("web.archive.org", () => new MagicWizardsParser()); // For archived versions
parserFactory.registerRule(
    (url, dom) => MagicWizardsParser.isMagicWizardsTheme(dom) * 0.7,
    () => new MagicWizardsParser()
);

WebToEpub knows about web.archive.org, and will search the rest of the URL for the original site's hostname and apply the parser registered for that.
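For illustration, the unwrapping works roughly like this sketch (originalHostname is a hypothetical helper; WebToEpub's actual handling lives in its parser factory):

// A web.archive.org URL embeds the original URL after the timestamp, so
// the original hostname can be recovered and used to pick the parser.
function originalHostname(url) {
    let match = url.match(/web\.archive\.org\/web\/\d+.*?\/(https?:\/\/.+)/);
    return new URL(match ? match[1] : url).hostname;
}

// originalHostname("https://web.archive.org/web/20230127170159/https://magic.wizards.com/en/news/magic-story")
// returns "magic.wizards.com"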

Lines 19, 30, 57, 67, and 113,

if (window.location.hostname.includes("web.archive.org")) 

should be (I think, note, not tested)

if (dom.baseURI.includes("web.archive.org")) 

This is also not needed

    // Detect if the site matches the expected structure for magic.wizards.com or the archived version
    static isMagicWizardsTheme(dom) {
        // Check if the page is archived
        if (window.location.hostname.includes("web.archive.org")) {
            // Archived page structure typically wraps the original content in #content
            return dom.querySelector("#content article") != null || dom.querySelector("#content .article-content") != null;
        }
        // Regular magic.wizards.com structure
        return dom.querySelector("article") != null || dom.querySelector(".article-content") != null;
    }

Darthagnon commented 5 days ago

Swapping if (window.location.hostname.includes("web.archive.org")) for if (dom.baseURI.includes("web.archive.org")) breaks it on Archive.org pages (e.g. https://web.archive.org/web/20230127170159/https://magic.wizards.com/en/news/magic-story), though I have applied your other changes.

dteviot commented 5 days ago

@Darthagnon

If changing window.location breaks things, something is wrong.
window.location is the URL of the page the browser is showing, which is NOT the same as the current chapter/page that WebToEpub is processing. The dom parameter passed into the calls is the page being processed, so you want to switch based on its URL.

Darthagnon commented 5 days ago

Further updates: this is the latest, v0.4. It's not perfect, but works more or less:

"use strict";

// Register the parser for magic.wizards.com (web.archive.org versions are picked up automatically)
parserFactory.register("magic.wizards.com", () => new MagicWizardsParser());

class MagicWizardsParser extends Parser {
    constructor() {
        super();
    }

    // Extract the list of chapter URLs
    async getChapterUrls(dom) {
        let chapterLinks = [];
        if (window.location.hostname.includes("web.archive.org")) {
            // For archived versions, select the correct container within #content
            chapterLinks = [...dom.querySelectorAll("#content article a, #content .article-content a")];
        } else {
            // For live pages
            chapterLinks = [...dom.querySelectorAll("article a, .article-content a")];
        }

        // Filter out author links using their URL pattern
        chapterLinks = chapterLinks.filter(link => !this.isAuthorLink(link));

        return chapterLinks.map(this.linkToChapter);
    }

    // Helper function to detect if a link is an author link
    isAuthorLink(link) {
        // Author links match the "/archive?author=" URL pattern
        return /\/archive\?author=/.test(link.href);
    }

    // Format chapter links into a standardized structure
    linkToChapter(link) {
        let title = link.textContent.trim();
        return {
            sourceUrl: link.href,
            title: title
        };
    }

    // Extract the content of the chapter
    findContent(dom) {
        if (window.location.hostname.includes("web.archive.org")) {
            // For archived pages, the content is often inside #content
            return dom.querySelector("#content article");
        } else {
            // For live pages
            return dom.querySelector(".entry-content, article, .article-content");
        }
    }

}

Known issues:

Darthagnon commented 4 days ago

Update: fixed chapter title parsing. It might be ready for prime-time.