Originally posted by @Darthagnon in https://github.com/dteviot/WebToEpub/issues/1300#issuecomment-2100587558:
Apologies, my explanation was rather confusing.
"Edit chapter URLs" is the view I use to collect chapter lists to convert to EPUB, because
- the Wizards website is broken/useless/missing chapters, so there is no auto-parser that could work (EDIT: without too much work). An auto-parser would need to process:
  - https://magic.wizards.com/en/news/archive (2024)
  - https://web.archive.org/web/2023mmddetc/https://magic.wizards.com/en/articles/archive (unreliable infinite scroller)
  - https://web.archive.org/web/2020mmddetc/http://www.wizards.com/Magic/Magazine/Archive.aspx (paginated, mostly 404s)
  - https://web.archive.org/web/2014mmddetc/http://www.wizards.com/Magic/Magazine/Article.aspx
- a lot of chapters are not story-related, so less useful for EPUB.
Questions
- Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
- Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? With the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well; it gathers lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.
- Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g.
<a href="">Title here</a>
- could it be changed to just take a list of URLs? e.g. instead of
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>
we could have
https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22
https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05
https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14
... with the titles read, according to the filter template, into editable fields in the chapter list:
Many thanks for any advice or help!
Concept screenshot of improved workflow for WebToEpub:
I think the concept isn't a bad idea. Problem: your current solution downloads all chapters twice. One download is by the Python script to extract the title; the second is by WebToEpub for the content. If this were implemented in WebToEpub, I think a placeholder title would be needed, as the title is only known after the chapter is downloaded.
@Darthagnon
Off the top of my head:
Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
Yes. See https://github.com/dteviot/WebToEpub/blob/ExperimentalTabMode/plugin/js/parsers/NoblemtlParser.js. The basic technique is, for each function (e.g. find the cover image, find the synopsis), to perform the operation both ways and then take the first result that works.
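Roughly, the pattern looks like this (a sketch only; the selectors below are invented for illustration, not taken from NoblemtlParser.js):
// Sketch: try each site's markup in turn and take the first match.
function findContentAcrossLayouts(dom) {
    return dom.querySelector("#content article")    // layout of site A
        ?? dom.querySelector(".article-content")    // layout of site B
        ?? dom.querySelector("article");            // generic fallback
}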
Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? With the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well; it gathers lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.
This seems something of an edge case. Does using the auto-parser actually take much time? I would have thought you're just opening "Edit chapter URLs" and deleting everything in it. i.e. Select all, delete.
Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a> - could it be changed to just take a list of URLs?
That's not a bad idea. I should clarify that formatting as hyperlinks would still be required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.
Problem: your current solution downloads all chapters twice. One download is by the Python script to extract the title; the second is by WebToEpub for the content.
My current workflow is indeed two-stage, because I haven't managed to write proper parsers for the websites I use, and WebToEpub does not (yet?) extract titles from chapters. So I must first grab the titles and URLs (note: as far as I know, no chapter content is downloaded, just the page titles) and then supply those to WebToEpub.
I will try to put together a parser for the multiple sites I need in one EPUB, based on NoblemtlParser.js as suggested.
This seems something of an edge case. Does using the auto-parser actually take much time? I would have thought you're just opening "Edit chapter URLs" and deleting everything in it. i.e. Select all, delete.
This is my current workflow. The auto-parser doesn't take much time; it just serves no purpose for my use case, and is part of the ritual to appease the machine spirit before actually getting to work and downloading an EPUB.
It's just that the auto-parser, by default, doesn't work with most arbitrary websites, so it would save me a few clicks and some fiddling if it were disabled by default and only enabled by user choice (or on detecting a supported URL).
You can leave the title out of the hyperlink. If there's no title, WebToEpub adds the title that it finds in the chapter. I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.
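Sketching it out (the function and field names here are illustrative assumptions, not WebToEpub's actual internals):
// Sketch: build {sourceUrl, title} items from pasted hyperlinks, leaving
// the title empty when the link text is blank so it can be filled in
// from the chapter's own heading after the chapter is downloaded.
function linksToChapters(links) {
    return [...links].map(link => ({
        sourceUrl: link.href,
        title: link.textContent.trim() || null,  // null = placeholder, resolved later
    }));
}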
That sounds amazing, and exactly how I wish it worked; most chapters have some sort of <h1> title that can be picked. I'm glad the suggestion has provided some interesting ideas!
I have started the implementation of the multi-domain Wizards MtG story scraper here: https://github.com/Darthagnon/web2epub-tidy-script/blob/master/MagicWizardsParser.js (resolves #1300)
I hate myself for using AI-generated scripts, but I know only very basic JS. Initial testing looks promising; it correctly scrapes chapters from Archive.org and the live website (though titles are duplicated and author names are excluded).
@Darthagnon
Just giving it a quick once-over.
These lines should not be needed
parserFactory.register("web.archive.org", () => new MagicWizardsParser()); // For archived versions
parserFactory.registerRule(
(url, dom) => MagicWizardsParser.isMagicWizardsTheme(dom) * 0.7,
() => new MagicWizardsParser()
);
WebToEpub knows about web.archive.org, and will search the rest of the URL for the original site's hostname and apply the parser registered for that.
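Conceptually, something like this (a sketch of the idea, not WebToEpub's actual code):
// Sketch: recover the original hostname from a Wayback Machine URL such as
// https://web.archive.org/web/20230127170159/https://magic.wizards.com/...
function originalHostname(url) {
    let match = url.match(/web\.archive\.org\/web\/[^\/]+\/(https?:\/\/.+)/);
    return new URL(match ? match[1] : url).hostname;
}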
Lines 19, 30, 57, 67, and 113,
if (window.location.hostname.includes("web.archive.org"))
should be (I think; note: not tested)
if (dom.baseURI.includes("web.archive.org"))
This is also not needed:
// Detect if the site matches the expected structure for magic.wizards.com or the archived version
static isMagicWizardsTheme(dom) {
    // Check if the page is archived
    if (window.location.hostname.includes("web.archive.org")) {
        // Archived page structure typically wraps the original content in #content
        return dom.querySelector("#content article") != null || dom.querySelector("#content .article-content") != null;
    }
    // Regular magic.wizards.com structure
    return dom.querySelector("article") != null || dom.querySelector(".article-content") != null;
}
Swapping
if (window.location.hostname.includes("web.archive.org"))
to
if (dom.baseURI.includes("web.archive.org"))
breaks it for Archive.org pages (e.g. https://web.archive.org/web/20230127170159/https://magic.wizards.com/en/news/magic-story), though I have applied your other changes.
@Darthagnon
If changing window.location breaks things, something is wrong.
window.location is the URL of the page the browser is showing, which is NOT the same as the current chapter/page that WebToEpub is processing.
The dom parameter passed into the calls is the page being processed, so you want to switch based on its URL.
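To illustrate the distinction (an untested sketch):
// window.location is the URL of the page shown in the browser tab the
// extension was opened on; dom.baseURI is the URL of the document that
// WebToEpub fetched and handed to the parser. Per-chapter logic should
// therefore test dom.baseURI:
function isArchivedPage(dom) {
    return dom.baseURI.includes("web.archive.org");
}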
Further updates: this is the latest, v0.4. It's not perfect, but more or less works:
"use strict";
// Register the parser for magic.wizards.com and archive versions
parserFactory.register("magic.wizards.com", () => new MagicWizardsParser());
class MagicWizardsParser extends Parser {
constructor() {
super();
}
// Extract the list of chapter URLs
async getChapterUrls(dom) {
let chapterLinks = [];
if (window.location.hostname.includes("web.archive.org")) {
// For archived versions, select the correct container within #content
chapterLinks = [...dom.querySelectorAll("#content article a, #content .article-content a")];
} else {
// For live pages
chapterLinks = [...dom.querySelectorAll("article a, .article-content a")];
}
// Filter out author links using their URL pattern
chapterLinks = chapterLinks.filter(link => !this.isAuthorLink(link));
return chapterLinks.map(this.linkToChapter);
}
// Helper function to detect if a link is an author link
isAuthorLink(link) {
const href = link.href;
const authorPattern = /\/archive\?author=/;
// Check if the link matches the author URL pattern or CSS selector
if (authorPattern.test(href)) {
return true;
} else {
return false;
}
}
// Format chapter links into a standardized structure
linkToChapter(link) {
let title = link.textContent.trim();
return {
sourceUrl: link.href,
title: title
};
}
// Extract the content of the chapter
findContent(dom) {
if (window.location.hostname.includes("web.archive.org")) {
// For archived pages, the content is often inside #content
return dom.querySelector("#content article");
} else {
// For live pages
return dom.querySelector(".entry-content, article, .article-content");
}
}
}
<a href="https://web.archive.org/web/20230127170159mp_/https://magic.wizards.com/en/news/magic-story/alone"><h3>Phyrexia: All Will Be One | Alone</h3></a>
<article><div><h3>The Call</h3></div><a href="https://magic.wizards.com/en/news/magic-story/call-2015-04-15"></a></article>
)Update: fixed chapter title parsing. It might be ready for prime-time.
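Roughly, the title extraction has to check both places; a sketch of the idea (not necessarily the exact code I ended up with):
// Archived links wrap the title <h3> inside the <a>, while live links
// keep the <h3> in a sibling <div> within the same <article>, so check
// both places before falling back to the link text.
function linkToChapter(link) {
    let heading = link.querySelector("h3")
        ?? link.closest("article")?.querySelector("h3");
    return {
        sourceUrl: link.href,
        title: (heading ?? link).textContent.trim()
    };
}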
@Darthagnon Test versions for Firefox and Chrome have been uploaded to https://github.com/dteviot/WebToEpub/releases/tag/developer-build. Pick the one suitable for you, follow the "How to install from Source (for people who are not developers)" instructions at https://github.com/dteviot/WebToEpub/tree/ExperimentalTabMode#user-content-how-to-install-from-source-for-people-who-are-not-developers and let me know how it goes.
@Darthagnon
Updated version (1.0.1.0) has been submitted to the Firefox and Chrome stores. The Firefox version is available now. Chrome might be available in a few hours (typical) to 21 days.
I currently use an AI-generated janky Python script to convert a list of URLs into an HTML-formatted list for use with WebToEpub: https://github.com/Darthagnon/web2epub-tidy-script
It works to solve the workflow problems I have with this extension, as explained in a previous issue (quoted below).
Would it be possible to adapt the URL parser to automatically do what I currently use my external script to do?