dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

Add parser for mtgstory.com #1500

Closed Darthagnon closed 1 month ago

Darthagnon commented 2 months ago

WIP. Add parser for mtgstory.com (redirects to https://magic.wizards.com/en/story). Seems to work on most versions of the website (e.g. the current live version and the archive.org version from 2-3 years ago; untested on the archive.org version from 10 years ago). Still missing fallback support for mtglore.com.

Based on MagicWizardsParser.js v0.6 from https://github.com/Darthagnon/web2epub-tidy-script

Darthagnon commented 2 months ago

For some reason, gitignore was set to ignore additions to the parsers folder, so I have commented that line.

gamebeaker commented 2 months ago

@Darthagnon can you fix the eslint errors? (Just push to your branch; I think it should update the merge request.) https://github.com/dteviot/WebToEpub/actions/runs/10974479483/job/30473842554?pr=1500

dteviot commented 2 months ago

@Darthagnon

Replacing

findCoverImageUrl(dom) {
    // Try to find an image inside the '.swiper-slide' or inside an 'article'
    let imgElement = dom.querySelector(".swiper-slide img, article img");

    // If an image is found, return its 'src' attribute
    if (imgElement) {
        return imgElement.getAttribute("src");
    // Check if the URL starts with '//' (protocol-relative URL)
        if (imgSrc && imgSrc.startsWith("//")) {
            // Add 'https:' to the start of the URL
            imgSrc = "https:" + imgSrc;
        }
    }
    // Fallback if no image was found
    return null;
}

with

    findCoverImageUrl(dom) {
        return util.getFirstImgSrc(dom, ".swiper-slide img, article img");
    }

will fix your problems.
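
For context: the original function returns before its protocol-relative check and then refers to `imgSrc`, which is never declared, so that branch can never run. That is likely the kind of thing ESLint was flagging. If the `https:` prefixing is ever needed on its own, a minimal sketch could look like the following. `normalizeImgSrc` is a hypothetical helper name, not part of WebToEpub, and whether `util.getFirstImgSrc` already handles protocol-relative URLs is not confirmed in this thread.

```javascript
// Hypothetical helper (not part of WebToEpub): make a protocol-relative
// image URL absolute by prefixing "https:".
function normalizeImgSrc(imgSrc) {
    if (!imgSrc) {
        // No image was found, so there is nothing to normalize
        return null;
    }
    // "//host/path" has no scheme; anything else is returned unchanged
    return imgSrc.startsWith("//") ? "https:" + imgSrc : imgSrc;
}

console.log(normalizeImgSrc("//example.com/cover.jpg"));
console.log(normalizeImgSrc("https://example.com/cover.jpg"));
```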

dteviot commented 2 months ago

@Darthagnon

For some reason, gitignore was set to ignore additions to the parsers folder, so I have commented that line.

You can just remove that line.

dteviot commented 2 months ago

@Darthagnon

Commented out line 5

//parserFactory.register("mtglore.com", () => new MagicWizardsParser());

should be removed.

This

        if (authorPattern.test(href)) {
            return true;
        } else {
            return false;
        }

should be

        return authorPattern.test(href));

I'm not convinced that

if (window.location.hostname.includes("web.archive.org"))

does what you think it does.
I'm lazy. Please provide links to the two cases it should distinguish between, and I'll check for myself.
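
One possible reason for the doubt, stated here as an assumption rather than something confirmed in this thread: if the parser code runs in the extension's own page, `window.location` describes that page and not the page being converted. A minimal sketch that tests the hostname of a URL passed in explicitly instead. `isArchiveOrgUrl` is a hypothetical name and the example URLs are illustrative.

```javascript
// Hypothetical helper: decide from an explicit URL string whether a page
// is served by the Wayback Machine, instead of reading window.location.
function isArchiveOrgUrl(url) {
    try {
        return new URL(url).hostname === "web.archive.org";
    } catch (e) {
        // Not a parseable absolute URL
        return false;
    }
}

console.log(isArchiveOrgUrl("https://web.archive.org/web/20230101000000/https://magic.wizards.com/en/story"));
console.log(isArchiveOrgUrl("https://magic.wizards.com/en/story"));
```

Comparing the parsed hostname, rather than calling `includes` on the whole string, avoids matching a URL that merely mentions `web.archive.org` in its path or query.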

gamebeaker commented 2 months ago

@Darthagnon Maybe you can change .gitignore to ignore the new files created when someone runs npm install (plugin/jszip/dist/jszip.min.js and package-lock.json).

Darthagnon commented 2 months ago

This

       if (authorPattern.test(href)) {
           return true;
       } else {
           return false;
       }

should be

        return authorPattern.test(href));

This change breaks the parser, leaving it unable to pick up any chapters.
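
A note on the two forms: with balanced parentheses the one-line version behaves the same as the `if`/`else`, because `RegExp.prototype.test` already returns a boolean. The line as quoted above has one `)` too many, which is a syntax error. A minimal sketch of the equivalence, where the regular expression is a made-up stand-in for the parser's real `authorPattern` and the function names are hypothetical:

```javascript
// Stand-in pattern for illustration only; the real authorPattern is not
// shown in this thread.
const authorPattern = /\/author\//;

// Verbose form, as in the original parser code
function isAuthorLinkVerbose(href) {
    if (authorPattern.test(href)) {
        return true;
    } else {
        return false;
    }
}

// Short form: test() already yields true or false
function isAuthorLinkShort(href) {
    return authorPattern.test(href);
}

for (const href of ["/en/author/someone", "/en/story/chapter-1"]) {
    console.log(href, isAuthorLinkVerbose(href) === isAuthorLinkShort(href));
}
```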

plugin/jszip/dist/jszip.min.js was already there, but commented out; restored.

I'm not too sure what to do about the spacing/lint errors in packed.js... I have pack.js but packed.js does not exist. I haven't touched either file and don't know what tool to use to automatically fix them (maybe JSLint or NPPExec with eslint?)

Some test pages:

I believe the archive.org logic may be needed to account for slight variations in the article selectors over time, but I will keep testing.

Darthagnon commented 2 months ago

Hmmm... definitely WIP, I need to do some more work on it.

gamebeaker commented 2 months ago

@Darthagnon The spacing error message comes from npm run lint. This command packs all js files into one file, eslint/packed.js, and checks it for warnings and errors. The line number in the error message is the line in packed.js. As the normal Experimentaltab version has no errors, the errors must be in a file you created or changed. You have since fixed these errors, as the GitHub Action that runs this command reported no problems.
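
To make the line-number mismatch concrete: when files are concatenated, a line number reported against the packed file has to be translated back to a source file. The sketch below is an illustration only. It assumes plain concatenation with no extra separator lines, and `locatePackedLine`, the file names, and the line counts are made up rather than taken from WebToEpub's pack.js.

```javascript
// Illustration only: map a 1-based line number in a concatenated file back
// to the source file and line, assuming files are joined in order with no
// extra lines in between.
function locatePackedLine(files, packedLine) {
    let offset = 0; // number of packed lines that belong to earlier files
    for (const file of files) {
        if (packedLine <= offset + file.lineCount) {
            return { name: file.name, line: packedLine - offset };
        }
        offset += file.lineCount;
    }
    return null; // line number is past the end of the packed file
}

// Made-up file names and line counts
const files = [
    { name: "a.js", lineCount: 100 },
    { name: "b.js", lineCount: 50 },
];
console.log(locatePackedLine(files, 120)); // { name: 'b.js', line: 20 }
```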

gamebeaker commented 2 months ago

An easy test is to change the indentation in main.js and run npm run lint. In eslint/packed.js you can then see the problem at line 23919. Revert main.js and run npm run lint again: there are now no errors, and packed.js has also changed to reflect the revert.

gamebeaker commented 2 months ago

I guess the problem is here, from line 18 onward.

Darthagnon commented 2 months ago

Ongoing improvements mean the script now deals quite well with both the 2023-2024 version and 2014-2018 version of the website (v0.72, chapter titles now generalised and correctly selected).

dteviot commented 2 months ago

@Darthagnon

return authorPattern.test(href));

D'oh! Copy/paste mistake on my part. Should only be one closing bracket. i.e.

return authorPattern.test(href);

Darthagnon commented 2 months ago

@gamebeaker No idea where packed.js is from. Are you sure that isn't from your dev build? I only have pack.js.

I searched with Everything for pack [space] .js, which would show packed.js if it existed. I haven't touched that file; I have only added MagicWizardsParser.js and edited popup.html, nothing more.

dteviot commented 2 months ago

@Darthagnon

packed.js is created when the build runs and creates the WebToEpub extension. As you're not running the build, you won't see this file on your machine.

I think the lines with the indentation problem are these: https://github.com/dteviot/WebToEpub/blob/fd8c87f07f9b8d1fc5838be2323e3fb56936b5b1/plugin/js/parsers/MagicWizardsParser.js#L60-L62

Should be

            titleElement = link.closest("article")?.querySelector(selector) ||
                link.closest(".article-item")?.querySelector(selector) ||
                link.closest(".details")?.querySelector(selector);

The line following a line ending with a || should be indented 4 more spaces.
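
As a rough illustration of that rule (not ESLint's actual algorithm), the sketch below flags continuation lines that follow a line ending in `||` but are not indented exactly 4 spaces deeper than the line that starts the expression. `findBadContinuations` is a hypothetical name, and the sample input is made up.

```javascript
// Illustration only: report 1-based line numbers of continuation lines in a
// "||" chain that are not indented 4 spaces more than the chain's first line.
function findBadContinuations(source) {
    const lines = source.split("\n");
    const bad = [];
    let baseIndent = null; // indent of the line that started the || chain
    for (let i = 0; i < lines.length; i++) {
        const indent = lines[i].length - lines[i].trimStart().length;
        // The previous line ended with "||", so this line is a continuation
        if (baseIndent !== null && indent !== baseIndent + 4) {
            bad.push(i + 1);
        }
        if (lines[i].trimEnd().endsWith("||")) {
            if (baseIndent === null) {
                baseIndent = indent; // first line of a new chain
            }
        } else {
            baseIndent = null; // chain ended
        }
    }
    return bad;
}

const sample = "a = b ||\n   c ||\n    d;";
console.log(findBadContinuations(sample)); // [ 2 ]
```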

dteviot commented 2 months ago

@Darthagnon Give me 10 minutes, I'll run the build using your file and confirm.

dteviot commented 2 months ago

@Darthagnon

I'm wrong, @gamebeaker is correct. In my defense, it was hard to see the highlighted rows in his screenshot. The problem is lines 29 to 32 here https://github.com/dteviot/WebToEpub/blob/fd8c87f07f9b8d1fc5838be2323e3fb56936b5b1/plugin/js/parsers/MagicWizardsParser.js#L27-L33

gamebeaker commented 2 months ago

@Darthagnon Here is how you can run lint for the first time. You need npm.

https://github.com/user-attachments/assets/4f2c5a2b-8905-43eb-8a9b-e21a4643d879