Add parser for mtgstory.com

Darthagnon commented 2 months ago

WIP. Add parser for mtgstory.com (redirects to https://magic.wizards.com/en/story). Seems to work on most versions of the website (e.g. the current live version and the archive.org version from 2-3 years ago; untested on the archive.org version from 10 years ago). Still missing fallback support for mtglore.com.

Based on MagicWizardsParser.js v0.6 from https://github.com/Darthagnon/web2epub-tidy-script

Darthagnon commented 2 months ago

For some reason, gitignore was set to ignore additions to the parsers folder, so I have commented out that line.

gamebeaker commented 2 months ago

@Darthagnon can you fix the eslint errors? (Just push to your branch; I think the merge request should update automatically.) https://github.com/dteviot/WebToEpub/actions/runs/10974479483/job/30473842554?pr=1500

dteviot commented 2 months ago

@Darthagnon

Replacing

findCoverImageUrl(dom) {
    // Try to find an image inside the '.swiper-slide' or inside an 'article'
    let imgElement = dom.querySelector(".swiper-slide img, article img");

    // If an image is found, return its 'src' attribute
    if (imgElement) {
        return imgElement.getAttribute("src");
    // Check if the URL starts with '//' (protocol-relative URL)
        if (imgSrc && imgSrc.startsWith("//")) {
            // Add 'https:' to the start of the URL
            imgSrc = "https:" + imgSrc;
        }
    }
    // Fallback if no image was found
    return null;
}

with

    findCoverImageUrl(dom) {
        return util.getFirstImgSrc(dom, ".swiper-slide img, article img");
    }

will fix your problems.
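
For illustration: in the original version the protocol-relative check sits after the return statement, so it can never run. Below is a minimal standalone sketch of the behaviour that check was aiming for. The name resolveImgSrc and the example URLs are hypothetical, made up for this sketch; it makes no claim about how util.getFirstImgSrc is implemented.

```javascript
// Hypothetical helper (not part of WebToEpub): resolve an <img> src
// against the URL of the page it came from. The standard URL constructor
// handles protocol-relative ("//host/path") and root-relative ("/path") values.
function resolveImgSrc(src, pageUrl) {
    if (!src) {
        // No image found: keep the "return null" fallback of the original.
        return null;
    }
    return new URL(src, pageUrl).href;
}

// Example values are assumptions chosen for the demonstration.
const pageUrl = "https://magic.wizards.com/en/story";
console.log(resolveImgSrc("//media.example.com/cover.jpg", pageUrl));
console.log(resolveImgSrc("/images/cover.jpg", pageUrl));
console.log(resolveImgSrc(null, pageUrl));
```

Resolving against the page URL covers the "//" case without a hand-written string check, and it also handles relative paths.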

dteviot commented 2 months ago

@Darthagnon

For some reason, gitignore was set to ignore additions to the parsers folder, so I have commented out that line.

You can just remove that line.

dteviot commented 2 months ago

@Darthagnon

Commented out line 5

//parserFactory.register("mtglore.com", () => new MagicWizardsParser());

should be removed.

This

        if (authorPattern.test(href)) {
            return true;
        } else {
            return false;
        }

should be

        return authorPattern.test(href));

I'm not convinced that

if (window.location.hostname.includes("web.archive.org"))

does what you think it does.
I'm lazy. Please provide links to the two cases it should distinguish between, and I'll check for myself.
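
One possible reason for the doubt (an assumption, not something confirmed in this thread): if the parser code runs inside the extension's own page, window.location describes that page rather than the story page being parsed. A sketch of an alternative is to inspect the URL of the page being parsed instead. The helper names below are hypothetical, and the path pattern assumes the usual web.archive.org/web/<timestamp>/<original URL> layout.

```javascript
// Hypothetical helpers, not part of WebToEpub.
// Assumed Wayback Machine layout: /web/<digits><optional modifier>/<original URL>
const WAYBACK_PATH = /^\/web\/(\d+)[a-z_]*\/(.+)$/;

// True when the given page URL (not window.location) is an archive.org snapshot.
function isWaybackUrl(url) {
    return new URL(url).hostname === "web.archive.org";
}

// Return the original URL wrapped in a snapshot URL, or the input unchanged.
function unwrapWaybackUrl(url) {
    const parsed = new URL(url);
    if (parsed.hostname !== "web.archive.org") {
        return url;
    }
    // Include the query string so URLs like Article.aspx?x=... survive.
    const match = WAYBACK_PATH.exec(parsed.pathname + parsed.search);
    return match ? match[2] : url;
}

const live = "https://magic.wizards.com/en/story";
const archived = "https://web.archive.org/web/20160412030018/https://magic.wizards.com/en/articles/archive/uncharted-realms/blood-will-have-blood-2014-06-04";
console.log(isWaybackUrl(live), isWaybackUrl(archived));
console.log(unwrapWaybackUrl(archived));
```

In a parser, the URL passed in would be whatever URL the chapter or table-of-contents page was fetched from.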

gamebeaker commented 2 months ago

@Darthagnon Maybe you can change .gitignore to ignore the new files that appear when someone runs npm install (plugin/jszip/dist/jszip.min.js and package-lock.json).

Darthagnon commented 2 months ago

This

       if (authorPattern.test(href)) {
           return true;
       } else {
           return false;
       }

should be

        return authorPattern.test(href));

This change breaks the parser, leaving it unable to pick up any chapters.

plugin/jszip/dist/jszip.min.js was already there, but commented out; restored.

I'm not too sure what to do about the spacing/lint errors in packed.js... I have pack.js but packed.js does not exist. I haven't touched either file and don't know what tool to use to automatically fix them (maybe JSLint or NPPExec with eslint?)

Some test pages:

Live site https://magic.wizards.com/en/story#story-archive (select any of the stories in the timeline carousel other than the default most recent)
Old site https://web.archive.org/web/20160411073205/http://magic.wizards.com/en/articles/columns/magic-story-archive and https://web.archive.org/web/20160412030018/https://magic.wizards.com/en/articles/archive/uncharted-realms/blood-will-have-blood-2014-06-04
Very old site https://web.archive.org/web/20140302084755/http://www.wizards.com/Magic/Magazine/Article.aspx?x=mtg/daily/ur/263
Archive site/redirects https://mtglore.com

I believe the archive.org logic may be needed to account for slight variations in the article selectors over time, but I will keep testing.

Darthagnon commented 2 months ago

Hmmm... definitely WIP, I need to do some more work on it.

gamebeaker commented 2 months ago

@Darthagnon The spacing error message comes from npm run lint. This command packages all js files into one file, eslint/packed.js, and evaluates it, searching for warnings and errors. The line number in the error message is the line in packed.js. As the normal Experimentaltab version has no errors, the errors must be in a file you created or changed. You have since fixed these errors, as the GitHub action which runs this command had no problems.

gamebeaker commented 2 months ago

An easy test is to change the indentation in main.js and run npm run lint; in eslint/packed.js you can then see the problem at line 23919. Revert main.js and run npm run lint again: there are now no errors, and packed.js has also changed to reflect the revert.
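
To illustrate why a lint error reported against packed.js points at a line number that does not exist in any single source file, here is a rough sketch of concatenating files while recording where each one starts. This is not the project's pack.js; packFiles, locate and the sample file contents are hypothetical, written only to show the idea.

```javascript
// Hypothetical sketch, NOT the real pack.js: join source files into one
// text and remember the first packed line of each file.
function packFiles(files) {
    const index = [];
    const lines = [];
    for (const { name, content } of files) {
        const fileLines = content.split("\n");
        index.push({ name, firstLine: lines.length + 1, lineCount: fileLines.length });
        lines.push(...fileLines);
    }
    return { packed: lines.join("\n"), index };
}

// Map a 1-based line number in the packed text back to file and line.
function locate(index, packedLine) {
    for (const entry of index) {
        const offset = packedLine - entry.firstLine;
        if (offset >= 0 && offset < entry.lineCount) {
            return { name: entry.name, line: offset + 1 };
        }
    }
    return null; // line number is outside the packed text
}

// Sample inputs are made up for the demonstration.
const { index } = packFiles([
    { name: "main.js", content: "line a1\nline a2" },
    { name: "MagicWizardsParser.js", content: "line b1\nline b2\nline b3" }
]);
console.log(locate(index, 4));
```

Without such an index, the practical route is the one described above: open eslint/packed.js at the reported line and search for that code in the source files.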

gamebeaker commented 2 months ago

I guess the problem is here, line 18++

Darthagnon commented 2 months ago

Ongoing improvements mean the script now deals quite well with both the 2023-2024 version and 2014-2018 version of the website (v0.72, chapter titles now generalised and correctly selected).

dteviot commented 2 months ago

@Darthagnon

return authorPattern.test(href));

D'oh! Copy/paste mistake on my part. Should only be one closing bracket. i.e.

return authorPattern.test(href);
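
As a quick check that the one-line form behaves the same as the if/else it replaces, here is a small sketch. The pattern shown is a hypothetical stand-in, since the real authorPattern does not appear in this thread, and the function names are made up for the comparison.

```javascript
// Hypothetical stand-in for the parser's authorPattern.
// No "g" or "y" flag, so test() keeps no state between calls.
const authorPattern = /\/en\/authors\//;

// Verbose form from the original parser.
function isAuthorLinkVerbose(href) {
    if (authorPattern.test(href)) {
        return true;
    } else {
        return false;
    }
}

// Simplified form: test() already returns a boolean.
function isAuthorLink(href) {
    return authorPattern.test(href);
}

// Sample hrefs are assumptions for the demonstration.
const samples = [
    "https://magic.wizards.com/en/authors/some-author",
    "https://magic.wizards.com/en/story"
];
for (const href of samples) {
    console.log(href, isAuthorLinkVerbose(href) === isAuthorLink(href));
}
```

Both forms return the same value for every input, so the earlier breakage is consistent with the extra closing bracket being a syntax error rather than a change in logic.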

Darthagnon commented 2 months ago

@gamebeaker No idea where packed.js is from; are you sure that isn't your dev build? I only have pack.js.

I ran an Everything search for "pack .js" (pack, space, .js), which would show packed.js if it existed. And I haven't touched that file; I have only added MagicWizardsParser.js and edited popup.html, nothing more.

dteviot commented 2 months ago

@Darthagnon

packed.js is created when the build runs and creates the WebToEpub extension. As you're not running the build, you won't see this file on your machine.

I think the lines with the indentation problem are these: https://github.com/dteviot/WebToEpub/blob/fd8c87f07f9b8d1fc5838be2323e3fb56936b5b1/plugin/js/parsers/MagicWizardsParser.js#L60-L62

Should be

            titleElement = link.closest("article")?.querySelector(selector) ||
                link.closest(".article-item")?.querySelector(selector) ||
                link.closest(".details")?.querySelector(selector);

The line following a line ending with a || should be indented 4 more spaces.

dteviot commented 2 months ago

@Darthagnon Give me 10 minutes, I'll run the build using your file and confirm.

dteviot commented 2 months ago

@Darthagnon

I'm wrong, @gamebeaker is correct. In my defense, it was hard to see the highlighted rows in his screenshot. The problem is lines 29 to 32 here https://github.com/dteviot/WebToEpub/blob/fd8c87f07f9b8d1fc5838be2323e3fb56936b5b1/plugin/js/parsers/MagicWizardsParser.js#L27-L33

gamebeaker commented 2 months ago

@Darthagnon Here is how you can run lint for the first time; you need npm.

https://github.com/user-attachments/assets/4f2c5a2b-8905-43eb-8a9b-e21a4643d879

dteviot / WebToEpub

Add parser for mtgstory.com #1500