dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.
Other
676 stars 132 forks source link

[imgur] grab images from imgur #99

Closed typhoon71 closed 7 years ago

typhoon71 commented 7 years ago

I'm pretty sure there was talk about this but I can't seem to find it any more so I made a separate issue for sake of visibility.

An example would be "https://crimsonmagic.me/2016/11/16/gifting-10-1/" [FIXED LINK], where there are various image -links- like "http://imgur.com/K4CZyyP.jpg" that don't get grabbed (or shown).

There's also the more complex case of linked imgur galleries like on "http://skythewoodtl.com/fanfic-gifting/" [ADDED LINK], where you have a link to an imgur gallery with all the illustration before the novel ("Images" at the end).

toshiya44 commented 7 years ago

wrong link, i think

dteviot commented 7 years ago

@typhoon71 I can't see any imgur links at https://github.com/dteviot/ WebToEpub/issues/new. Did you post the correct link? The Japtem site has imgur images and sideshows. The existing parser is able to handle links to individual images, but I have not yet spent time trying to figure out how to parse a slideshow. i.e. Parser has no problem with this page http://japtem.com/projects/dd-toc/v1-illustrations/, with exception of the slideshow, which is just duplicates the other images on the page.

Please provide link to page that has problem.

On Mon, Dec 26, 2016 at 8:37 PM, typhoon71 notifications@github.com wrote:

I'm pretty sure there was talk about this but I can't seem to find it any more so I made a separate issue for sake of visibility.

An example would be "https://github.com/dteviot/WebToEpub/issues/new", where there are various image -links- like "http://i.imgur.com/AAFDx1B.jpg" that don't get grabbed (or shown).

There's also the more complex case of linked imgur galleries, where you have a link to an imgur gallery with all the illustration before the novel.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/dteviot/WebToEpub/issues/99, or mute the thread https://github.com/notifications/unsubscribe-auth/AE6w2ZPMmLaLreiPZbR0btTEujIbllP_ks5rL27HgaJpZM4LVnpT .

typhoon71 commented 7 years ago

How did I put "https://github.com/dteviot/" as a link is beyond me; obv it's wrong, sorry. I fixed and updated my post.

dteviot commented 7 years ago

@typhoon71 Thanks for the corrected links

<h6>&lt;TL Note: <a href="http://imgur.com/K4CZyyP.jpg">Insert image</a> &gt;</h6>

This is a hyperlink with no image tag. So parser leaves it alone. In this particular case, I think it's a note for someone to add the image tag. Creating a parser to analyse "bare" hyperlinks and figure out what to do with them in the general case is "very hard and error prone". At this time, I'm not going to attempt to handle this.

"http://skythewoodtl.com/fanfic-gifting/" is a more interesting. At current time, the plugin analyses the first page given and selects the parser based on that. To solve this problem, the plugin needs to check each page and select the parser on a page by page basis. This may require a significant re-write of the plugin. i.e. the text page links need to be processed with the WordpressBase parser, while the imgur links should be processed with an imgur parser. (Which I need to create.) That said, I probably need to do this anyway, as this is the 4th case of a site with links in different formats.)

typhoon71 commented 7 years ago

1) You could make the bare links parsing optional and off, by default (an advanced option), so part of the choices would be on "user side". I understand it can be time consuming, so I won't try to force you (I couldn't anyway XD), I'll go with manual editing then.

2) This is yet another case of skythewoodtl being one of a kind, providing new test cases for this plugin; I bet they don't even know! XD

typhoon71 commented 7 years ago

Just an idea about the bare imgur links.

Since right now they get grabbed as <a href="http://imgur.com/ ... .jpg/png">, shouldn't it be possible to save them as <img src="http://imgur.com/ ... .jpg/png"> instead?

This way I could use the calibre editor ability to download external resources (automatically). OK, I'll need to create the svg block by hand, but the images would be already saved in the epub and linked, so it would cut down some work.

Would this be possible (with some if-then cycle) or it would require too much work? Just asking.

dteviot commented 7 years ago

Yes, it could be done. You'd need to modify ImageCollector.js to:

  1. Scan for <a> tags as well as <img> tags,
  2. Check that these <a> tags are not already wrapping an <img> tag.
  3. Inspect the href attribute of the <a> tag to see if it points to imgur and is an image file
  4. If all are above are correct, replace the hyperlink with a image tag.

However, this isn't a general case fix, it's a specific case one. And in the above example, you'd probably want to remove the h6 tag and the translator note.

Also, it won't fix Imgur links to galleries. (Note, the ZirusmusingParser does handle Imgur galleries. But it needs more work, as the HTML doesn't always include all images in the gallery.

typhoon71 commented 7 years ago

mmm, then I suppose it's faster replacing the tags in calibre. Makes little sense to have you put time on this specific fix which probably will have to be reworked in the future, when you'll work on the imgur parser (I'm positive in the future... in the future... XD).

dteviot commented 7 years ago

@typhoon71 While fixing the general case is difficult, If I treat them as special cases, it's easy to create custom parsers for handling them.

As usual, changes have been checked into Experimental branch. Please have a look to see how it's done.

typhoon71 commented 7 years ago

Thanks! A lot! Nice new year present for me.

1) CrimsonMagicParser is working fine, imgur links are grabbed beautifully.

2) SkythewoodtlParser grabbing imgur galleries fine (adding "?grid" as you said).

3) on "http://skythewoodtl.com/g84/" there are imgur links like "https://i.imgur.com/SvtVqwZ.jpg": is it enough to copy the code from CrimsonMagicParser to SkythewoodtlParser? Or would it be better to put that it into the ImgurParser (so it handles more cases)?

4) Maybe there will be the need to check with those parsers for duplicate images too?

dteviot commented 7 years ago

@typhoon71

Nice new year present for me.

I like to think it's a late Christmas present. Hopefully it will save you some time.

is it enough to copy the code from CrimsonMagicParser to SkythewoodtlParser? Or would it be better to put that it into the ImgurParser (so it handles more cases)?

Good question. Probably the simplest fix is to add the Skythewoodtl logic into CrimsonMagicParser. I think the steps are as follows. (Note, I'm currently on holiday, so can't easily test):

  1. In popup.html, remove the <script> tag that includes SkythewoodtlParser.js
  2. In CrimsonMagicParser, add this line
    parserFactory.register("skythewoodtl.com", function() { return new CrimsonMagicParser() });

    after the line

    parserFactory.register("crimsonmagic.me", function() { return new CrimsonMagicParser() });
  3. In CrimsonMagicParser, change the function findContent() to
    findContent(dom) {
        if (ImgurParser.isImgurGallery(dom)) {
            return ImgurParser.convertGalleryToConventionalForm(dom);
        } 
        let content = super.findContent(dom);
        if (content != null) {
            let that = this;
            let toReplace = util.getElements(content, "a", that.isHyperlinkToReplace);
            for(let hyperlink of toReplace) {
                that.replaceHyperlinkWithImg(hyperlink);
            }
        }
        return content;
    }

Maybe there will be the need to check with those parsers for duplicate images too?

Are you talking about removing images from the galleries that are also included in the chapter text? If so, I've been thinking about that. Solution might be to have a node.js script that does custom processing of epubs after they've been created.

dteviot commented 7 years ago

@typhoon71 Experimental Tab Branch updated as described above.

typhoon71 commented 7 years ago

First thing, thanks a lot. ;) Now the not so good stuff I found out / noticed in my new year tests...

Here's what happens:

1) The image complete gallery link is gone from the fetched links: the previous version would save a "Image" chapter at the end after the epilogue, with all the images based on it. It's the link that needed the ?grid to be added at the end to to have it grabbed. Is this intended? If I add the link manually and add the ?grid it works as it did before. EDIT: if I select the Blogspot parser they do appear.

2) Some images are not fetched, like http://imgur.com/DZYuHnc, http://imgur.com/DZYuHnc; they are present in the global gallery but they have their own single link, so they should get fetched.

3) At the start of the prologue or inside a chapter (volume 9 for example) there's a link to an imgur gallery: this is not processed. If the link is the same as the general gallery then it's fine, one would delete it from the epub. Did you notice it? Now it gives an error (expected image got html).

typhoon71 commented 7 years ago

Argh! I must move that button.

dteviot commented 7 years ago

@typhoon71 Will look into the issues. Which page has the http://imgur.com/DZYuHnc link?

typhoon71 commented 7 years ago

You will find http://imgur.com/DZYuHnc in http://skythewoodtl.com/g93/.

All the images that exibit that behaviour are in volume 9, located at the end of http://skythewoodtl.com/fanfic-gifting/.

dteviot commented 7 years ago

@typhoon71

The image complete gallery link is gone from the fetched links

D'oh! I combined both the "turn imgur link into an image" and "fetch imgur link as gallery logic". So what's happening is: when the first page is loaded and scanned to get the set of links, prior to the scan for links, the gallery links are turned into <img> tags so the scan doesn't find them. The quick fix (now added to Experimental Tab Mode) is change the "turn imgur link into image" logic so it converts only single image links, not NOT gallery links to <img> tags. i.e. Needs to examine the URL more carefully, and only convert jpg, gif, etc. links. Obviously, the long term fix will require

At the start of the prologue or inside a chapter (volume 9 for example) there's a link to an imgur gallery: this is not processed

Actually, it IS processed, but it's turned into an an <img> tag, which the image collector runs thinking it's a single image. The image collector then gets the HTML for a gallery and doesn't know how to handle it. (Which is why you get the error message.) When the "turn imgur link into image" is modified as described in the above step, then the gallery link isn't converted and you don't get the error. Of course, you don't get the gallery either, but then the existing code only fetches a gallery if the gallery link is in the "chapter URLs". (Because to get all the images to put in the gallery the logic needs to fetch the HTML from the image gallery and parse it.) The page parsing logic is currently not able to fetch additional pages. Obviously, the long term fix (lots of work) will require updating parser engine to

dteviot commented 7 years ago

@typhoon71

All the images that exibit that behaviour are in volume 9, located at the end of http://skythewoodtl.com/fanfic-gifting/.

Hmmm.

You know, I'm thinking it might be useful to send the owner of these sites a note about the links and how to fix them. emails are skythewood@gmail.com and re.yun.NS@gmail.com. (Probably easier and faster than waiting for me to update the parser.)

typhoon71 commented 7 years ago

Will check the dirty fix and report later. Thanks in advance.

typhoon71 commented 7 years ago

Not a (late) report actually, I had no time to pack epubs lately (so I didn't read stuff too, sniffle).

I found that this imgur album gives an error while trying to fetch it. Link: http://imgur.com/a/f7Ezg (with or without ?grid)

I don't know if it's caused by the Amazon issue that was around yesterday, but while imgur now works again I can't fetch this gallery.

Note that I was able to fetch it in the past: I am redoing a couple of epubs and noticed it.

Oh, I will try to do what I said in my 2 month old post just above... hopefully.

dteviot commented 7 years ago

@typhoon71 It's working fine for me. What do you get if you try to look at the page directly in your web browser?

typhoon71 commented 7 years ago

If I open the page directly I can see the gallery normally (a page with a lot of images), but I can't grab it with webtoepub from Skythewood. Other imgur links on Skythewood are fine, it's just this one.

This is the error:

Could not find content element for web page 'http://imgur.com/a/f7Ezg?grid'. Error: Could not find content element for web page 'http://imgur.com/a/f7Ezg?grid'. at chrome-extension://pdjgnlommckobgkiialndbmakmkebdjp/js/Parser.js:265:27

I cant think of anything that can cause this, just that the images seems to be different from what I remember (they weren't 2 pages scan before). Oh, I did clear the cache just in case, but no joy.

dteviot commented 7 years ago

@typhoon71 That's really weird. I'm going to need to examine the HTML that you're getting sent when the parser requests the URL http://imgur.com/a/f7Ezg?grid Fortunately this is not difficult. Steps. (Assuming you're running Chrome.)

  1. Go to starting web page and open WebToEpub.
  2. Set up to download the problem URL.
  3. Before clicking "Pack EPUB" Right click on the WebToEpub Tab/Dialog (depending on your config) and select "inspect" from the drop down menu.
  4. On the Developer Tools window that appears, select "Network" from the Menu bar.
  5. Go back to the WebToEpub Tab/Dialog and click "Pack EPUB".
  6. On the Developer Tools window, you will see the URLs being fetched.
  7. Click on http://imgur.com/a/f7Ezg?grid on the list on the left, then click on the "Response" tab on right window pane. This should show the html that was received. Copy and paste that into an empty file and email to me please.
typhoon71 commented 7 years ago

You got mail...

dteviot commented 7 years ago

Thanks. Mystery solved. At first glance, looks like Imgur has changed the JSON holding the image information. So existing code doesn't recognise it. Will need to examine more carefully when I get home, but I think will be simple enough to fix. Only question is why are you getting this new layout, and I'm getting old one? At a guess, I'm getting an older cached version. You're seeing the new layout.

typhoon71 commented 7 years ago

Good to know it wasn't me having cache issues for one time... XD Jokes aside, it means I'm getting old layout on the other pages?!?

dteviot commented 7 years ago

Well, this is embarassing. Looks like there's no change in the imgur JSON layout. I was just looking at the wrong part of the file. However, I did find a bug in the code used to decide "does the web page came from Imgur?" (Which would result in the imgur parser not being used.) Fix has been committed to the ExperimentalTabBranch as usual.

The faulty code assumed that imgur used "imgur.com" or "i.imgur.com" as the only hostnames. I now see that imgur has a lot more. (at minimum, s.imgur.com, m.imgur.com, api.imgur.com). So if the URL you gave the parser had one of these other names, well, it would fail. I do note the URL you quote (http://imgur.com/a/f7Ezg?grid) had an expected hostname, but I'm hoping you simply typed that in, and that's why it's correct.

If the new version doesn't work, there's two things we can try. First Option Send me the skythewood URL you're using that has the http://imgur.com/a/f7Ezg link. Possibly the problem is in that parser. Sudden thought. which SkyTheWood site are you using? skythewoodtl.com is parsed by the CrimsonMagicParser which knows how to handle imgur links. (It tells the ImgurParser to handle them.) skythewood.blogspot.com is parsed by the BlogSpot parser. It does not know how to handle imgur links.

Second option, in the ImgurParser.js file, replace the findImagesList() function with this version.

    static findImagesList(dom) {
        // Ugly hack, need to find the list of images as image links are created dynamically in HTML.
        // Obviously this will break each time imgur change their scripts.
        console.log("Searching for imgur images from " + dom.baseURI);
        for(let script of util.getElements(dom, "script")) {
            let text = script.innerHTML;
            let index = text.indexOf("\"images\":[{\"hash\"");
            if (index !== -1) {
                console.log("Found start of images in JSON");
                text = text.substring(index + 9);
                let endIndex = text.indexOf("}]");
                if (endIndex !== -1) {
                    console.log("Found end of images in JSON");
                    return JSON.parse(text.substring(0, endIndex + 2));
                } else {    
                    console.log("Unable to find end of images in JSON");
                    return;
                }
            }
        }
        console.log("Unable to find start of images in JSON");
    }

It adds logging to say what's going wrong. So, if you open the console (similar to what you did before, except click on "console" tab instead of "Network") before clicking on "Pack EPUB" we should get logging like:

This logging should help me figure out where the problem is.

typhoon71 commented 7 years ago

Updated, checked, it doesn't work.

1) The link to Skythewood is "http://skythewood.blogspot.it/p/youjo-senki.html", so could the domain being "skythewood.blogspot.it" be the cause?

2) I then modified "ImgurParser.js" as described, but the output of the console tab is empty. The plugin throws this message:

Could not find content element for web page 'http://imgur.com/a/f7Ezg?grid'. Error: Could not find content element for web page 'http://imgur.com/a/f7Ezg?grid'. at chrome-extension://pdjgnlommckobgkiialndbmakmkebdjp/js/Parser.js:265:27 at <anonymous>

dteviot commented 7 years ago

@typhoon71

The link to Skythewood is "http://skythewood.blogspot.it/p/youjo-senki.html", so could the domain being "skythewood.blogspot.it" be the cause?

That's the problem. It's a blogspot site, so the blogspot parser is used. (Weird thing about Blogspot, depending on where you are in the world, it changes the host. If I try using the URL you gave (http://skythewood.blogspot.it/p/youjo-senki.html) it sends me to "http://skythewood.blogspot.co.nz/p/youjo-senki.html".) So, the blogspot parser looks for a host name with ".blogspot." anywhere in it. But I digress.

You can do a quick hack to the Blogspot parser to get it to support Imgur. Replace the findContent() function in BlogspotParser.js with this

    findContent(dom) {
        if (ImgurParser.isImgurGallery(dom)) {
            return ImgurParser.convertGalleryToConventionalForm(dom);
        }
        let content = BlogspotParser.FindContentElement(dom);
        if (content == null) {
            content = util.getElement(dom, "div", e => e.className.startsWith("entry-content"));
        }
        return content;
    }

i.e. add this to the start of the function.

        if (ImgurParser.isImgurGallery(dom)) {
            return ImgurParser.convertGalleryToConventionalForm(dom);
        }
typhoon71 commented 7 years ago

OK, I changed the code in BlogspotParser.js; the ImgurParser.js has updated code too. Result: it works! Thanks a lot!

On a side note, I noticed that if I don't use the "?grid" at the end of the imgur link... it works anyway! I only checked on "http://imgur.com/a/f7Ezg", but it's interesting, will check on more stuff later.

dteviot commented 7 years ago

@typhoon71

I noticed that if I don't use the "?grid" at the end of the imgur link... it works anyway!

It depends on the number of images in the gallery. If you add ?grid, Imgur supplies all the images. If you leave it off, it only supplies the first 16 or 20. (I don't remember which.)
http://imgur.com/a/f7Ezg has 17 images.

typhoon71 commented 7 years ago

Thanks for the explanation; since all 17 images are grabbed, it should be the first 20 images. Can't wait for this to be merged. Btw, it's a shame mozilla so slow lately, they're behind way too much.

dteviot commented 7 years ago

@typhoon71 You might like to get the latest ExperimentalTabBranch. I've committed the fix for the Blogspot parser and also fixed these issues as well in the CrimsonMagicParser. (i.e. it's now replacing links to imgur galleries with links to the actual images.)

typhoon71 commented 7 years ago

Last branch seems fine, also it's nice to have it to automatically add "?grid".

I did find an issue here "https://bakapervert.wordpress.com/vol-19/": both the Illustration links link to a page with a imgur link inside it; the thing is that even if I edit manually that imgur link into the Chapter list it doesn't fetch it. I can't seem to just get those 2 imgur links/galleries alone either.

dteviot commented 7 years ago

@typhoon71

I did find an issue here "https://bakapervert.wordpress.com/vol-19/"

That's because only the CrimsonMagicParser currently has handling for Imgur pages/links. And that parser only recognises https://crimsonmagic.me/ and http://skythewoodtl.com/. For https://bakapervert.wordpress.com/vol-19/, the WordpressBaseParser is used. I think the long term fix is to update the ImageCollector to handle imgur gallery links. but a short term fix is register a binding for the bakapervert site to the CrimsonMagicParser. i.e. Add this line to the top of CrimsonMagicParser.js, where the other parserFactory.register lines are

parserFactory.register("bakapervert.wordpress.com", function() { return new CrimsonMagicParser() });
typhoon71 commented 7 years ago

OK, binding done. It does indeed grab the images now, which is what I wanted.

The only issue is that the pics get cut; I will check later if putting the imgur link in a separate chapter link solvs this (removing the dupes from the epub later). --> Checked, it works

Also, thanks a lot for working on this (one should at least say "thanks" sometime right?)

dteviot commented 7 years ago

@typhoon71

The only issue is that the pics get cut; I will check later if putting the imgur link in a separate chapter link solvs this

I'm not sure what you mean by this. Can you provide more details, please?

Also, thanks a lot for working on this

You're very welcome. But, yes, it's nice getting thanks.

typhoon71 commented 7 years ago

More info about the "pics get cut" bit:

This doesn't happen if you grab the imgur gallery directly (putting a direct link to the imgur gallery in the chapter list).

[ In fact I just did that for v19. For v20 I kept the original link, added a direct one and removed the dupes manually. ]