belaviyo / save-images

Save loaded images in nested iframe pages
https://webextension.org/listing/save-images.html

Sites that hide image links inside an svg put inside an external document loaded into an object tag! #76

Closed AlbertoP64 closed 1 year ago

AlbertoP64 commented 1 year ago

(This issue is also published here: https://add0n.com/save-images.html#IDComment1117725792 and here: https://forums.opera.com/topic/35736/download-all-images/24?1671805475376 ) @belaviyo

Suggested Label: Enhancement


Obscure image-link protection technique

Some sites, such as https://dokumen.tips/, hide the .jpg link in a very obscure way.

Sample page URL: https://dokumen.tips/documents/topolino-libretto-n-1.html?page=1

Side notes:

Code extract:

(please note that "(user)" comments are mine)

<!-- (user) this div contains a single page with an image inside it, protected from download -->
<div id="p1" style="overflow: hidden; position: absolute; background-color: white; width: 535px; height: 765px; margin: 0px;" class="page-inner">

    <!-- Begin page background -->
    <!-- (user) this div is a transparent overlay used to intercept clicks -->
    <div id="pg1Overlay" style="width: 100%; height: 100%; position: absolute; z-index: 1; background-color: rgba(0, 0, 0, 0); user-select: initial;"></div>

    <!-- (user) this div is the image indirect container -->
    <div id="pg1" style="user-select: initial;">

        <!-- (user) object is used to defeat the finding of links, and contains a document that contains in turn an svg -->
        <object width="535" height="765" data="https://reader034.dokumen.tips/reader034/viewer/2022052117/568c2c261a28abd8328c8539/html5/1/1.svg"
                type="image/svg+xml" id="pdf1" style="width:535px; height:765px; -moz-transform:scale(1); z-index: 0;">

            #document
            <!DOCTYPE svg PUBLIC....>

            <!-- (user) the svg contains in turn an Image tag with the link (xlink type) to the true image loaded -->
            <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 535 765" version="1.1">
                <image preserveAspectRatio="none" x="0" y="0" width="535" height="765" xlink:href="img/1.jpg"/>

            </svg>
        </object>
    </div>
    <!-- End page background -->

</div>
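
To make the indirection above concrete, here is a minimal sketch in Python (not the extension's code) of how the real image address can be derived: parse the SVG that the object tag points to, read the xlink:href of each image element, and resolve it against the SVG's own URL. The function name `image_urls_in_svg` is hypothetical, and the SVG text is pasted inline from the extract rather than fetched from the site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def image_urls_in_svg(svg_text, svg_url):
    """Return the absolute URL of every <image> referenced by an SVG document."""
    root = ET.fromstring(svg_text)
    urls = []
    for image in root.iter(f"{{{SVG_NS}}}image"):
        # SVG 1.1 uses xlink:href; a plain href attribute is accepted as a fallback.
        href = image.get(f"{{{XLINK_NS}}}href") or image.get("href")
        if href:
            # Relative links are resolved against the SVG document, not the outer page.
            urls.append(urljoin(svg_url, href))
    return urls


# URL and markup copied from the code extract above.
svg_url = (
    "https://reader034.dokumen.tips/reader034/viewer/2022052117/"
    "568c2c261a28abd8328c8539/html5/1/1.svg"
)
svg_text = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 535 765" version="1.1">'
    '<image preserveAspectRatio="none" x="0" y="0" width="535" height="765" '
    'xlink:href="img/1.jpg"/></svg>'
)

print(image_urls_in_svg(svg_text, svg_url))
```

The point of the sketch is that "img/1.jpg" only becomes a usable link once it is joined with the address of the inner .svg document, which is why a scan of the outer page alone does not reveal it.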

Suggested changes

To handle those pages, the developer of the extension should IMHO:

Note that on the test page linked above, with the current extension version and filtering by the "[0-9]*[.]jpg" regexp mask (which seems to be ignored!), only thumbnails of the .jpg files (around 10 Mb) and the useless .svg files get downloaded (apart from other unwanted pics ;-) ).

I also suppose that, even after implementing [1], the user would have to set the Deep level to at least 1 for those "deep links" (other linked documents) to be recognized.
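
As a side check on the mask itself, here is a small Python sketch of how "[0-9]*[.]jpg" behaves. How the extension actually applies the mask is an assumption here; the sketch only compares an unanchored search over the whole URL with a full match on the file name. The example.com URLs and both function names are hypothetical.

```python
import re
from posixpath import basename
from urllib.parse import urlsplit

MASK = re.compile(r"[0-9]*[.]jpg")


def matches_anywhere(url):
    """Unanchored search: true as soon as '.jpg' appears anywhere in the URL."""
    return MASK.search(url) is not None


def matches_file_name(url):
    """Anchored match: the whole file name must be digits followed by '.jpg'."""
    return MASK.fullmatch(basename(urlsplit(url).path)) is not None


# Hypothetical URLs, for illustration only.
candidates = [
    "https://example.com/html5/1/img/1.jpg",
    "https://example.com/thumbs/cover-small.jpg",
    "https://example.com/html5/1/1.svg",
]

for url in candidates:
    print(url, matches_anywhere(url), matches_file_name(url))
```

Because "[0-9]*" can match zero digits, an unanchored search accepts any URL containing ".jpg", thumbnails included. That could make the mask look ignored for .jpg files, though it would not explain the .svg files getting through.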
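
To illustrate what I mean by Deep level, here is a rough Python sketch. It assumes that the Deep level counts how many hops into embedded documents the scan is allowed to follow, which is only my reading of the setting and not the extension's documented behavior. The `pages` dictionary is a hypothetical in-memory stand-in for the outer page and the inner .svg document.

```python
def collect_images(pages, url, depth):
    """Collect image links from `url` and from embedded documents up to `depth` hops away."""
    page = pages[url]
    found = list(page["images"])
    if depth > 0:
        for child in page["embedded"]:
            # Each hop into an embedded document consumes one level of depth.
            found.extend(collect_images(pages, child, depth - 1))
    return found


# Hypothetical model: the outer page embeds one .svg document that holds the real image.
pages = {
    "page.html": {"images": ["thumb.jpg"], "embedded": ["1.svg"]},
    "1.svg": {"images": ["img/1.jpg"], "embedded": []},
}

print(collect_images(pages, "page.html", 0))
print(collect_images(pages, "page.html", 1))
```

In this model the real image only shows up once at least one hop is allowed. The sketch does not guard against documents that embed each other in a cycle.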

Other tests performed

I tried loading the inner SVG DOCUMENT https://reader034.dokumen.tips/reader034/viewer/2022052117/568c2c261a28abd8328c8539/html5/1/1.svg in a new tab (this shows the Mickey Mouse image inside the SVG) and then opening the Download All Images extension on it, but it seems that the extension won't open on that page, I suppose because of the page type / file extension. So I have added point [2] to the task list.

Other info

I'm browsing on a Windows PC with the latest version of Opera Desktop.

belaviyo commented 1 year ago

Thank you for the report. The issue is fixed. You can find the actual images with Deep = 0 now.

AlbertoP64 commented 1 year ago

Many thanks to the developer! I will try it soon ;-) 👍

On Sun, 8 Jan 2023 at 08:57, belaviyo @.***> wrote:

Thank you for the report. The issue is fixed. You can find the actual images with Deep = 0 now.


-- A. Pignocchino

AlbertoP64 commented 1 year ago

Hi @belaviyo, I've tested thoroughly. Sorry to say that the changes made don't actually work :-( .

But since the site has a fully functional PDF download button, I will NOT ask to reopen the issue.

Also, for other reasons (read on), this site seems tougher than expected. See points 5-6 below.

-:-

If you want more data, please see the attached annotated images and the attached file list that I built to finally download the .jpg images using your extension. I seem to have written too much on the images, but they should be straightforward to read once the initial fright has passed :-D

Key points are:

  1. The only image found by your extension is not one of the correct ones: it has the wrong size, it is used on the page only briefly after the page loads and then disappears, and it seems to be an explicit TRAP for downloader apps, being (intentionally?) given wrong dimensions on download so that it stretches badly.
  2. I could be wrong, but it seems to me that your extension still cannot reach the documents loaded inside the divs, and I don't know why. I mean those already loaded in place (see points 5 and 6). I'm sorry I cannot really debug what's going on; this is only my impression.
  3. If I try to open one of the nested ("Chinese box") documents directly (URL in the issue), which have an .svg extension but are XML files, I cannot use your extension on it. As I said in my first bug report, I suppose this is by design and not necessarily a limitation, but it prevents me from testing further where the problem lies, as I cannot force the extension to work on a loaded page with an .svg extension / MIME type. This can make sense, because normally those files are images and not XML documents. (It could, however, be a limitation of the extension if the nested documents loaded into the divs are skipped from evaluation for this reason.)
  4. Of the attached text files, one is my final URL list for the true images. As you can see, all final images have the same name (so I used the [order] tag to rename them in the extension; it worked like a charm).
  5. The site runs JavaScript that dynamically loads (and possibly unloads) the individual images (and their nested container documents) in the vertically scrolling container (named "idrviewer"). This prevents the page from having all the real images linked at the same time! I looked deep into the JavaScript and page code, but so far I'm stuck: I cannot get all the pictures loaded at the same time. Note: the lazy loading is confirmed by watching the code in the developer tools while scrolling down the container. (Note: IDRViewer seems to be an IDRsolutions project connected with the BuildVu PDF converter. You can find them on the web.)
  6. The other text file is a CODE SAMPLE of two image slots in the dynamic vertical container, one in the LOADED state and the other in the UNLOADED (empty) state. So far I have not found the code responsible for dynamically filling those slot elements with the link to the .svg document fragments. There is no other data anywhere, so the ID ("page 20") should be the only hint the code has to build the slot contents. It seems that the code uses jQuery to lazy-load the contents, always keeping at least ten images loaded. A script in the body seems interesting, but I don't have time at the moment to go deeper.
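
Regarding point 4, here is a minimal Python sketch of why an order-based rename is needed when every image has the same file name. It only imitates the idea behind the [order] tag and is not the extension's implementation. The example.com URLs are hypothetical placeholders (my real list is in attachment 4), and the function name `ordered_names` and the zero-padded prefix format are my own choices.

```python
from posixpath import basename, splitext
from urllib.parse import urlsplit


def ordered_names(urls, width=3):
    """Prefix each file name with its 1-based position so identical names do not collide."""
    names = []
    for order, url in enumerate(urls, start=1):
        stem, ext = splitext(basename(urlsplit(url).path))
        names.append(f"{order:0{width}d}-{stem}{ext}")
    return names


# Hypothetical URLs: different folders, same file name.
urls = [
    "https://example.com/viewer/html5/1/img/1.jpg",
    "https://example.com/viewer/html5/2/img/1.jpg",
    "https://example.com/viewer/html5/3/img/1.jpg",
]

print(ordered_names(urls))
```

Without the position prefix, all three downloads would be saved as 1.jpg and would overwrite each other or be renamed arbitrarily.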

Attachments:

(0) extension settings, and resulting found image (below): 2023 01 12 Image downloader ext Opera - site dokumen tips - 0 ext settings and resulting image

(1) first image shown on scrolling container: 2023 01 12 Image downloader ext Opera - site dokumen tips - 1 first image on scrolling container

(2) hourglass while scrolling down the vertical container, showing delayed loading of the inner documents (confirmed): 2023 01 12 Image downloader ext Opera - site dokumen tips - 2 hourglass while scrolling down the vertical container, and delay loading the inner documents! Note: the lazy loading is confirmed by watching the HTML code in the developer tools while scrolling down the container. The inner divs show a changed status (see attachment 3) and added inner content.

(3) [TEXT] Code fragments from site dokumen.tips - dynamically loaded fragments: 2023 01 12 Image.downloader ext Opera - site dokumen.tips - 3 Code fragments from site dokumen.tips - dynamically loaded fragments.txt

(4) [TEXT] _Final Files list I've used to load the pictures: 2023 01 12 Image.downloader ext Opera - site dokumen.tips - 4 _Final Files list.txt