N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
https://n0tan3rd.github.io/Squidwarc/
Apache License 2.0

Instagram pages not recorded properly? #26

Closed peterk closed 5 years ago

peterk commented 6 years ago

Are you submitting a bug report or a feature request?

bug report

What is the current behavior?

Tried to capture an Instagram page (https://www.instagram.com/visit_berlin/) using the following config:

    {
      "jobid": "973f4eee0c103ddcb3dc1e7d839630d0",
      "headless": true,
      "mode": "page-only",
      "depth": 1,
      "seeds": [
        "https://www.instagram.com/visit_berlin/"
      ],
      "warc": {
        "naming": "url",
        "output": "/archive/973f4eee0c103ddcb3dc1e7d839630d0"
      },
      "connect": {
        "launch": false,
        "host": "localhost",
        "port": 9222
      },
      "crawlControl": {
        "globalWait": 60000,
        "inflightIdle": 1000,
        "numInflight": 2,
        "navWait": 8000
      }
    }

The capture seems to work correctly, but the resulting WARC cannot be played back properly (images are missing). I cannot tell whether the images were recorded in the WARC. Maybe there is a problem when saving the images?
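One way to check whether image responses made it into the file is to list the `WARC-Target-URI` of every `response` record. The following is a minimal sketch, assuming the WARC is uncompressed and small enough to read into memory; the function name `listResponseUris` is made up for illustration and is not part of Squidwarc.

```javascript
// Sketch: list the target URIs of all response records in an uncompressed WARC.
// Assumes the whole file has been read into a string (hypothetical helper).
function listResponseUris(warcText) {
  const uris = [];
  // A record's header block starts with a "WARC/1.x" version line
  // and ends at the first blank line.
  const headerBlock = /^WARC\/1\.\d\r?\n([\s\S]*?)\r?\n\r?\n/gm;
  let match;
  while ((match = headerBlock.exec(warcText)) !== null) {
    const headers = match[1];
    // Only response records carry the captured payloads.
    if (!/^WARC-Type:\s*response\s*$/im.test(headers)) continue;
    const uri = /^WARC-Target-URI:\s*(\S+)/im.exec(headers);
    // Some writers wrap the URI in angle brackets; strip them if present.
    if (uri) uris.push(uri[1].replace(/^<|>$/g, ""));
  }
  return uris;
}
```

Reading the file with Node's `fs.readFileSync(path, "latin1")` and passing the result to `listResponseUris` gives a list that can be filtered for image URLs (for example, those ending in `.jpg`).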

This is how it looks in Webrecorder Player and pywb:

image

What is the expected behavior?

Captured Instagram page should be able to play back with images.

Running with Chromium 64.0.3282.168 on Alpine Linux 3.7

peterk commented 6 years ago

Saw a similar issue with Webrecorder. Resources listed in img srcset attributes are not saved, so when I play back the WARC on a high-DPI screen, the browser fails to find the right image resource. It would be great if Squidwarc could fetch the srcset resources automatically (or allow injecting a script that prefetches them for certain sites). I guess this should rather be a feature request.
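To illustrate the kind of prefetch script meant here, the following is a minimal sketch that collects every candidate URL from `srcset` attributes so they can be requested during capture. It is not Squidwarc's implementation; the function names `parseSrcsetUrls` and `collectSrcsetUrls` are made up for illustration, and the parser is simplified (it assumes descriptors contain no commas).

```javascript
// Sketch: extract the candidate URLs from one srcset attribute value.
// Simplification: a descriptor ("2x", "640w") is assumed to contain no commas.
function parseSrcsetUrls(srcset) {
  const urls = [];
  let rest = srcset;
  while (true) {
    rest = rest.replace(/^[\s,]+/, ""); // skip whitespace and separators
    if (rest === "") break;
    const url = rest.match(/^\S+/)[0]; // a URL is a run of non-whitespace
    rest = rest.slice(url.length);
    if (url.endsWith(",")) {
      // "a.jpg, b.jpg": the comma ends a candidate that has no descriptor.
      urls.push(url.replace(/,+$/, ""));
    } else {
      urls.push(url);
      // Skip the descriptor, up to and including the next comma.
      const comma = rest.indexOf(",");
      rest = comma === -1 ? "" : rest.slice(comma + 1);
    }
  }
  return urls;
}

// Sketch: gather absolute, de-duplicated srcset URLs from a document.
function collectSrcsetUrls(doc) {
  const found = new Set();
  for (const el of doc.querySelectorAll("img[srcset], source[srcset]")) {
    for (const url of parseSrcsetUrls(el.getAttribute("srcset") || "")) {
      found.add(new URL(url, doc.baseURI).href);
    }
  }
  return [...found];
}
```

Run in the page, something like `collectSrcsetUrls(document).forEach(u => fetch(u, { mode: "no-cors" }))` would request each candidate so the crawler can observe the traffic; whether those responses end up in the WARC depends on how the crawler records network activity.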

N0taN3rd commented 6 years ago

@peterk Thanks for this issue and for cross-referencing the Webrecorder one. This functionality will be added to Squidwarc shortly.

In the interim, you can track the progress of pywb PR #359 to gain insight into how this functionality will likely be implemented in Squidwarc.

However, the two implementations of this feature, Squidwarc's and Webrecorder's, will differ in two respects:

  1. Squidwarc's will favor general automation with no existing framework to back it (Squidwarc is a middlemanless [recorderless] solution to high-fidelity preservation)
  2. Webrecorder's will favor integration into its existing rewriting and recording framework

peterk commented 6 years ago

This is now solved with the srcset user script example. Archiving a typical Instagram page now records almost double the amount of image data in the WARC.