N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
https://n0tan3rd.github.io/Squidwarc/
Apache License 2.0

Unable to reuse local Chrome user dir/cookies #39

Open machawk1 opened 5 years ago

machawk1 commented 5 years ago

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

https://github.com/N0taN3rd/Squidwarc/blob/master/manual/configuration.md#userdatadir states that a userDataDir attribute can be specified to reuse the user directory of a system's Chrome. I use a logged-in instance of Chrome on my system, so I wanted to leverage my logged-in cookies to crawl content behind authentication using Squidwarc. I specify a config file for Squidwarc:

{ "use": "puppeteer", "headless": true, "script": "./userFns.js", "mode": "page-all-links", "depth": 1, "seeds": [ "https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly" ], "warc": { "naming": "url", "append": true }, "connect": { "launch": true, "host": "localhost", "port": 9222, "userDataDir": "/Users/machawk1/Library/Application Support/Google/Chrome" }, "crawlControl": { "globalWait": 5000, "inflightIdle": 1000, "numInflight": 2, "navWait": 8000 } }

...in an attempt to preserve https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly, a URI that will provide a login page if not authenticated. I get the following result on stdout:

Running Crawl From Config File /Users/machawk1/Desktop/squidwarcWithCookies.json
With great power comes great responsibility!
Squidwarc is not responsible for ill behaved user supplied scripts!
Crawler Operating In page-all-links mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly
Running user script
Crawler Generating WARC
Crawler Has 18 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#column-one
Running user script
Crawler Generating WARC
Crawler Has 17 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly#searchInput
Running user script
Crawler Generating WARC
Crawler Has 16 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly&returntoquery=
Running user script
Crawler Generating WARC
Crawler Has 15 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Badtitle
Running user script
Crawler Generating WARC
Crawler Has 14 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin&returnto=User%3AMatKelly
Running user script
Crawler Generating WARC
Crawler Has 13 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Main_Page
Running user script
Crawler Generating WARC
Crawler Has 12 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Community_portal
Running user script
Crawler Generating WARC
Crawler Has 11 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/ODU_WS-DL_Wiki:Current_events
Running user script
Crawler Generating WARC
Crawler Has 10 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:RecentChanges
Running user script
Crawler Generating WARC
Crawler Has 9 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random
Running user script
Crawler Generating WARC
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Crawler Navigated To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
Running user script
Crawler Generating WARC
Crawler Has 7 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Localhelppage
Running user script
Crawler Generating WARC
Crawler Has 6 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php/Special:SpecialPages
Running user script
Crawler Generating WARC
Crawler Has 5 Seeds Left To Crawl
Crawler Navigating To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Crawler Navigated To https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:Badtitle&printable=yes
Running user script
Crawler Generating WARC
Crawler Has 4 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/
A Fatal Error Occurred
  Error: options.stripFragment is renamed to options.stripHash

  - index.js:35 module.exports
    [Squidwarc]/[normalize-url]/index.js:35:9

  - _createHybrid.js:87 wrapper
    [Squidwarc]/[lodash]/_createHybrid.js:87:15

  - puppeteer.js:155 PuppeteerCrawler.navigate
    /private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11

Please Inform The Maintainer Of This Project About It. Information In package.json

The resulting WARC, oddly, does not contain any records related to the specified URI, even though anonymous access to it results in an HTTP 200. The URI https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random, however, is present in the WARC. Replaying that page shows a login interface, indicating that my browser's cookies were not used.
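
For anyone trying to narrow this down, here is a minimal diagnostic sketch (the file name cookie-check.js is hypothetical and this is not part of Squidwarc) that launches puppeteer directly against the same userDataDir from the config above and prints the cookies visible for the seed URI, to confirm whether the profile's session cookies ever reach the launched browser:

// cookie-check.js -- hypothetical one-off helper, not part of Squidwarc
const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.launch({
    headless: true,
    userDataDir: '/Users/machawk1/Library/Application Support/Google/Chrome'
  })
  const page = await browser.newPage()
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly', {
    waitUntil: 'networkidle2'
  })
  // page.cookies() returns the cookies visible for the current page's URL
  console.log(await page.cookies())
  await browser.close()
}

main().catch(console.error)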

What is the expected behavior?

Squidwarc uses my local Chrome's cookies and captures the page behind authentication, per the manual.

What's your environment?

macOS 10.14.2
Squidwarc a4023352042f4ce707b8564adb62c39e3043a40d (current master)
node v10.12.0

Other information

We discussed this informally via Slack. Previously, I experienced this config borking my Chrome user directory (i.e., when using Chrome normally afterward, credentials would no longer "stick"), but I can no longer replicate this.

machawk1 commented 5 years ago

An update: after running the above, it appears that my cookies for the wiki site at the target URI of the crawl have been removed and I needed to log in again. This is a case of crawling-considered-harmful and an unfortunate side effect.

EDIT: It appears to have affected the retention of other sites' cookies (e.g., facebook.com) as well.

machawk1 commented 5 years ago

@N0taN3rd Per your suggestion, I pulled 9bbc461 and re-installed with the same (above) config.json.

The crawl finished within a couple of minutes with an error, which I did not mention in the ticket description but which may be relevant for debugging:

Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
A Fatal Error Occurred
  Error: options.stripFragment is renamed to options.stripHash

  - index.js:35 module.exports
    [Squidwarc]/[normalize-url]/index.js:35:9

  - _createHybrid.js:87 wrapper
    [Squidwarc]/[lodash]/_createHybrid.js:87:15

  - puppeteer.js:155 PuppeteerCrawler.navigate
    /private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11

Please Inform The Maintainer Of This Project About It. Information In package.json

Upon re-launching Chrome, some sites where I would have had a cookie (including the ws-dl wiki site) showed that I was no longer logged in. Others, e.g., gmail.com, retained my cookie. EDIT: Google reported a cookie error on subsequent logins (pic).

Viewing the WARC showed that the URI specified to be archived was not present, but a capture of the wiki login page was present and replayable.

machawk1 commented 5 years ago

As discussed via Slack, making a duplicate of my profile might help resolve this issue. I did so via:

cp -r "/Users/machawk1/Library/Application Support/Google/Chrome" /tmp/Chrome

...then ran ./bootstrap.sh and ./run-crawler.sh -c wsdlwiki.config from the root of my Squidwarc working directory, at current master a2f1d6383cbae06ccd5dc315ba88879e85a12ca5.

My macOS 10.14.2 Chrome reports version 72.0.3626.81.

wsdlwiki.config is the same as above but with the path changed to /tmp/Chrome.

./run-crawler.sh -c wsdlwiki.config
Running Crawl From Config File wsdlwiki.config
With great power comes great responsibility!
Squidwarc is not responsible for ill behaved user supplied scripts!

Crawler Operating In page-only mode
Crawler Will Be Preserving 1 Seeds
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc in appending mode
Crawler Will Be Generating WARC Files Using the filenamified url
A Fatal Error Occurred
  Error: Failed to launch chrome!
  dlopen /private/tmp/Squidwarc/node_modules/puppeteer/.local-chromium/mac-624487/chrome-mac/Chromium.app/Contents/MacOS/../Versions/73.0.3679.0/Chromium Framework.framework/Chromium Framework: dlopen(/private/tmp/Squidwarc/node_modules/puppeteer/.local-chromium/mac-624487/chrome-mac/Chromium.app/Contents/MacOS/../Versions/73.0.3679.0/Chromium Framework.framework/Chromium Framework, 261): image not found
  TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md

  - Launcher.js:360 onClose
    [Squidwarc]/[puppeteer]/lib/Launcher.js:360:14

  - Launcher.js:349 Interface.helper.addEventListener
    [Squidwarc]/[puppeteer]/lib/Launcher.js:349:50

  - events.js:187 Interface.emit
    events.js:187:15

  - readline.js:379 Interface.close
    readline.js:379:8

  - readline.js:157 Socket.onend
    readline.js:157:10

  - events.js:187 Socket.emit
    events.js:187:15

  - _stream_readable.js:1094 endReadableNT
    _stream_readable.js:1094:12

  - next_tick.js:63 process._tickCallback
    internal/process/next_tick.js:63:19

Please Inform The Maintainer Of This Project About It. Information In package.json

It's interesting and potentially problematic that Squidwarc/puppeteer is trying to use Chromium 73.0.3679.0 per the error. Do you think the version difference is the issue, @N0taN3rd, or something else?
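
For debugging, a small hedged sketch (the file name which-chromium.js is hypothetical) that asks the puppeteer module which bundled binary it intends to launch and whether that binary actually exists on disk:

// which-chromium.js -- hypothetical check, not part of Squidwarc
const fs = require('fs')
const puppeteer = require('puppeteer')

// puppeteer.executablePath() reports the bundled Chromium binary that
// puppeteer.launch() will use when no executablePath override is given
const chromiumPath = puppeteer.executablePath()
console.log(chromiumPath, fs.existsSync(chromiumPath) ? '(exists)' : '(missing)')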

N0taN3rd commented 5 years ago

TBH I am completely unsure at this point.

I have had success on Linux using the same browser (though I do have to continually re-sign in every time :goberserk:), but I do believe that switching between stable <-> dev <-> unstable does cause some issues.

The best bet I can think of right now is to use a completely new user data dir: initially launch the version of Chrome you want with --user-data-dir=<path to wherever you want it>, sign into your Google profile in Chrome, and then into any of the sites you want to crawl.

That way when you start the crawl that completely new user data dir is unique to that browser.
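
The suggestion above is to launch Chrome itself with --user-data-dir; an equivalent sketch using the puppeteer that Squidwarc already bundles (so the profile is created by the same Chromium the crawler will drive; the path /tmp/squidwarc-profile is just an example) might look like this:

// make-profile.js -- hypothetical helper for creating a dedicated crawl profile
const puppeteer = require('puppeteer')

async function main () {
  // Headful, so you can sign into your Google profile and the sites to crawl
  const browser = await puppeteer.launch({
    headless: false,
    userDataDir: '/tmp/squidwarc-profile' // example path; pick any fresh dir
  })
  const page = await browser.newPage()
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php?title=Special:UserLogin')
  // Sign in manually, then close the browser window. The cookies persist in
  // /tmp/squidwarc-profile, which can then be used as connect.userDataDir
  // in the Squidwarc config.
}

main().catch(console.error)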

N0taN3rd commented 5 years ago

This will require some additional changes to Squidwarc, but I suspect that the issue is with setting the user data dir itself rather than letting the browser's normal resolution of that directory's path take place.

So if there were a config option to not do anything data-dir / password related, the browser would figure it out correctly.

//cc @N0taN3rd

machawk1 commented 5 years ago

Having to re-sign in somewhat defeats the purpose of reusing the user data directory. It's reminiscent of the Webrecorder approach (:P) and is not nearly as powerful as reusing existing cookies/logins, if that is possible.

With regard to the delta between the system's Chrome version and the one Squidwarc uses, is there currently a way to tell Squidwarc to use a certain version of Chrome(ium)? Having those match up while reusing the data dir might be one needed test to see whether the problem persists.
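
For reference, puppeteer's launch() does accept an executablePath option, so matching versions would mean pointing it at the system Chrome 72 binary. The sketch below shows only the underlying puppeteer call; it is not an existing Squidwarc config option, and Squidwarc would need a change to plumb something like it through:

// Sketch of the underlying puppeteer call, not a Squidwarc feature
const puppeteer = require('puppeteer')

async function launchWithSystemChrome () {
  return puppeteer.launch({
    // Use the system Chrome 72 instead of the bundled Chromium 73
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    userDataDir: '/tmp/Chrome', // the duplicated profile from the comment above
    headless: true
  })
}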

Mauville commented 2 years ago

For the truly desperate, I was able to load some cookies by doing the following:

  1. Set a breakpoint on a line in the project and start debugging
  2. Wait for the browser to load
  3. Manually log in to the sites you want in another tab to store the cookie
  4. Resume execution

This lets you "load" cookies into the session.
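
A related alternative to the breakpoint trick, sketched here with plain puppeteer's page.setCookie() rather than anything Squidwarc exposes (the cookie name and value below are placeholders you would copy from DevTools after a manual login), is to inject the session cookie programmatically before navigating:

// Sketch: seed a known session cookie into the browser before crawling.
const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.launch({ headless: true })
  const page = await browser.newPage()
  await page.setCookie({
    name: 'example_session',       // placeholder name; copy the real one from DevTools
    value: 'copied-from-devtools', // placeholder value
    domain: 'ws-dl.cs.odu.edu',
    path: '/',
    httpOnly: true,
    secure: true
  })
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly')
  await browser.close()
}

main().catch(console.error)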