machawk1 opened 5 years ago
An update: after running the above, it appears that the cookies for the wiki site at the target URI of the crawl have been removed and I needed to log in again. This is a case of crawling-considered-harmful and an unfortunate side effect.
EDIT: It appears to have affected the retention of other site cookies (e.g., facebook.com) as well.
@N0taN3rd Per your suggestion, I pulled 9bbc461 and re-installed with the same (above) config.json.
The crawl finished within a couple of minutes with an error, which I did not mention in the ticket description but which may be relevant for debugging:
Crawler Has 8 Seeds Left To Crawl
Crawler Navigating To https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents
A Fatal Error Occurred
Error: options.stripFragment is renamed to options.stripHash
- index.js:35 module.exports
[Squidwarc]/[normalize-url]/index.js:35:9
- _createHybrid.js:87 wrapper
[Squidwarc]/[lodash]/_createHybrid.js:87:15
- puppeteer.js:155 PuppeteerCrawler.navigate
/private/tmp/Squidwarc/lib/crawler/puppeteer.js:155:11
Please Inform The Maintainer Of This Project About It. Information In package.json
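For reference, the fatal error above originates in the normalize-url package, which renamed its stripFragment option to stripHash. A minimal sketch of the corresponding fix at the call site, assuming Squidwarc still passes the old option name when normalizing seed URLs (the URL and fragment below are illustrative):

```js
const normalizeUrl = require('normalize-url')

// Illustrative seed URL carrying a fragment to strip.
const seed = 'https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents#example-fragment'

// Older normalize-url accepted { stripFragment: true }; current versions throw
// the error shown above and expect { stripHash: true } instead.
console.log(normalizeUrl(seed, { stripHash: true }))
```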
Upon re-launching Chrome, some sites where I had a cookie (including the ws-dl wiki site) showed that I was no longer logged in. Others, e.g., gmail.com, retained my cookie. EDIT: Google reported a cookie error on subsequent logins (pic).
Viewing the WARC showed that the URI specified to be archived was not present but a capture of the wiki login page was present and replay-able.
As discussed via Slack, making a duplicate of my profile might help resolve this issue. I did so via:
cp -r "/Users/machawk1/Library/Application Support/Google/Chrome" /tmp/Chrome
...then ran ./bootstrap.sh
and ./run-crawler.sh -c wsdlwiki.config
from the root of my Squidwarc working directory at current master (a2f1d6383cbae06ccd5dc315ba88879e85a12ca5).
My macOS 10.14.2 Chrome reports version 72.0.3626.81.
wsdlwiki.config is the same as above but with the path changed to /tmp/Chrome.
./run-crawler.sh -c wsdlwiki.config
Running Crawl From Config File wsdlwiki.config
With great power comes great responsibility!
Squidwarc is not responsible for ill behaved user supplied scripts!
Crawler Operating In page-only mode
Crawler Will Be Preserving 1 Seeds
Crawler Generated WARCs Will Be Placed At /private/tmp/Squidwarc in appending mode
Crawler Will Be Generating WARC Files Using the filenamified url
A Fatal Error Occurred
Error: Failed to launch chrome!
dlopen /private/tmp/Squidwarc/node_modules/puppeteer/.local-chromium/mac-624487/chrome-mac/Chromium.app/Contents/MacOS/../Versions/73.0.3679.0/Chromium Framework.framework/Chromium Framework: dlopen(/private/tmp/Squidwarc/node_modules/puppeteer/.local-chromium/mac-624487/chrome-mac/Chromium.app/Contents/MacOS/../Versions/73.0.3679.0/Chromium Framework.framework/Chromium Framework, 261): image not found
TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md
- Launcher.js:360 onClose
[Squidwarc]/[puppeteer]/lib/Launcher.js:360:14
- Launcher.js:349 Interface.helper.addEventListener
[Squidwarc]/[puppeteer]/lib/Launcher.js:349:50
- events.js:187 Interface.emit
events.js:187:15
- readline.js:379 Interface.close
readline.js:379:8
- readline.js:157 Socket.onend
readline.js:157:10
- events.js:187 Socket.emit
events.js:187:15
- _stream_readable.js:1094 endReadableNT
_stream_readable.js:1094:12
- next_tick.js:63 process._tickCallback
internal/process/next_tick.js:63:19
Please Inform The Maintainer Of This Project About It. Information In package.json
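As a quick sanity check on the "image not found" failure above (a sketch, not Squidwarc code): print the path where Puppeteer expects its bundled Chromium and whether that binary actually exists on disk; if it is missing, reinstalling the puppeteer package should re-download it.

```js
const fs = require('fs')
const puppeteer = require('puppeteer')

// Path Puppeteer will try to launch when no executablePath is supplied.
const bundled = puppeteer.executablePath()
console.log(bundled)
console.log(fs.existsSync(bundled) ? 'bundled Chromium is present' : 'bundled Chromium is missing')
```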
It's interesting and potentially problematic that Squidwarc/puppeteer is trying to use Chromium 73.0.3679.0 per the error. Do you think the version difference is the issue, @N0taN3rd, or something else?
TBH I am completely unsure at this point.
I have had success on Linux using the same browser (though I do have to re-sign in every time :goberserk:), though I do believe that switching between stable <-> dev <-> unstable does cause some issues.
The best bet I can think of now is to use a completely new user data dir: initially launch the version of Chrome you want with --user-data-dir=<path to wherever you want it>, sign in to your Google profile in Chrome, and then sign in to any of the sites you want to crawl. That way, when you start the crawl, that completely new user data dir is unique to that browser.
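A sketch of that workflow driven through Puppeteer rather than the Chrome binary directly (the profile path is illustrative): launch a headful browser against a brand-new user data dir, sign in by hand, then point the crawl at the same directory. Creating the profile with the same browser build that will do the crawling also sidesteps the version-mismatch question raised above.

```js
const puppeteer = require('puppeteer')

;(async () => {
  // Brand-new, crawl-only profile directory (illustrative path).
  const browser = await puppeteer.launch({
    headless: false,
    userDataDir: '/tmp/squidwarc-profile'
  })

  // Sign in to Google and to the seed sites in this window; the cookies are
  // written under /tmp/squidwarc-profile and remain available to a later crawl.
  browser.on('disconnected', () => process.exit(0)) // exit once the window is closed by hand
})()
```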
This will require some additional changes to Squidwarc, but I suspect that the issue is with setting the user data dir itself rather than letting the browser's normal resolution of that directory's path take place. So if there were a config option to not do anything data-dir/password related, the browser would figure it out correctly.
//cc @N0taN3rd
Having to re-sign in somewhat defeats the purpose of reusing the user data directory. It's reminiscent of the Webrecorder approach (:P) and is not nearly as powerful as reusing existing cookies/logins, if possible.
With regard to the delta between the system's Chrome version and the one used by Squidwarc, is there currently a way to tell Squidwarc to use a certain version of Chrome(ium)? Having those match up while reusing the data dir might be one test needed to see whether the login persists.
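I am not sure whether Squidwarc exposes this, but at the Puppeteer level the browser binary can be pinned with executablePath, so a sketch of pointing the crawl at the system Chrome 72 (rather than the bundled Chromium 73) while reusing the duplicated profile would look something like this (paths are illustrative):

```js
const puppeteer = require('puppeteer')

;(async () => {
  const browser = await puppeteer.launch({
    // System Chrome (72.0.3626.81 here) instead of the bundled Chromium.
    executablePath: '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
    // The duplicated profile from earlier in the thread.
    userDataDir: '/tmp/Chrome',
    headless: false
  })
  const page = await browser.newPage()
  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly')
  await browser.close()
})()
```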
For the truly desperate, I was able to load some cookies by doing the following:
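(The snippet itself is not preserved in this thread; a minimal sketch of one way to do it, using Puppeteer's page.setCookie with placeholder cookie values, follows.)

```js
const puppeteer = require('puppeteer')

;(async () => {
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()

  // Cookie copied out of the logged-in browser (name and value are placeholders).
  await page.setCookie({
    name: 'example_session',
    value: 'REPLACE_WITH_REAL_VALUE',
    domain: 'ws-dl.cs.odu.edu',
    path: '/',
    httpOnly: true,
    secure: true
  })

  await page.goto('https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly')
  await browser.close()
})()
```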
This lets you "load" cookies into the session.
Are you submitting a bug report or a feature request?
Bug report.
What is the current behavior?
https://github.com/N0taN3rd/Squidwarc/blob/master/manual/configuration.md#userdatadir states that a userDataDir attribute can be specified to reuse the user directory of a system's Chrome. I use a logged-in version of Chrome on my system, so I wanted to leverage my logged-in cookies to crawl content behind authentication using Squidwarc. I specify a config file for Squidwarc:...in an attempt to preserve https://ws-dl.cs.odu.edu/wiki/index.php/User:MatKelly, a URI that will provide a login page if not authenticated. I get the following result on stdout:
The resulting WARC, oddly, does not contain any records related to the specified URI, even though anonymous access results in an HTTP 200. The URI https://ws-dl.cs.odu.edu/wiki/index.php/Special:Random, however, is present in the WARC. Replaying this page shows a login interface, indicating that my browser's cookies were not used.
What is the expected behavior?
Squidwarc uses my local Chrome's cookies and captures the page behind authentication, per the manual.
What's your environment?
macOS 10.14.2, Squidwarc a4023352042f4ce707b8564adb62c39e3043a40d (current master), node v10.12.0
Other information
We discussed this informally via Slack. Previously, I experienced this config borking my Chrome's user directory (i.e., using Chrome conventionally would no longer allow creds to "stick"), but I can no longer replicate this.