ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Can't evaluate Select #181

Closed by TheTechRobo 3 years ago

TheTechRobo commented 3 years ago

I am trying to back up a site.

Here is the command I ran and its output.

$ grab-site --concurrency 1 --delay 1000 https://www.khanacademy.org/test-prep/mcat
psutil: No module named 'psutil'. Resource monitoring will be unavailable.
Manhole[16762:1620245077.6276]: Patched <built-in function fork> and <built-in function forkpty>.
Manhole[16762:1620245077.6541]: Manhole UDS path: /tmp/manhole-16762
Manhole[16762:1620245077.6544]: Waiting for new connection (in pid:16762) ...
Created lmdb db with map_size=1099511627776
Imported /home/thetechrobo/grab-site/www.khanacademy.org-test-prep-mcat-2021-05-05-37e0452b/igsets
Using these 191 ignores:
        %25252525
        /%22%20\+[^/]+\+%20%22
        /%22\+[^/]+\+%22
        /%27%20\+[^/]+\+%20%27
        /%27\+[^/]+\+%27
        /%5C/%5C/
        /'\+[^/]+\+'
        /(%5C)+(%22|%27)
        /App_Themes/.+/App_Themes/
        /\\+(%22|%27)
        /\\+["']
        /\\/\\/
        /bxSlider/.+/bxSlider/
        /bxSlider/bxSlider/
        /clientscript/.+/clientscript/clientscript/
        /clientscript/clientscript/.+/clientscript/
        /clientscript/clientscript/clientscript/
        /css/.+/css/css/
        /css/css/.+/css/
        /css/css/css/
        /images/.+/images/images/
        /images/images/.+/images/
        /images/images/images/
        /img/.+/img/img/
        /img/img/.+/img/
        /img/img/img/
        /js/.+/js/js/
        /js/js/.+/js/
        /js/js/js/
        /lib/exe/.*lib[-_]exe[-_]lib[-_]exe[-_]
        /scripts/.+/scripts/scripts/
        /scripts/scripts/.+/scripts/
        /scripts/scripts/scripts/
        /slides/.+/slides/slides/
        /slides/slides/.+/slides/
        /slides/slides/slides/
        /styles/.+/styles/styles/
        /styles/styles/.+/styles/
        /styles/styles/styles/
        ^https?://((s-)?static\.ak\.fbcdn\.net|(connect\.|www\.)?facebook\.com)/connect\.php/js/.*rsrc\.php
        ^https?://([^/]+\.)?gdcvault\.com(/.*/|/)(fonts(/.*/|/)fonts/|css(/.*/|/)css/|img(/.*/|/)img/)
        ^https?://([^\./]+\.)?stream\.publicradio\.org/
        ^https?://([^\.]+\.)?pinterest\.com/pin/create/
        ^https?://(\d|www|secure)\.gravatar\.com/avatar/ad516503a11cd5ca435acc9bb6523536
        ^https?://(apis|plusone)\.google\.com/_/\+1/
        ^https?://(audio\d?|nfw)\.video\.ria\.ru/
        ^https?://(ssl\.|www\.)?reddit\.com/(login\?dest=|submit\?|static/button/button)
        ^https?://(www\.)?(megaupload|filesonic|wupload)\.com/
        ^https?://(www\.)?digg\.com/submit\?
        ^https?://(www\.)?facebook\.com/(plugins/(share_button|like(box)?)\.php|sharer/sharer\.php|sharer?\.php|dialog/(feed|share))\?
        ^https?://(www\.)?facebook\.com/v[\d\.]+/plugins/like\.php
        ^https?://(www\.)?friendfeed\.com/share\?
        ^https?://(www\.)?instapaper\.com/hello2\?
        ^https?://(www\.)?myspace\.com/Modules/PostTo/
        ^https?://(www\.)?stumbleupon\.com/(submit\?|badge/embed/)
        ^https?://(www\.)?technorati\.com/faves/?\?add=
        ^https?://(www\.)?twitter\.com/(share\?|intent/((re)?tweet|favorite)|home/?\?status=|\?status=)
        ^https?://(www\.)?xing\.com/(app/user\?op=share|social_plugins/share\?)
        ^https?://(www|draft)\.blogger\.com/(navbar\.g|post-edit\.g|delete-comment\.g|comment-iframe\.g|share-post\.g|email-post\.g|blog-this\.g|delete-backlink\.g|rearrange|blog_this\.pyra)\?
        ^https?://(www|px\.srvcs)\.tumblr\.com/(impixu\?|share(/link/?)?\?|reblog/)
        ^https?://(www|ssl)\.google-analytics\.com/(r/)?(__utm\.gif|collect\?)
        ^https?://.+/.+/disqus\.com/forums/$
        ^https?://.+/js-agent\.newrelic\.com/nr-\d{3}(\.min)?\.js$
        ^https?://.+/js/chartbeat\.js$
        ^https?://.+/stats\.g\.doubleclick\.net/dc\.js$
        ^https?://.+\.blogspot\.(com|in|com\.au|co\.uk|jp|co\.nz|ca|de|it|fr|se|sg|es|pt|com\.br|ar|mx|kr)/(\d{4}/\d{2}/|search/label/)(CSI/$|.*/CSI/CSI/CSI/)
        ^https?://[^/]*musicproxy\.s12\.de/
        ^https?://[^/]+/.+/CaptchaImage\.axd
        ^https?://[^/]+/anony/mjpg\.cgi$
        ^https?://[^/]+/mjpg/video\.mjpg
        ^https?://[^/]+\.akadostream\.ru(:\d+)?/
        ^https?://[^/]+\.corp\.ne1\.yahoo\.com/
        ^https?://[^/]+\.facebook\.com/login\.php
        ^https?://[^/]+\.gaduradio\.pl/
        ^https?://[^/]+\.libsyn\.com/.+/%2[02]https?:/
        ^https?://[^/]+\.rastream\.com(:\d+)?/
        ^https?://[^/]+\.services\.livejournal\.com/ljcounter
        ^https?://[^/]+\.streamtheworld\.com/
        ^https?://[^/]+\.xiti\.com/hit\.xiti\?
        ^https?://[^\./]+\.radioscoop\.(com|net):\d+/
        ^https?://[^\./]+\.streamchan\.org:\d+/
        ^https?://[^\.]+\.livejournal\.com/.+/\*sup_ru/ru/UTF-8/
        ^https?://[^\.]+\.livejournal\.com/.+http://[^\.]+\.livejournal\.com/
        ^https?://[a-z0-9]+\.cdn\.dvmr\.fr(:\d+)?/.+\.mp3
        ^https?://\d+\.media\.tumblr\.com/avatar_.+_16\.pn[gj]$
        ^https?://accounts\.google\.com/(SignUp|ServiceLogin|AccountChooser|a/UniversalLogin)
        ^https?://add\.my\.yahoo\.com/(rss|content)\?
        ^https?://air\.radiorecord\.ru(:\d+)?/
        ^https?://alb\.reddit\.com/
        ^https?://api\.addthis\.com/
        ^https?://audio\d?\.radioreference\.com/
        ^https?://audiots\.scdn\.arkena\.com/
        ^https?://av\.rasset\.ie/av/live/
        ^https?://b\.hatena\.ne\.jp/add\?
        ^https?://b\.scorecardresearch\.com/
        ^https?://beacon\.wikia-services\.com/
        ^https?://bookmark\.naver\.com/post\?
        ^https?://bufferapp\.com/add\?
        ^https?://connect\.mail\.ru/share\?
        ^https?://csp\.cyworld\.com/bi/bi_recommend_pop\.php\?
        ^https?://del\.icio\.us/post\?
        ^https?://delicious\.com/(save|post)\?
        ^https?://download\.ted\.com/
        ^https?://flattr.com/submit/auto\?
        ^https?://gcnplayer\.gcnlive\.com/.+
        ^https?://geo\.yahoo\.com/b\?
        ^https?://getpocket\.com/(save|edit)/?\?
        ^https?://i\.dev\.cdn\.turner\.com/
        ^https?://imageshack\.com/lost$
        ^https?://iwiw\.hu/pages/share/share\.jsp\?
        ^https?://mail\.google\.com/mail/
        ^https?://media\.opb\.org/clips/embed/.+\.js$
        ^https?://medium\.com/_/(vote|bookmark|subscribe)/
        ^https?://memori(\.qip)?\.ru/link/\?
        ^https?://mp3\.ffh\.de/
        ^https?://mp3tslg\.tdf-cdn\.com/
        ^https?://myweb2\.search\.yahoo\.com/myresults/bookmarklet\?
        ^https?://news\.ycombinator\.com/submitlink\?
        ^https?://p\.opt\.fimserve\.com/
        ^https?://photobucket\.com/.+/albums/.+/albums/
        ^https?://pixel\.(quantserve|wp)\.com/
        ^https?://pixel\.blog\.hu/
        ^https?://pixel\.redditmedia\.com/pixel/
        ^https?://platform\d?\.twitter\.com/widgets/tweet_button.html\?
        ^https?://play(\d+)?\.radio13\.ru:8000/
        ^https?://plus\.google\.com/share\?
        ^https?://posterous\.com/share\?
        ^https?://prod-preview\.wired\.com/
        ^https?://pub(\d+)?\.di\.fm/
        ^https?://r-a-d\.io/.+\.mp3$
        ^https?://r-login\.wordpress\.com/remote-login\.php
        ^https?://relay\.broadcastify\.com/
        ^https?://reporter\.es\.msn\.com/\?fn=contribute
        ^https?://s\d+\.sitemeter\.com/(js/counter\.js|meter\.asp)
        ^https?://service\.weibo\.com/share/share\.php\?
        ^https?://share\.flipboard\.com/bookmarklet/popout\?
        ^https?://social-plugins\.line\.me/lineit/share
        ^https?://sphinn\.com/index\.php\?c=post&m=submit&
        ^https?://static\.licdn\.com/sc/p/.+/f//
        ^https?://static\.licdn\.com/sc/p/com\.linkedin\.nux(:|%3A)nux-static-content(\+|%2B)[\d\.]+/f/
        ^https?://stream(\d+)?\.media\.rambler\.ru/
        ^https?://telegram\.me/share/url\?
        ^https?://tm\.uol\.com\.br/h/.+/h/
        ^https?://tmz\.vo\.llnwd\.net/
        ^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/
        ^https?://video-subtitle\.tedcdn\.com/
        ^https?://vkontakte\.ru/share\.php\?
        ^https?://vuible\.com/pins-settings/
        ^https?://web\.archive\.org/web/[^/]+/https?\:/[^/]+\.addthis\.com/.+/static/.+/static/
        ^https?://wow\.ya\.ru/posts_(add|share)_link\.xml\?
        ^https?://www\.addthis\.com/bookmark\.php\?
        ^https?://www\.addtoany\.com/(add_to/|share_save\?)
        ^https?://www\.amazon\.com/.+/logging/log-action\.html
        ^https?://www\.blinklist\.com/index\.php\?Action=Blink/addblink\.php
        ^https?://www\.blogger\.com/feeds/\d+/\d+/comments/default/\d+
        ^https?://www\.blogger\.com/feeds/\d+/posts/default/\d+
        ^https?://www\.dreamwidth\.org/tools/(memadd|tellafriend)\?
        ^https?://www\.flickr\.com/(explore/|photos/[^/]+/(sets/\d+/(page\d+/)?)?)\d+_[a-f0-9]+(_[a-z])?\.jpg$
        ^https?://www\.flickr\.com/change_language\.gne
        ^https?://www\.google\.com/(reader/link\?|buzz/post\?)
        ^https?://www\.google\.com/accounts/AccountChooser
        ^https?://www\.google\.com/bookmarks/mark\?
        ^https?://www\.google\.com/recaptcha/(api|mailhide/d\?)
        ^https?://www\.infomous\.com/cloud_widget/lib/lib/
        ^https?://www\.khaleejtimes\.com/.+/images/.+/images/
        ^https?://www\.khaleejtimes\.com/.+/imgactv/.+/imgactv/
        ^https?://www\.khaleejtimes\.com/.+/kt_.+/kt_
        ^https?://www\.khanacademy\.org(/.*|/)page/%d/$
        ^https?://www\.khanacademy\.org/(wp-admin/|wp-login\.php\?)
        ^https?://www\.khanacademy\.org/.*%5Cx26route=/archive
        ^https?://www\.khanacademy\.org/.*&amp;amp;amp;
        ^https?://www\.khanacademy\.org/.*(\?|%5Cx26)route=(/page/:page|/archive/:year/:month|/tagged/:tag|/post/:id|/image/:post_id)
        ^https?://www\.khanacademy\.org/.*amp%3Bamp%3Bamp%3B
        ^https?://www\.khanacademy\.org/.+/%3Ca%20href=
        ^https?://www\.khanacademy\.org/.+/jetpack-comment/\?blogid=\d+&postid=\d+
        ^https?://www\.khanacademy\.org/.+/plugins/ultimate-social-media-plus/.+/like/like/
        ^https?://www\.khanacademy\.org/.+/quote-comment-\d+/$
        ^https?://www\.khanacademy\.org/.+[\?&](replyto(com)?|like_comment)=\d+
        ^https?://www\.khanacademy\.org/.+[\?&]mode=reply
        ^https?://www\.khanacademy\.org/.+[\?&]share=[a-z]{4,}
        ^https?://www\.khanacademy\.org/.+\?showComment(=|%5C)\d+
        ^https?://www\.khanacademy\.org/search(/label/[^\?]+|\?q=[^&]+|)[\?&]updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
        ^https?://www\.linkedin\.com/(cws/share|shareArticle)\?
        ^https?://www\.livejournal\.com/(tools/memadd|update|(identity/)?login)\.bml\?
        ^https?://www\.netvibes\.com/subscribe\.php\?
        ^https?://www\.newsvine\.com/_wine/save\?
        ^https?://www\.odnoklassniki\.ru/dk\?st\.cmd=addShare
        ^https?://www\.warnerbros\.com/\d+$
        ^https?://www\.youtube\.com/.*\[\[.+\]\]
        ^https?://www\.youtube\.com/.*\{\{.+\}\}
        ^https?://zakladki\.yandex\.ru/newlink\.xml\?
/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/sql/coercions.py:308: SAWarning: implicitly coercing SELECT object to scalar subquery; please use the .scalar_subquery() method to produce a scalar subquery.
  "implicitly coercing SELECT object to scalar subquery; "
Disconnected from ws:// server: ConnectionRefusedError(111, "Connect call failed ('127.0.0.1', 29000)")
Imported /home/thetechrobo/grab-site/www.khanacademy.org-test-prep-mcat-2021-05-05-37e0452b/ignores
Using these 191 ignores:
        %25252525
        /%22%20\+[^/]+\+%20%22
        /%22\+[^/]+\+%22
        /%27%20\+[^/]+\+%20%27
        /%27\+[^/]+\+%27
        /%5C/%5C/
        /'\+[^/]+\+'
        /(%5C)+(%22|%27)
        /App_Themes/.+/App_Themes/
        /\\+(%22|%27)
        /\\+["']
        /\\/\\/
        /bxSlider/.+/bxSlider/
        /bxSlider/bxSlider/
        /clientscript/.+/clientscript/clientscript/
        /clientscript/clientscript/.+/clientscript/
        /clientscript/clientscript/clientscript/
        /css/.+/css/css/
        /css/css/.+/css/
        /css/css/css/
        /images/.+/images/images/
        /images/images/.+/images/
        /images/images/images/
        /img/.+/img/img/
        /img/img/.+/img/
        /img/img/img/
        /js/.+/js/js/
        /js/js/.+/js/
        /js/js/js/
        /lib/exe/.*lib[-_]exe[-_]lib[-_]exe[-_]
        /scripts/.+/scripts/scripts/
        /scripts/scripts/.+/scripts/
        /scripts/scripts/scripts/
        /slides/.+/slides/slides/
        /slides/slides/.+/slides/
        /slides/slides/slides/
        /styles/.+/styles/styles/
        /styles/styles/.+/styles/
        /styles/styles/styles/
        ^https?://((s-)?static\.ak\.fbcdn\.net|(connect\.|www\.)?facebook\.com)/connect\.php/js/.*rsrc\.php
        ^https?://([^/]+\.)?gdcvault\.com(/.*/|/)(fonts(/.*/|/)fonts/|css(/.*/|/)css/|img(/.*/|/)img/)
        ^https?://([^\./]+\.)?stream\.publicradio\.org/
        ^https?://([^\.]+\.)?pinterest\.com/pin/create/
        ^https?://(\d|www|secure)\.gravatar\.com/avatar/ad516503a11cd5ca435acc9bb6523536
        ^https?://(apis|plusone)\.google\.com/_/\+1/
        ^https?://(audio\d?|nfw)\.video\.ria\.ru/
        ^https?://(ssl\.|www\.)?reddit\.com/(login\?dest=|submit\?|static/button/button)
        ^https?://(www\.)?(megaupload|filesonic|wupload)\.com/
        ^https?://(www\.)?digg\.com/submit\?
        ^https?://(www\.)?facebook\.com/(plugins/(share_button|like(box)?)\.php|sharer/sharer\.php|sharer?\.php|dialog/(feed|share))\?
        ^https?://(www\.)?facebook\.com/v[\d\.]+/plugins/like\.php
        ^https?://(www\.)?friendfeed\.com/share\?
        ^https?://(www\.)?instapaper\.com/hello2\?
        ^https?://(www\.)?myspace\.com/Modules/PostTo/
        ^https?://(www\.)?stumbleupon\.com/(submit\?|badge/embed/)
        ^https?://(www\.)?technorati\.com/faves/?\?add=
        ^https?://(www\.)?twitter\.com/(share\?|intent/((re)?tweet|favorite)|home/?\?status=|\?status=)
        ^https?://(www\.)?xing\.com/(app/user\?op=share|social_plugins/share\?)
        ^https?://(www|draft)\.blogger\.com/(navbar\.g|post-edit\.g|delete-comment\.g|comment-iframe\.g|share-post\.g|email-post\.g|blog-this\.g|delete-backlink\.g|rearrange|blog_this\.pyra)\?
        ^https?://(www|px\.srvcs)\.tumblr\.com/(impixu\?|share(/link/?)?\?|reblog/)
        ^https?://(www|ssl)\.google-analytics\.com/(r/)?(__utm\.gif|collect\?)
        ^https?://.+/.+/disqus\.com/forums/$
        ^https?://.+/js-agent\.newrelic\.com/nr-\d{3}(\.min)?\.js$
        ^https?://.+/js/chartbeat\.js$
        ^https?://.+/stats\.g\.doubleclick\.net/dc\.js$
        ^https?://.+\.blogspot\.(com|in|com\.au|co\.uk|jp|co\.nz|ca|de|it|fr|se|sg|es|pt|com\.br|ar|mx|kr)/(\d{4}/\d{2}/|search/label/)(CSI/$|.*/CSI/CSI/CSI/)
        ^https?://[^/]*musicproxy\.s12\.de/
        ^https?://[^/]+/.+/CaptchaImage\.axd
        ^https?://[^/]+/anony/mjpg\.cgi$
        ^https?://[^/]+/mjpg/video\.mjpg
        ^https?://[^/]+\.akadostream\.ru(:\d+)?/
        ^https?://[^/]+\.corp\.ne1\.yahoo\.com/
        ^https?://[^/]+\.facebook\.com/login\.php
        ^https?://[^/]+\.gaduradio\.pl/
        ^https?://[^/]+\.libsyn\.com/.+/%2[02]https?:/
        ^https?://[^/]+\.rastream\.com(:\d+)?/
        ^https?://[^/]+\.services\.livejournal\.com/ljcounter
        ^https?://[^/]+\.streamtheworld\.com/
        ^https?://[^/]+\.xiti\.com/hit\.xiti\?
        ^https?://[^\./]+\.radioscoop\.(com|net):\d+/
        ^https?://[^\./]+\.streamchan\.org:\d+/
        ^https?://[^\.]+\.livejournal\.com/.+/\*sup_ru/ru/UTF-8/
        ^https?://[^\.]+\.livejournal\.com/.+http://[^\.]+\.livejournal\.com/
        ^https?://[a-z0-9]+\.cdn\.dvmr\.fr(:\d+)?/.+\.mp3
        ^https?://\d+\.media\.tumblr\.com/avatar_.+_16\.pn[gj]$
        ^https?://accounts\.google\.com/(SignUp|ServiceLogin|AccountChooser|a/UniversalLogin)
        ^https?://add\.my\.yahoo\.com/(rss|content)\?
        ^https?://air\.radiorecord\.ru(:\d+)?/
        ^https?://alb\.reddit\.com/
        ^https?://api\.addthis\.com/
        ^https?://audio\d?\.radioreference\.com/
        ^https?://audiots\.scdn\.arkena\.com/
        ^https?://av\.rasset\.ie/av/live/
        ^https?://b\.hatena\.ne\.jp/add\?
        ^https?://b\.scorecardresearch\.com/
        ^https?://beacon\.wikia-services\.com/
        ^https?://bookmark\.naver\.com/post\?
        ^https?://bufferapp\.com/add\?
        ^https?://connect\.mail\.ru/share\?
        ^https?://csp\.cyworld\.com/bi/bi_recommend_pop\.php\?
        ^https?://del\.icio\.us/post\?
        ^https?://delicious\.com/(save|post)\?
        ^https?://download\.ted\.com/
        ^https?://flattr.com/submit/auto\?
        ^https?://gcnplayer\.gcnlive\.com/.+
        ^https?://geo\.yahoo\.com/b\?
        ^https?://getpocket\.com/(save|edit)/?\?
        ^https?://i\.dev\.cdn\.turner\.com/
        ^https?://imageshack\.com/lost$
        ^https?://iwiw\.hu/pages/share/share\.jsp\?
        ^https?://mail\.google\.com/mail/
        ^https?://media\.opb\.org/clips/embed/.+\.js$
        ^https?://medium\.com/_/(vote|bookmark|subscribe)/
        ^https?://memori(\.qip)?\.ru/link/\?
        ^https?://mp3\.ffh\.de/
        ^https?://mp3tslg\.tdf-cdn\.com/
        ^https?://myweb2\.search\.yahoo\.com/myresults/bookmarklet\?
        ^https?://news\.ycombinator\.com/submitlink\?
        ^https?://p\.opt\.fimserve\.com/
        ^https?://photobucket\.com/.+/albums/.+/albums/
        ^https?://pixel\.(quantserve|wp)\.com/
        ^https?://pixel\.blog\.hu/
        ^https?://pixel\.redditmedia\.com/pixel/
        ^https?://platform\d?\.twitter\.com/widgets/tweet_button.html\?
        ^https?://play(\d+)?\.radio13\.ru:8000/
        ^https?://plus\.google\.com/share\?
        ^https?://posterous\.com/share\?
        ^https?://prod-preview\.wired\.com/
        ^https?://pub(\d+)?\.di\.fm/
        ^https?://r-a-d\.io/.+\.mp3$
        ^https?://r-login\.wordpress\.com/remote-login\.php
        ^https?://relay\.broadcastify\.com/
        ^https?://reporter\.es\.msn\.com/\?fn=contribute
        ^https?://s\d+\.sitemeter\.com/(js/counter\.js|meter\.asp)
        ^https?://service\.weibo\.com/share/share\.php\?
        ^https?://share\.flipboard\.com/bookmarklet/popout\?
        ^https?://social-plugins\.line\.me/lineit/share
        ^https?://sphinn\.com/index\.php\?c=post&m=submit&
        ^https?://static\.licdn\.com/sc/p/.+/f//
        ^https?://static\.licdn\.com/sc/p/com\.linkedin\.nux(:|%3A)nux-static-content(\+|%2B)[\d\.]+/f/
        ^https?://stream(\d+)?\.media\.rambler\.ru/
        ^https?://telegram\.me/share/url\?
        ^https?://tm\.uol\.com\.br/h/.+/h/
        ^https?://tmz\.vo\.llnwd\.net/
        ^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/
        ^https?://video-subtitle\.tedcdn\.com/
        ^https?://vkontakte\.ru/share\.php\?
        ^https?://vuible\.com/pins-settings/
        ^https?://web\.archive\.org/web/[^/]+/https?\:/[^/]+\.addthis\.com/.+/static/.+/static/
        ^https?://wow\.ya\.ru/posts_(add|share)_link\.xml\?
        ^https?://www\.addthis\.com/bookmark\.php\?
        ^https?://www\.addtoany\.com/(add_to/|share_save\?)
        ^https?://www\.amazon\.com/.+/logging/log-action\.html
        ^https?://www\.blinklist\.com/index\.php\?Action=Blink/addblink\.php
        ^https?://www\.blogger\.com/feeds/\d+/\d+/comments/default/\d+
        ^https?://www\.blogger\.com/feeds/\d+/posts/default/\d+
        ^https?://www\.dreamwidth\.org/tools/(memadd|tellafriend)\?
        ^https?://www\.flickr\.com/(explore/|photos/[^/]+/(sets/\d+/(page\d+/)?)?)\d+_[a-f0-9]+(_[a-z])?\.jpg$
        ^https?://www\.flickr\.com/change_language\.gne
        ^https?://www\.google\.com/(reader/link\?|buzz/post\?)
        ^https?://www\.google\.com/accounts/AccountChooser
        ^https?://www\.google\.com/bookmarks/mark\?
        ^https?://www\.google\.com/recaptcha/(api|mailhide/d\?)
        ^https?://www\.infomous\.com/cloud_widget/lib/lib/
        ^https?://www\.khaleejtimes\.com/.+/images/.+/images/
        ^https?://www\.khaleejtimes\.com/.+/imgactv/.+/imgactv/
        ^https?://www\.khaleejtimes\.com/.+/kt_.+/kt_
        ^https?://www\.khanacademy\.org(/.*|/)page/%d/$
        ^https?://www\.khanacademy\.org/(wp-admin/|wp-login\.php\?)
        ^https?://www\.khanacademy\.org/.*%5Cx26route=/archive
        ^https?://www\.khanacademy\.org/.*&amp;amp;amp;
        ^https?://www\.khanacademy\.org/.*(\?|%5Cx26)route=(/page/:page|/archive/:year/:month|/tagged/:tag|/post/:id|/image/:post_id)
        ^https?://www\.khanacademy\.org/.*amp%3Bamp%3Bamp%3B
        ^https?://www\.khanacademy\.org/.+/%3Ca%20href=
        ^https?://www\.khanacademy\.org/.+/jetpack-comment/\?blogid=\d+&postid=\d+
        ^https?://www\.khanacademy\.org/.+/plugins/ultimate-social-media-plus/.+/like/like/
        ^https?://www\.khanacademy\.org/.+/quote-comment-\d+/$
        ^https?://www\.khanacademy\.org/.+[\?&](replyto(com)?|like_comment)=\d+
        ^https?://www\.khanacademy\.org/.+[\?&]mode=reply
        ^https?://www\.khanacademy\.org/.+[\?&]share=[a-z]{4,}
        ^https?://www\.khanacademy\.org/.+\?showComment(=|%5C)\d+
        ^https?://www\.khanacademy\.org/search(/label/[^\?]+|\?q=[^&]+|)[\?&]updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
        ^https?://www\.linkedin\.com/(cws/share|shareArticle)\?
        ^https?://www\.livejournal\.com/(tools/memadd|update|(identity/)?login)\.bml\?
        ^https?://www\.netvibes\.com/subscribe\.php\?
        ^https?://www\.newsvine\.com/_wine/save\?
        ^https?://www\.odnoklassniki\.ru/dk\?st\.cmd=addShare
        ^https?://www\.warnerbros\.com/\d+$
        ^https?://www\.youtube\.com/.*\[\[.+\]\]
        ^https?://www\.youtube\.com/.*\{\{.+\}\}
        ^https?://zakladki\.yandex\.ru/newlink\.xml\?
Disconnected from ws:// server: ConnectionRefusedError(111, "Connect call failed ('127.0.0.1', 29000)")
Imported /home/thetechrobo/grab-site/www.khanacademy.org-test-prep-mcat-2021-05-05-37e0452b/max_content_length
https://www.khanacademy.org/test-prep/mcat ...
/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))
Disconnected from ws:// server: ConnectionRefusedError(111, "Connect call failed ('127.0.0.1', 29000)")
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1936, in _do_pre_synchronize_evaluate
    eval_condition = evaluator_compiler.process(*crit)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/evaluator.py", line 85, in process
    return meth(clause)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/evaluator.py", line 181, in visit_binary
    map(self.process, [clause.left, clause.right])
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/evaluator.py", line 85, in process
    return meth(clause)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/evaluator.py", line 88, in visit_grouping
    return self.process(clause.element)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/evaluator.py", line 83, in process
    "Cannot evaluate %s" % type(clause).__name__
sqlalchemy.orm.evaluator.UnevaluatableError: Cannot evaluate Select

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/application/tasks/download.py", line 421, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 91, in process
    return (yield from session.process())
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 185, in process
    yield from self._process_loop()
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 244, in _process_loop
    exit_early, wait_time = yield from self._fetch_one(cast(Request, self._item_session.request))
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 308, in _fetch_one
    action = self._handle_response(request, response)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/processor/web.py", line 410, in _handle_response
    self._item_session.update_record_value(status_code=response.status_code)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/pipeline/session.py", line 176, in update_record_value
    self.app_session.factory['URLTable'].update_one(self.url_record.url, **kwargs)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/database/wrap.py", line 72, in update_one
    return self.url_table.update_one(*args, **kwargs)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/wpull/database/sqltable.py", line 196, in update_one
    session.execute(query)
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1638, in execute
    _parent_execute_state is not None,
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1821, in orm_pre_session_exec
    update_options,
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1949, in _do_pre_synchronize_evaluate
    from_=err,
  File "/home/thetechrobo/gs-venv/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
sqlalchemy.exc.InvalidRequestError: Could not evaluate current criteria in Python: "Cannot evaluate Select". Specify 'fetch' or False for the synchronize_session execution option.
CRITICAL Sorry, Wpull unexpectedly crashed.

Sorry if this is a stupid question; I'm a newcomer.

ivan commented 3 years ago

Thank you for the report. I believe this is caused by an incompatibility with sqlalchemy 1.4, as found in https://github.com/ArchiveTeam/wpull/issues/463.

I will try to get this fixed soon. In the meantime, the Nix-based grab-site install might work (it still has the older sqlalchemy): https://github.com/ArchiveTeam/grab-site#install-on-another-distribution-lacking-python-37x

TheTechRobo commented 3 years ago

Running pip3 uninstall sqlalchemy followed by pip3 install sqlalchemy==1.3.\* works for me. I'm still puzzled by the message: Disconnected from ws:// server: ConnectionRefusedError(111, "Connect call failed ('127.0.0.1', 29000)"). I assume this is a separate issue?
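
For anyone checking whether their environment is likely to hit this crash, here is a minimal sketch. It assumes, based only on this thread, that sqlalchemy 1.4 and newer trigger the error and that 1.3.x does not; it has not been checked against other releases, and it only matters for grab-site versions that predate the fix. The function name sqlalchemy_is_affected is made up for illustration.

```python
import re


def sqlalchemy_is_affected(version: str) -> bool:
    """Return True if the version string is 1.4 or newer.

    Assumption from this thread: 1.4+ crashes with
    "Cannot evaluate Select", while 1.3.x works.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 4)


if __name__ == "__main__":
    try:
        import sqlalchemy  # only needed for the live check below
    except ImportError:
        print("sqlalchemy is not installed in this environment")
    else:
        installed = sqlalchemy.__version__
        if sqlalchemy_is_affected(installed):
            print(f"sqlalchemy {installed}: likely affected, consider 1.3.x")
        else:
            print(f"sqlalchemy {installed}: should not hit this error")
```

Run it with the same interpreter that grab-site uses (here, the one in the gs-venv virtualenv), otherwise it reports on a different sqlalchemy install.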

TheTechRobo commented 3 years ago

Figured it out. gs-server wasn't running.
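
A quick way to confirm this before starting a crawl is to test whether anything accepts TCP connections on the address from the log message. The sketch below is only an illustration: the host and port are copied from the error above (127.0.0.1, 29000), the function name dashboard_is_listening is hypothetical, and a successful connection only shows that some process is listening there, not that it is gs-server.

```python
import socket


def dashboard_is_listening(host: str = "127.0.0.1",
                           port: int = 29000,
                           timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (including
        # ConnectionRefusedError) when nothing is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if dashboard_is_listening():
        print("Something is listening on 127.0.0.1:29000")
    else:
        print("Nothing is listening on 127.0.0.1:29000; is gs-server running?")
```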

ivan commented 3 years ago

This should be fixed in grab-site 2.2.1.