ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Segfault when run under Windows Subsystem for Linux #102

Closed by ivan 6 years ago

ivan commented 7 years ago

A user reports that using grab-site --no-dupespotter avoids this problem, so it is likely lmdb-related:

$ ~/.local/bin/grab-site https://schoology.hsd.k12.or.us/ --dir /mnt/c/Users/Tyler/Documents/grabsite --finished-warc-dir /mnt/c/Users/Tyler/Documents/warc --wpull-args=--load-cookies=/mnt/c/Users/Tyler/Documents/cookies.txt --concurrency 1 --delay 10500 --ua "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"
psutil: No module named 'psutil'. Resource monitoring will be unavailable.
Created lmdb db with map_size=2147483647
Picked up the changes to /mnt/c/Users/Tyler/Documents/grabsite/igsets
Using these 180 ignores:
        %25252525
        /%22%20\+[^/]+\+%20%22
        /%22\+[^/]+\+%22
        /%27%20\+[^/]+\+%20%27
        /%27\+[^/]+\+%27
        /%5C/%5C/
        /'\+[^/]+\+'
        /(%5C)+(%22|%27)
        /App_Themes/.+/App_Themes/
        /\\+(%22|%27)
        /\\+["']
        /\\/\\/
        /bxSlider/.+/bxSlider/
        /bxSlider/bxSlider/
        /clientscript/.+/clientscript/clientscript/
        /clientscript/clientscript/.+/clientscript/
        /clientscript/clientscript/clientscript/
        /css/.+/css/css/
        /css/css/.+/css/
        /css/css/css/
        /images/.+/images/images/
        /images/images/.+/images/
        /images/images/images/
        /img/.+/img/img/
        /img/img/.+/img/
        /img/img/img/
        /js/.+/js/js/
        /js/js/.+/js/
        /js/js/js/
        /lib/exe/.*lib[-_]exe[-_]lib[-_]exe[-_]
        /scripts/.+/scripts/scripts/
        /scripts/scripts/.+/scripts/
        /scripts/scripts/scripts/
        /slides/.+/slides/slides/
        /slides/slides/.+/slides/
        /slides/slides/slides/
        /styles/.+/styles/styles/
        /styles/styles/.+/styles/
        /styles/styles/styles/
        ^https?://((s-)?static\.ak\.fbcdn\.net|(connect\.|www\.)?facebook\.com)/connect\.php/js/.*rsrc\.php
        ^https?://([^/]+\.)?gdcvault\.com(/.*/|/)(fonts(/.*/|/)fonts/|css(/.*/|/)css/|img(/.*/|/)img/)
        ^https?://([^\./]+\.)?stream\.publicradio\.org/
        ^https?://(\d|www|secure)\.gravatar\.com/avatar/ad516503a11cd5ca435acc9bb6523536
        ^https?://(apis|plusone)\.google\.com/_/\+1/
        ^https?://(audio\d?|nfw)\.video\.ria\.ru/
        ^https?://(ssl\.|www\.)?reddit\.com/(login\?dest=|submit\?|static/button/button)
        ^https?://(www\.)?digg\.com/submit\?
        ^https?://(www\.)?facebook\.com/(plugins/like(box)?\.php|sharer/sharer\.php|sharer?\.php|dialog/(feed|share))\?
        ^https?://(www\.)?filesonic\.com/
        ^https?://(www\.)?friendfeed\.com/share\?
        ^https?://(www\.)?instapaper\.com/hello2\?
        ^https?://(www\.)?megaupload\.com/
        ^https?://(www\.)?myspace\.com/Modules/PostTo/
        ^https?://(www\.)?pinterest\.com/pin/create/
        ^https?://(www\.)?stumbleupon\.com/(submit\?|badge/embed/)
        ^https?://(www\.)?technorati\.com/faves/?\?add=
        ^https?://(www\.)?twitter\.com/(share\?|intent/((re)?tweet|favorite)|home/?\?status=|\?status=)
        ^https?://(www\.)?wupload\.com/
        ^https?://(www\.)?xing\.com/(app/user\?op=share|social_plugins/share\?)
        ^https?://(www|draft)\.blogger\.com/(navbar\.g|post-edit\.g|delete-comment\.g|comment-iframe\.g|share-post\.g|email-post\.g|blog-this\.g|delete-backlink\.g|rearrange|blog_this\.pyra)\?
        ^https?://(www|px\.srvcs)\.tumblr\.com/(impixu\?|share(/link/?)?\?|reblog/)
        ^https?://(www|ssl)\.google-analytics\.com/(r/)?(__utm\.gif|collect\?)
        ^https?://.+/.+/disqus\.com/forums/$
        ^https?://.+/js-agent\.newrelic\.com/nr-\d{3}(\.min)?\.js$
        ^https?://.+/js/chartbeat\.js$
        ^https?://.+/stats\.g\.doubleclick\.net/dc\.js$
        ^https?://.+\.blogspot\.(com|in|com\.au|co\.uk|jp|co\.nz|ca|de|it|fr|se|sg|es|pt|com\.br|ar|mx|kr)/(\d{4}/\d{2}/|search/label/)(CSI/$|.*/CSI/CSI/CSI/)
        ^https?://[^/]*musicproxy\.s12\.de/
        ^https?://[^/]+/.+/CaptchaImage\.axd
        ^https?://[^/]+/anony/mjpg\.cgi$
        ^https?://[^/]+\.akadostream\.ru(:\d+)?/
        ^https?://[^/]+\.corp\.ne1\.yahoo\.com/
        ^https?://[^/]+\.facebook\.com/login\.php
        ^https?://[^/]+\.gaduradio\.pl/
        ^https?://[^/]+\.libsyn\.com/.+/%2[02]https?:/
        ^https?://[^/]+\.rastream\.com(:\d+)?/
        ^https?://[^/]+\.services\.livejournal\.com/ljcounter
        ^https?://[^/]+\.streamtheworld\.com/
        ^https?://[^/]+\.xiti\.com/hit\.xiti\?
        ^https?://[^\./]+\.radioscoop\.(com|net):\d+/
        ^https?://[^\./]+\.streamchan\.org:\d+/
        ^https?://[^\.]+\.livejournal\.com/.+/\*sup_ru/ru/UTF-8/
        ^https?://[^\.]+\.livejournal\.com/.+http://[^\.]+\.livejournal\.com/
        ^https?://\d+\.media\.tumblr\.com/avatar_.+_16\.png$
        ^https?://add\.my\.yahoo\.com/(rss|content)\?
        ^https?://air\.radiorecord\.ru(:\d+)?/
        ^https?://api\.addthis\.com/
        ^https?://audio\d?\.radioreference\.com/
        ^https?://audiots\.scdn\.arkena\.com/
        ^https?://av\.rasset\.ie/av/live/
        ^https?://b\.hatena\.ne\.jp/add\?
        ^https?://b\.scorecardresearch\.com/
        ^https?://bookmark\.naver\.com/post\?
        ^https?://bufferapp\.com/add\?
        ^https?://connect\.mail\.ru/share\?
        ^https?://csp\.cyworld\.com/bi/bi_recommend_pop\.php\?
        ^https?://del\.icio\.us/post\?
        ^https?://delicious\.com/(save|post)\?
        ^https?://download\.ted\.com/
        ^https?://flattr.com/submit/auto\?
        ^https?://gcnplayer\.gcnlive\.com/.+
        ^https?://geo\.yahoo\.com/b\?
        ^https?://getpocket\.com/save/?\?
        ^https?://i\.dev\.cdn\.turner\.com/
        ^https?://imageshack\.com/lost$
        ^https?://iwiw\.hu/pages/share/share\.jsp\?
        ^https?://mail\.google\.com/mail/
        ^https?://media\.opb\.org/clips/embed/.+\.js$
        ^https?://memori(\.qip)?\.ru/link/\?
        ^https?://mp3\.ffh\.de/
        ^https?://mp3tslg\.tdf-cdn\.com/
        ^https?://myweb2\.search\.yahoo\.com/myresults/bookmarklet\?
        ^https?://news\.ycombinator\.com/submitlink\?
        ^https?://p\.opt\.fimserve\.com/
        ^https?://photobucket\.com/.+/albums/.+/albums/
        ^https?://pixel\.blog\.hu/
        ^https?://pixel\.quantserve\.com/
        ^https?://pixel\.redditmedia\.com/pixel/
        ^https?://platform\d?\.twitter\.com/widgets/tweet_button.html\?
        ^https?://play(\d+)?\.radio13\.ru:8000/
        ^https?://plus\.google\.com/share\?
        ^https?://posterous\.com/share\?
        ^https?://prod-preview\.wired\.com/
        ^https?://pub(\d+)?\.di\.fm/
        ^https?://r-a-d\.io/.+\.mp3$
        ^https?://r-login\.wordpress\.com/remote-login\.php
        ^https?://relay\.broadcastify\.com/
        ^https?://reporter\.es\.msn\.com/\?fn=contribute
        ^https?://service\.weibo\.com/share/share\.php\?
        ^https?://share\.flipboard\.com/bookmarklet/popout\?
        ^https?://sphinn\.com/index\.php\?c=post&m=submit&
        ^https?://static\.licdn\.com/sc/p/.+/f//
        ^https?://static\.licdn\.com/sc/p/com\.linkedin\.nux(:|%3A)nux-static-content(\+|%2B)[\d\.]+/f/
        ^https?://stream(\d+)?\.media\.rambler\.ru/
        ^https?://tm\.uol\.com\.br/h/.+/h/
        ^https?://tmz\.vo\.llnwd\.net/
        ^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/
        ^https?://video-subtitle\.tedcdn\.com/
        ^https?://vkontakte\.ru/share\.php\?
        ^https?://vuible\.com/pins-settings/
        ^https?://web\.archive\.org/web/[^/]+/https?\:/[^/]+\.addthis\.com/.+/static/.+/static/
        ^https?://wow\.ya\.ru/posts_(add|share)_link\.xml\?
        ^https?://www\.addthis\.com/bookmark\.php\?
        ^https?://www\.addtoany\.com/(add_to/|share_save\?)
        ^https?://www\.blinklist\.com/index\.php\?Action=Blink/addblink\.php
        ^https?://www\.blogger\.com/feeds/\d+/\d+/comments/default/\d+
        ^https?://www\.blogger\.com/feeds/\d+/posts/default/\d+
        ^https?://www\.dreamwidth\.org/tools/(memadd|tellafriend)\?
        ^https?://www\.flickr\.com/(explore/|photos/[^/]+/(sets/\d+/(page\d+/)?)?)\d+_[a-f0-9]+(_[a-z])?\.jpg$
        ^https?://www\.flickr\.com/change_language\.gne
        ^https?://www\.google\.((com|ad|ae|al|am|as|at|az|ba|be|bf|bg|bi|bj|bs|bt|by|ca|cd|cf|cg|ch|ci|cl|cm|cn|cv|cz|de|dj|dk|dm|dz|ee|es|fi|fm|fr|ga|ge|gg|gl|gm|gp|gr|gy|hn|hr|ht|hu|ie|im|iq|is|it|je|jo|ki|kg|kz|la|li|lk|lt|lu|lv|md|me|mg|mk|ml|mn|ms|mu|mv|mw|ne|nl|no|nr|nu|pl|pn|ps|pt|ro|ru|rw|sc|se|sh|si|sk|sn|so|sm|sr|st|td|tg|tk|tl|tm|tn|to|tt|vg|vu|ws|rs|cat)|(com\.(af|ag|ai|ar|au|bd|bh|bn|bo|br|bz|co|cu|cy|do|ec|eg|et|fj|gh|gi|gt|hk|jm|kh|kw|lb|ly|mm|mt|mx|my|na|nf|ng|ni|np|om|pa|pe|pg|ph|pk|pr|py|qa|sa|sb|sg|sl|sv|tj|tr|tw|ua|uy|vc|vn))|(co\.(ao|bw|ck|cr|id|il|in|jp|ke|kr|ls|ma|mz|nz|th|tz|ug|uk|uz|ve|vi|za|zm|zw)))/finance\?noIL=1&q=[^&]+&ei=
        ^https?://www\.google\.com/(reader/link\?|buzz/post\?)
        ^https?://www\.google\.com/bookmarks/mark\?
        ^https?://www\.google\.com/recaptcha/(api|mailhide/d\?)
        ^https?://www\.infomous\.com/cloud_widget/lib/lib/
        ^https?://www\.khaleejtimes\.com/.+/images/.+/images/
        ^https?://www\.khaleejtimes\.com/.+/imgactv/.+/imgactv/
        ^https?://www\.khaleejtimes\.com/.+/kt_.+/kt_
        ^https?://www\.linkedin\.com/(cws/share|shareArticle)\?
        ^https?://www\.livejournal\.com/(tools/memadd|update|(identity/)?login)\.bml\?
        ^https?://www\.netvibes\.com/subscribe\.php\?
        ^https?://www\.newsvine\.com/_wine/save\?
        ^https?://www\.odnoklassniki\.ru/dk\?st\.cmd=addShare
        ^https?://www\.warnerbros\.com/\d+$
        ^https?://zakladki\.yandex\.ru/newlink\.xml\?
        ^https?://{primary_netloc}(/.*|/)page/%d/$
        ^https?://{primary_netloc}/(wp-admin/|wp-login\.php\?)
        ^https?://{primary_netloc}/.*%5Cx26route=/archive
        ^https?://{primary_netloc}/.*&amp;amp;amp;
        ^https?://{primary_netloc}/.*(\?|%5Cx26)route=(/page/:page|/archive/:year/:month|/tagged/:tag|/post/:id|/image/:post_id)
        ^https?://{primary_netloc}/.*amp%3Bamp%3Bamp%3B
        ^https?://{primary_netloc}/.+/%3Ca%20href=
        ^https?://{primary_netloc}/.+/jetpack-comment/\?blogid=\d+&postid=\d+
        ^https?://{primary_netloc}/.+/plugins/ultimate-social-media-plus/.+/like/like/
        ^https?://{primary_netloc}/.+/quote-comment-\d+/$
        ^https?://{primary_netloc}/.+[\?&](replyto(com)?|like_comment)=\d+
        ^https?://{primary_netloc}/.+[\?&]mode=reply
        ^https?://{primary_netloc}/.+[\?&]share=[a-z]{4,}
        ^https?://{primary_netloc}/.+\?showComment(=|%5C)\d+
        ^https?://{primary_netloc}/search(/label/[^\?]+|)\?updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
Picked up the changes to /mnt/c/Users/Tyler/Documents/grabsite/max_content_length
Picked up the changes to /mnt/c/Users/Tyler/Documents/grabsite/delay
Picked up the changes to /mnt/c/Users/Tyler/Documents/grabsite/concurrency
Picked up the changes to /mnt/c/Users/Tyler/Documents/grabsite/custom_hooks.py
Manhole[1498932882.2324]: Patched <built-in function fork> and <built-in function fork>.
Manhole[1498932882.2326]: Manhole UDS path: /tmp/manhole-3686
Manhole[1498932882.2335]: Waiting for new connection (in pid:3686) ...
Connected to ws://127.0.0.1:29000
Picked up the changes to /mnt/c/Users/Tyler/Documents/grabsite/ignores
Using these 180 ignores:
        %25252525
        /%22%20\+[^/]+\+%20%22
        /%22\+[^/]+\+%22
        /%27%20\+[^/]+\+%20%27
        /%27\+[^/]+\+%27
        /%5C/%5C/
        /'\+[^/]+\+'
        /(%5C)+(%22|%27)
        /App_Themes/.+/App_Themes/
        /\\+(%22|%27)
        /\\+["']
        /\\/\\/
        /bxSlider/.+/bxSlider/
        /bxSlider/bxSlider/
        /clientscript/.+/clientscript/clientscript/
        /clientscript/clientscript/.+/clientscript/
        /clientscript/clientscript/clientscript/
        /css/.+/css/css/
        /css/css/.+/css/
        /css/css/css/
        /images/.+/images/images/
        /images/images/.+/images/
        /images/images/images/
        /img/.+/img/img/
        /img/img/.+/img/
        /img/img/img/
        /js/.+/js/js/
        /js/js/.+/js/
        /js/js/js/
        /lib/exe/.*lib[-_]exe[-_]lib[-_]exe[-_]
        /scripts/.+/scripts/scripts/
        /scripts/scripts/.+/scripts/
        /scripts/scripts/scripts/
        /slides/.+/slides/slides/
        /slides/slides/.+/slides/
        /slides/slides/slides/
        /styles/.+/styles/styles/
        /styles/styles/.+/styles/
        /styles/styles/styles/
        ^https?://((s-)?static\.ak\.fbcdn\.net|(connect\.|www\.)?facebook\.com)/connect\.php/js/.*rsrc\.php
        ^https?://([^/]+\.)?gdcvault\.com(/.*/|/)(fonts(/.*/|/)fonts/|css(/.*/|/)css/|img(/.*/|/)img/)
        ^https?://([^\./]+\.)?stream\.publicradio\.org/
        ^https?://(\d|www|secure)\.gravatar\.com/avatar/ad516503a11cd5ca435acc9bb6523536
        ^https?://(apis|plusone)\.google\.com/_/\+1/
        ^https?://(audio\d?|nfw)\.video\.ria\.ru/
        ^https?://(ssl\.|www\.)?reddit\.com/(login\?dest=|submit\?|static/button/button)
        ^https?://(www\.)?digg\.com/submit\?
        ^https?://(www\.)?facebook\.com/(plugins/like(box)?\.php|sharer/sharer\.php|sharer?\.php|dialog/(feed|share))\?
        ^https?://(www\.)?filesonic\.com/
        ^https?://(www\.)?friendfeed\.com/share\?
        ^https?://(www\.)?instapaper\.com/hello2\?
        ^https?://(www\.)?megaupload\.com/
        ^https?://(www\.)?myspace\.com/Modules/PostTo/
        ^https?://(www\.)?pinterest\.com/pin/create/
        ^https?://(www\.)?stumbleupon\.com/(submit\?|badge/embed/)
        ^https?://(www\.)?technorati\.com/faves/?\?add=
        ^https?://(www\.)?twitter\.com/(share\?|intent/((re)?tweet|favorite)|home/?\?status=|\?status=)
        ^https?://(www\.)?wupload\.com/
        ^https?://(www\.)?xing\.com/(app/user\?op=share|social_plugins/share\?)
        ^https?://(www|draft)\.blogger\.com/(navbar\.g|post-edit\.g|delete-comment\.g|comment-iframe\.g|share-post\.g|email-post\.g|blog-this\.g|delete-backlink\.g|rearrange|blog_this\.pyra)\?
        ^https?://(www|px\.srvcs)\.tumblr\.com/(impixu\?|share(/link/?)?\?|reblog/)
        ^https?://(www|ssl)\.google-analytics\.com/(r/)?(__utm\.gif|collect\?)
        ^https?://.+/.+/disqus\.com/forums/$
        ^https?://.+/js-agent\.newrelic\.com/nr-\d{3}(\.min)?\.js$
        ^https?://.+/js/chartbeat\.js$
        ^https?://.+/stats\.g\.doubleclick\.net/dc\.js$
        ^https?://.+\.blogspot\.(com|in|com\.au|co\.uk|jp|co\.nz|ca|de|it|fr|se|sg|es|pt|com\.br|ar|mx|kr)/(\d{4}/\d{2}/|search/label/)(CSI/$|.*/CSI/CSI/CSI/)
        ^https?://[^/]*musicproxy\.s12\.de/
        ^https?://[^/]+/.+/CaptchaImage\.axd
        ^https?://[^/]+/anony/mjpg\.cgi$
        ^https?://[^/]+\.akadostream\.ru(:\d+)?/
        ^https?://[^/]+\.corp\.ne1\.yahoo\.com/
        ^https?://[^/]+\.facebook\.com/login\.php
        ^https?://[^/]+\.gaduradio\.pl/
        ^https?://[^/]+\.libsyn\.com/.+/%2[02]https?:/
        ^https?://[^/]+\.rastream\.com(:\d+)?/
        ^https?://[^/]+\.services\.livejournal\.com/ljcounter
        ^https?://[^/]+\.streamtheworld\.com/
        ^https?://[^/]+\.xiti\.com/hit\.xiti\?
        ^https?://[^\./]+\.radioscoop\.(com|net):\d+/
        ^https?://[^\./]+\.streamchan\.org:\d+/
        ^https?://[^\.]+\.livejournal\.com/.+/\*sup_ru/ru/UTF-8/
        ^https?://[^\.]+\.livejournal\.com/.+http://[^\.]+\.livejournal\.com/
        ^https?://\d+\.media\.tumblr\.com/avatar_.+_16\.png$
        ^https?://add\.my\.yahoo\.com/(rss|content)\?
        ^https?://air\.radiorecord\.ru(:\d+)?/
        ^https?://api\.addthis\.com/
        ^https?://audio\d?\.radioreference\.com/
        ^https?://audiots\.scdn\.arkena\.com/
        ^https?://av\.rasset\.ie/av/live/
        ^https?://b\.hatena\.ne\.jp/add\?
        ^https?://b\.scorecardresearch\.com/
        ^https?://bookmark\.naver\.com/post\?
        ^https?://bufferapp\.com/add\?
        ^https?://connect\.mail\.ru/share\?
        ^https?://csp\.cyworld\.com/bi/bi_recommend_pop\.php\?
        ^https?://del\.icio\.us/post\?
        ^https?://delicious\.com/(save|post)\?
        ^https?://download\.ted\.com/
        ^https?://flattr.com/submit/auto\?
        ^https?://gcnplayer\.gcnlive\.com/.+
        ^https?://geo\.yahoo\.com/b\?
        ^https?://getpocket\.com/save/?\?
        ^https?://i\.dev\.cdn\.turner\.com/
        ^https?://imageshack\.com/lost$
        ^https?://iwiw\.hu/pages/share/share\.jsp\?
        ^https?://mail\.google\.com/mail/
        ^https?://media\.opb\.org/clips/embed/.+\.js$
        ^https?://memori(\.qip)?\.ru/link/\?
        ^https?://mp3\.ffh\.de/
        ^https?://mp3tslg\.tdf-cdn\.com/
        ^https?://myweb2\.search\.yahoo\.com/myresults/bookmarklet\?
        ^https?://news\.ycombinator\.com/submitlink\?
        ^https?://p\.opt\.fimserve\.com/
        ^https?://photobucket\.com/.+/albums/.+/albums/
        ^https?://pixel\.blog\.hu/
        ^https?://pixel\.quantserve\.com/
        ^https?://pixel\.redditmedia\.com/pixel/
        ^https?://platform\d?\.twitter\.com/widgets/tweet_button.html\?
        ^https?://play(\d+)?\.radio13\.ru:8000/
        ^https?://plus\.google\.com/share\?
        ^https?://posterous\.com/share\?
        ^https?://prod-preview\.wired\.com/
        ^https?://pub(\d+)?\.di\.fm/
        ^https?://r-a-d\.io/.+\.mp3$
        ^https?://r-login\.wordpress\.com/remote-login\.php
        ^https?://relay\.broadcastify\.com/
        ^https?://reporter\.es\.msn\.com/\?fn=contribute
        ^https?://service\.weibo\.com/share/share\.php\?
        ^https?://share\.flipboard\.com/bookmarklet/popout\?
        ^https?://sphinn\.com/index\.php\?c=post&m=submit&
        ^https?://static\.licdn\.com/sc/p/.+/f//
        ^https?://static\.licdn\.com/sc/p/com\.linkedin\.nux(:|%3A)nux-static-content(\+|%2B)[\d\.]+/f/
        ^https?://stream(\d+)?\.media\.rambler\.ru/
        ^https?://tm\.uol\.com\.br/h/.+/h/
        ^https?://tmz\.vo\.llnwd\.net/
        ^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/
        ^https?://video-subtitle\.tedcdn\.com/
        ^https?://vkontakte\.ru/share\.php\?
        ^https?://vuible\.com/pins-settings/
        ^https?://web\.archive\.org/web/[^/]+/https?\:/[^/]+\.addthis\.com/.+/static/.+/static/
        ^https?://wow\.ya\.ru/posts_(add|share)_link\.xml\?
        ^https?://www\.addthis\.com/bookmark\.php\?
        ^https?://www\.addtoany\.com/(add_to/|share_save\?)
        ^https?://www\.blinklist\.com/index\.php\?Action=Blink/addblink\.php
        ^https?://www\.blogger\.com/feeds/\d+/\d+/comments/default/\d+
        ^https?://www\.blogger\.com/feeds/\d+/posts/default/\d+
        ^https?://www\.dreamwidth\.org/tools/(memadd|tellafriend)\?
        ^https?://www\.flickr\.com/(explore/|photos/[^/]+/(sets/\d+/(page\d+/)?)?)\d+_[a-f0-9]+(_[a-z])?\.jpg$
        ^https?://www\.flickr\.com/change_language\.gne
        ^https?://www\.google\.((com|ad|ae|al|am|as|at|az|ba|be|bf|bg|bi|bj|bs|bt|by|ca|cd|cf|cg|ch|ci|cl|cm|cn|cv|cz|de|dj|dk|dm|dz|ee|es|fi|fm|fr|ga|ge|gg|gl|gm|gp|gr|gy|hn|hr|ht|hu|ie|im|iq|is|it|je|jo|ki|kg|kz|la|li|lk|lt|lu|lv|md|me|mg|mk|ml|mn|ms|mu|mv|mw|ne|nl|no|nr|nu|pl|pn|ps|pt|ro|ru|rw|sc|se|sh|si|sk|sn|so|sm|sr|st|td|tg|tk|tl|tm|tn|to|tt|vg|vu|ws|rs|cat)|(com\.(af|ag|ai|ar|au|bd|bh|bn|bo|br|bz|co|cu|cy|do|ec|eg|et|fj|gh|gi|gt|hk|jm|kh|kw|lb|ly|mm|mt|mx|my|na|nf|ng|ni|np|om|pa|pe|pg|ph|pk|pr|py|qa|sa|sb|sg|sl|sv|tj|tr|tw|ua|uy|vc|vn))|(co\.(ao|bw|ck|cr|id|il|in|jp|ke|kr|ls|ma|mz|nz|th|tz|ug|uk|uz|ve|vi|za|zm|zw)))/finance\?noIL=1&q=[^&]+&ei=
        ^https?://www\.google\.com/(reader/link\?|buzz/post\?)
        ^https?://www\.google\.com/bookmarks/mark\?
        ^https?://www\.google\.com/recaptcha/(api|mailhide/d\?)
        ^https?://www\.infomous\.com/cloud_widget/lib/lib/
        ^https?://www\.khaleejtimes\.com/.+/images/.+/images/
        ^https?://www\.khaleejtimes\.com/.+/imgactv/.+/imgactv/
        ^https?://www\.khaleejtimes\.com/.+/kt_.+/kt_
        ^https?://www\.linkedin\.com/(cws/share|shareArticle)\?
        ^https?://www\.livejournal\.com/(tools/memadd|update|(identity/)?login)\.bml\?
        ^https?://www\.netvibes\.com/subscribe\.php\?
        ^https?://www\.newsvine\.com/_wine/save\?
        ^https?://www\.odnoklassniki\.ru/dk\?st\.cmd=addShare
        ^https?://www\.warnerbros\.com/\d+$
        ^https?://zakladki\.yandex\.ru/newlink\.xml\?
        ^https?://{primary_netloc}(/.*|/)page/%d/$
        ^https?://{primary_netloc}/(wp-admin/|wp-login\.php\?)
        ^https?://{primary_netloc}/.*%5Cx26route=/archive
        ^https?://{primary_netloc}/.*&amp;amp;amp;
        ^https?://{primary_netloc}/.*(\?|%5Cx26)route=(/page/:page|/archive/:year/:month|/tagged/:tag|/post/:id|/image/:post_id)
        ^https?://{primary_netloc}/.*amp%3Bamp%3Bamp%3B
        ^https?://{primary_netloc}/.+/%3Ca%20href=
        ^https?://{primary_netloc}/.+/jetpack-comment/\?blogid=\d+&postid=\d+
        ^https?://{primary_netloc}/.+/plugins/ultimate-social-media-plus/.+/like/like/
        ^https?://{primary_netloc}/.+/quote-comment-\d+/$
        ^https?://{primary_netloc}/.+[\?&](replyto(com)?|like_comment)=\d+
        ^https?://{primary_netloc}/.+[\?&]mode=reply
        ^https?://{primary_netloc}/.+[\?&]share=[a-z]{4,}
        ^https?://{primary_netloc}/.+\?showComment(=|%5C)\d+
        ^https?://{primary_netloc}/search(/label/[^\?]+|)\?updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
https://schoology.hsd.k12.or.us/ ...
Picked up the changes to /mnt/c/Users/Tyler/Documents/grabsite/ignores
Using these 182 ignores:
        %25252525
        /%22%20\+[^/]+\+%20%22
        /%22\+[^/]+\+%22
        /%27%20\+[^/]+\+%20%27
        /%27\+[^/]+\+%27
        /%5C/%5C/
        /'\+[^/]+\+'
        /(%5C)+(%22|%27)
        /App_Themes/.+/App_Themes/
        /\\+(%22|%27)
        /\\+["']
        /\\/\\/
        /bxSlider/.+/bxSlider/
        /bxSlider/bxSlider/
        /clientscript/.+/clientscript/clientscript/
        /clientscript/clientscript/.+/clientscript/
        /clientscript/clientscript/clientscript/
        /css/.+/css/css/
        /css/css/.+/css/
        /css/css/css/
        /images/.+/images/images/
        /images/images/.+/images/
        /images/images/images/
        /img/.+/img/img/
        /img/img/.+/img/
        /img/img/img/
        /js/.+/js/js/
        /js/js/.+/js/
        /js/js/js/
        /lib/exe/.*lib[-_]exe[-_]lib[-_]exe[-_]
        /scripts/.+/scripts/scripts/
        /scripts/scripts/.+/scripts/
        /scripts/scripts/scripts/
        /slides/.+/slides/slides/
        /slides/slides/.+/slides/
        /slides/slides/slides/
        /styles/.+/styles/styles/
        /styles/styles/.+/styles/
        /styles/styles/styles/
        ^https?://((s-)?static\.ak\.fbcdn\.net|(connect\.|www\.)?facebook\.com)/connect\.php/js/.*rsrc\.php
        ^https?://([^/]+\.)?gdcvault\.com(/.*/|/)(fonts(/.*/|/)fonts/|css(/.*/|/)css/|img(/.*/|/)img/)
        ^https?://([^\./]+\.)?stream\.publicradio\.org/
        ^https?://(\d|www|secure)\.gravatar\.com/avatar/ad516503a11cd5ca435acc9bb6523536
        ^https?://(apis|plusone)\.google\.com/_/\+1/
        ^https?://(audio\d?|nfw)\.video\.ria\.ru/
        ^https?://(ssl\.|www\.)?reddit\.com/(login\?dest=|submit\?|static/button/button)
        ^https?://(www\.)?digg\.com/submit\?
        ^https?://(www\.)?facebook\.com/(plugins/like(box)?\.php|sharer/sharer\.php|sharer?\.php|dialog/(feed|share))\?
        ^https?://(www\.)?filesonic\.com/
        ^https?://(www\.)?friendfeed\.com/share\?
        ^https?://(www\.)?instapaper\.com/hello2\?
        ^https?://(www\.)?megaupload\.com/
        ^https?://(www\.)?myspace\.com/Modules/PostTo/
        ^https?://(www\.)?pinterest\.com/pin/create/
        ^https?://(www\.)?stumbleupon\.com/(submit\?|badge/embed/)
        ^https?://(www\.)?technorati\.com/faves/?\?add=
        ^https?://(www\.)?twitter\.com/(share\?|intent/((re)?tweet|favorite)|home/?\?status=|\?status=)
        ^https?://(www\.)?wupload\.com/
        ^https?://(www\.)?xing\.com/(app/user\?op=share|social_plugins/share\?)
        ^https?://(www|draft)\.blogger\.com/(navbar\.g|post-edit\.g|delete-comment\.g|comment-iframe\.g|share-post\.g|email-post\.g|blog-this\.g|delete-backlink\.g|rearrange|blog_this\.pyra)\?
        ^https?://(www|px\.srvcs)\.tumblr\.com/(impixu\?|share(/link/?)?\?|reblog/)
        ^https?://(www|ssl)\.google-analytics\.com/(r/)?(__utm\.gif|collect\?)
        ^https?://.+/.+/disqus\.com/forums/$
        ^https?://.+/js-agent\.newrelic\.com/nr-\d{3}(\.min)?\.js$
        ^https?://.+/js/chartbeat\.js$
        ^https?://.+/stats\.g\.doubleclick\.net/dc\.js$
        ^https?://.+\.blogspot\.(com|in|com\.au|co\.uk|jp|co\.nz|ca|de|it|fr|se|sg|es|pt|com\.br|ar|mx|kr)/(\d{4}/\d{2}/|search/label/)(CSI/$|.*/CSI/CSI/CSI/)
        ^https?://[^/]*musicproxy\.s12\.de/
        ^https?://[^/]+/.+/CaptchaImage\.axd
        ^https?://[^/]+/anony/mjpg\.cgi$
        ^https?://[^/]+\.akadostream\.ru(:\d+)?/
        ^https?://[^/]+\.corp\.ne1\.yahoo\.com/
        ^https?://[^/]+\.facebook\.com/login\.php
        ^https?://[^/]+\.gaduradio\.pl/
        ^https?://[^/]+\.libsyn\.com/.+/%2[02]https?:/
        ^https?://[^/]+\.rastream\.com(:\d+)?/
        ^https?://[^/]+\.services\.livejournal\.com/ljcounter
        ^https?://[^/]+\.streamtheworld\.com/
        ^https?://[^/]+\.xiti\.com/hit\.xiti\?
        ^https?://[^\./]+\.radioscoop\.(com|net):\d+/
        ^https?://[^\./]+\.streamchan\.org:\d+/
        ^https?://[^\.]+\.livejournal\.com/.+/\*sup_ru/ru/UTF-8/
        ^https?://[^\.]+\.livejournal\.com/.+http://[^\.]+\.livejournal\.com/
        ^https?://\d+\.media\.tumblr\.com/avatar_.+_16\.png$
        ^https?://add\.my\.yahoo\.com/(rss|content)\?
        ^https?://air\.radiorecord\.ru(:\d+)?/
        ^https?://api\.addthis\.com/
        ^https?://audio\d?\.radioreference\.com/
        ^https?://audiots\.scdn\.arkena\.com/
        ^https?://av\.rasset\.ie/av/live/
        ^https?://b\.hatena\.ne\.jp/add\?
        ^https?://b\.scorecardresearch\.com/
        ^https?://bookmark\.naver\.com/post\?
        ^https?://bufferapp\.com/add\?
        ^https?://connect\.mail\.ru/share\?
        ^https?://csp\.cyworld\.com/bi/bi_recommend_pop\.php\?
        ^https?://del\.icio\.us/post\?
        ^https?://delicious\.com/(save|post)\?
        ^https?://download\.ted\.com/
        ^https?://flattr.com/submit/auto\?
        ^https?://gcnplayer\.gcnlive\.com/.+
        ^https?://geo\.yahoo\.com/b\?
        ^https?://getpocket\.com/save/?\?
        ^https?://i\.dev\.cdn\.turner\.com/
        ^https?://imageshack\.com/lost$
        ^https?://iwiw\.hu/pages/share/share\.jsp\?
        ^https?://mail\.google\.com/mail/
        ^https?://media\.opb\.org/clips/embed/.+\.js$
        ^https?://memori(\.qip)?\.ru/link/\?
        ^https?://mp3\.ffh\.de/
        ^https?://mp3tslg\.tdf-cdn\.com/
        ^https?://myweb2\.search\.yahoo\.com/myresults/bookmarklet\?
        ^https?://news\.ycombinator\.com/submitlink\?
        ^https?://p\.opt\.fimserve\.com/
        ^https?://photobucket\.com/.+/albums/.+/albums/
        ^https?://pixel\.blog\.hu/
        ^https?://pixel\.quantserve\.com/
        ^https?://pixel\.redditmedia\.com/pixel/
        ^https?://platform\d?\.twitter\.com/widgets/tweet_button.html\?
        ^https?://play(\d+)?\.radio13\.ru:8000/
        ^https?://plus\.google\.com/share\?
        ^https?://posterous\.com/share\?
        ^https?://prod-preview\.wired\.com/
        ^https?://pub(\d+)?\.di\.fm/
        ^https?://r-a-d\.io/.+\.mp3$
        ^https?://r-login\.wordpress\.com/remote-login\.php
        ^https?://relay\.broadcastify\.com/
        ^https?://reporter\.es\.msn\.com/\?fn=contribute
        ^https?://schoology\.hsd\.k12\.or\.us/logout
        ^https?://schoology\.hsd\.k12\.or\.us/settings
        ^https?://service\.weibo\.com/share/share\.php\?
        ^https?://share\.flipboard\.com/bookmarklet/popout\?
        ^https?://sphinn\.com/index\.php\?c=post&m=submit&
        ^https?://static\.licdn\.com/sc/p/.+/f//
        ^https?://static\.licdn\.com/sc/p/com\.linkedin\.nux(:|%3A)nux-static-content(\+|%2B)[\d\.]+/f/
        ^https?://stream(\d+)?\.media\.rambler\.ru/
        ^https?://tm\.uol\.com\.br/h/.+/h/
        ^https?://tmz\.vo\.llnwd\.net/
        ^https?://upload\.wikimedia\.org/wikipedia/[^/]+/thumb/
        ^https?://video-subtitle\.tedcdn\.com/
        ^https?://vkontakte\.ru/share\.php\?
        ^https?://vuible\.com/pins-settings/
        ^https?://web\.archive\.org/web/[^/]+/https?\:/[^/]+\.addthis\.com/.+/static/.+/static/
        ^https?://wow\.ya\.ru/posts_(add|share)_link\.xml\?
        ^https?://www\.addthis\.com/bookmark\.php\?
        ^https?://www\.addtoany\.com/(add_to/|share_save\?)
        ^https?://www\.blinklist\.com/index\.php\?Action=Blink/addblink\.php
        ^https?://www\.blogger\.com/feeds/\d+/\d+/comments/default/\d+
        ^https?://www\.blogger\.com/feeds/\d+/posts/default/\d+
        ^https?://www\.dreamwidth\.org/tools/(memadd|tellafriend)\?
        ^https?://www\.flickr\.com/(explore/|photos/[^/]+/(sets/\d+/(page\d+/)?)?)\d+_[a-f0-9]+(_[a-z])?\.jpg$
        ^https?://www\.flickr\.com/change_language\.gne
        ^https?://www\.google\.((com|ad|ae|al|am|as|at|az|ba|be|bf|bg|bi|bj|bs|bt|by|ca|cd|cf|cg|ch|ci|cl|cm|cn|cv|cz|de|dj|dk|dm|dz|ee|es|fi|fm|fr|ga|ge|gg|gl|gm|gp|gr|gy|hn|hr|ht|hu|ie|im|iq|is|it|je|jo|ki|kg|kz|la|li|lk|lt|lu|lv|md|me|mg|mk|ml|mn|ms|mu|mv|mw|ne|nl|no|nr|nu|pl|pn|ps|pt|ro|ru|rw|sc|se|sh|si|sk|sn|so|sm|sr|st|td|tg|tk|tl|tm|tn|to|tt|vg|vu|ws|rs|cat)|(com\.(af|ag|ai|ar|au|bd|bh|bn|bo|br|bz|co|cu|cy|do|ec|eg|et|fj|gh|gi|gt|hk|jm|kh|kw|lb|ly|mm|mt|mx|my|na|nf|ng|ni|np|om|pa|pe|pg|ph|pk|pr|py|qa|sa|sb|sg|sl|sv|tj|tr|tw|ua|uy|vc|vn))|(co\.(ao|bw|ck|cr|id|il|in|jp|ke|kr|ls|ma|mz|nz|th|tz|ug|uk|uz|ve|vi|za|zm|zw)))/finance\?noIL=1&q=[^&]+&ei=
        ^https?://www\.google\.com/(reader/link\?|buzz/post\?)
        ^https?://www\.google\.com/bookmarks/mark\?
        ^https?://www\.google\.com/recaptcha/(api|mailhide/d\?)
        ^https?://www\.infomous\.com/cloud_widget/lib/lib/
        ^https?://www\.khaleejtimes\.com/.+/images/.+/images/
        ^https?://www\.khaleejtimes\.com/.+/imgactv/.+/imgactv/
        ^https?://www\.khaleejtimes\.com/.+/kt_.+/kt_
        ^https?://www\.linkedin\.com/(cws/share|shareArticle)\?
        ^https?://www\.livejournal\.com/(tools/memadd|update|(identity/)?login)\.bml\?
        ^https?://www\.netvibes\.com/subscribe\.php\?
        ^https?://www\.newsvine\.com/_wine/save\?
        ^https?://www\.odnoklassniki\.ru/dk\?st\.cmd=addShare
        ^https?://www\.warnerbros\.com/\d+$
        ^https?://zakladki\.yandex\.ru/newlink\.xml\?
        ^https?://{primary_netloc}(/.*|/)page/%d/$
        ^https?://{primary_netloc}/(wp-admin/|wp-login\.php\?)
        ^https?://{primary_netloc}/.*%5Cx26route=/archive
        ^https?://{primary_netloc}/.*&amp;amp;amp;
        ^https?://{primary_netloc}/.*(\?|%5Cx26)route=(/page/:page|/archive/:year/:month|/tagged/:tag|/post/:id|/image/:post_id)
        ^https?://{primary_netloc}/.*amp%3Bamp%3Bamp%3B
        ^https?://{primary_netloc}/.+/%3Ca%20href=
        ^https?://{primary_netloc}/.+/jetpack-comment/\?blogid=\d+&postid=\d+
        ^https?://{primary_netloc}/.+/plugins/ultimate-social-media-plus/.+/like/like/
        ^https?://{primary_netloc}/.+/quote-comment-\d+/$
        ^https?://{primary_netloc}/.+[\?&](replyto(com)?|like_comment)=\d+
        ^https?://{primary_netloc}/.+[\?&]mode=reply
        ^https?://{primary_netloc}/.+[\?&]share=[a-z]{4,}
        ^https?://{primary_netloc}/.+\?showComment(=|%5C)\d+
        ^https?://{primary_netloc}/search(/label/[^\?]+|)\?updated-(min|max)=\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d.*&max-results=\d+
http://schoology.hsd.k12.or.us/home.php ...
http://schoology.hsd.k12.or.us/login/ldap?&school=72935507 ...
https://schoology.hsd.k12.or.us/login/ldap?&school=72935507 ...
https://schoology.hsd.k12.or.us/home ...
https://schoology.hsd.k12.or.us/robots.txt ...
Fatal Python error: Segmentation fault

Thread 0x00007ffd9fff0700 (most recent call first):
  File "/usr/lib/python3.4/threading.py", line 290 in wait
  File "/usr/lib/python3.4/queue.py", line 167 in get
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 63 in _worker
  File "/usr/lib/python3.4/threading.py", line 868 in run
  File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
  File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap

Thread 0x00007ffdb4a00700 (most recent call first):
  File "/usr/lib/python3.4/threading.py", line 290 in wait
  File "/usr/lib/python3.4/queue.py", line 167 in get
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 63 in _worker
  File "/usr/lib/python3.4/threading.py", line 868 in run
  File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
  File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap

Thread 0x00007ffdb5250700 (most recent call first):
  File "/usr/lib/python3.4/threading.py", line 290 in wait
  File "/usr/lib/python3.4/queue.py", line 167 in get
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 63 in _worker
  File "/usr/lib/python3.4/threading.py", line 868 in run
  File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
  File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap

Thread 0x00007ffdb5a60700 (most recent call first):
  File "/usr/lib/python3.4/threading.py", line 290 in wait
  File "/usr/lib/python3.4/queue.py", line 167 in get
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 63 in _worker
  File "/usr/lib/python3.4/threading.py", line 868 in run
  File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
  File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap

Thread 0x00007ffdb62f0700 (most recent call first):
  File "/usr/lib/python3.4/threading.py", line 290 in wait
  File "/usr/lib/python3.4/queue.py", line 167 in get
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 63 in _worker
  File "/usr/lib/python3.4/threading.py", line 868 in run
  File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
  File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap

Thread 0x00007ffdb6b00700 (most recent call first):
  File "/usr/lib/python3.4/socket.py", line 187 in accept
  File "/home/tyler/.local/lib/python3.4/site-packages/manhole.py", line 196 in run
  File "/usr/lib/python3.4/threading.py", line 920 in _bootstrap_inner
  File "/usr/lib/python3.4/threading.py", line 888 in _bootstrap

Current thread 0x00007ffe3d750740 (most recent call first):
  File "/home/tyler/.local/lib/python3.4/site-packages/libgrabsite/dupes.py", line 36 in get_old_url
  File "/home/tyler/.local/lib/python3.4/site-packages/libgrabsite/plugin.py", line 34 in scrape_document
  File "/home/tyler/.local/lib/python3.4/site-packages/wpull/processor/web.py", line 455 in _handle_response
  File "/home/tyler/.local/lib/python3.4/site-packages/wpull/processor/web.py", line 339 in _fetch_one
  File "/home/tyler/.local/lib/python3.4/site-packages/trollius/tasks.py", line 259 in _step
  File "/home/tyler/.local/lib/python3.4/site-packages/trollius/tasks.py", line 355 in _wakeup
  File "/home/tyler/.local/lib/python3.4/site-packages/trollius/events.py", line 136 in _run
  File "/home/tyler/.local/lib/python3.4/site-packages/trollius/base_events.py", line 1217 in _run_once
  File "/home/tyler/.local/lib/python3.4/site-packages/trollius/base_events.py", line 309 in run_forever
  File "/home/tyler/.local/lib/python3.4/site-packages/trollius/base_events.py", line 338 in run_until_complete
  File "/home/tyler/.local/lib/python3.4/site-packages/wpull/app.py", line 118 in run_sync
  File "/home/tyler/.local/lib/python3.4/site-packages/wpull/__main__.py", line 40 in main
  File "/home/tyler/.local/lib/python3.4/site-packages/libgrabsite/main.py", line 383 in main
  File "/home/tyler/.local/lib/python3.4/site-packages/click/core.py", line 535 in invoke
  File "/home/tyler/.local/lib/python3.4/site-packages/click/core.py", line 895 in invoke
  File "/home/tyler/.local/lib/python3.4/site-packages/click/core.py", line 697 in main
  File "/home/tyler/.local/lib/python3.4/site-packages/click/core.py", line 722 in __call__
  File "/home/tyler/.local/bin/grab-site", line 4 in <module>
Segmentation fault (core dumped)

ivan commented 6 years ago

I cannot reproduce this segfault in Ubuntu 16.04.3 running on Windows 10 Fall Creators Update.
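
For anyone who still hits this, a minimal sketch that exercises lmdb on its own may help narrow it down, since the current-thread traceback ends in `dupes.py` `get_old_url`. This is not grab-site's code. It assumes the third-party `lmdb` (py-lmdb) package is importable, and the function name `lmdb_roundtrip`, the key, and the value are hypothetical. The default `map_size` is copied from the "Created lmdb db with map_size=2147483647" line in the log. The original report used a `--dir` under `/mnt/c`, so comparing a `db_dir` there against one inside the Linux filesystem might be informative.

```python
import faulthandler
import tempfile

try:
    import lmdb  # third-party py-lmdb package; assumed to be installed
except ImportError:
    lmdb = None


def lmdb_roundtrip(db_dir, map_size=2147483647):
    """Write one key to an lmdb environment in db_dir and read it back.

    Returns the value read, or None if the lmdb package is unavailable.
    The key and value are placeholders, not what grab-site stores.
    """
    if lmdb is None:
        return None
    env = lmdb.open(db_dir, map_size=map_size)
    try:
        # Write transaction: commits when the with-block exits cleanly.
        with env.begin(write=True) as txn:
            txn.put(b"http://example.com/", b"placeholder")
        # Read transaction: get() returns a bytes copy by default.
        with env.begin() as txn:
            return txn.get(b"http://example.com/")
    finally:
        env.close()


if __name__ == "__main__":
    # faulthandler dumps per-thread tracebacks on a segfault, like the report above.
    faulthandler.enable()
    with tempfile.TemporaryDirectory() as tmp:
        print(lmdb_roundtrip(tmp))
```

If this crashes only when `db_dir` points under `/mnt/c`, that would suggest the filesystem rather than grab-site itself, but that is a guess and not something confirmed in this thread.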