ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

[BUG] Twitter pages potentially not downloading correctly #172

Closed: Coloradohusky closed this issue 3 years ago

Coloradohusky commented 3 years ago

I saved a list of Twitter pages (e.g. https://twitter.com/foofighters/status/1329662049639571457) with grab-site, using this command:

grab-site --wpull-args "--monitor-disk" --wpull-args "--limit-rate 100000" --no-dupespotter sites.txt --delay 250-375 --1

It downloaded them all just fine, or so I thought. Viewing the pages with something like replayweb.page shows "Something went wrong, but don't fret - it's not your fault."

(Screenshot 2020-11-19 220418)

Did the page actually download and replayweb.page just can't show it, or was the page content never captured? WARC below:

twitter.com-foofighters-status-1329662049639571457-2020-11-20-f3367dc6-00000.warc.gz

ivan commented 3 years ago

For playback to work, Twitter should be grabbed like this to get the old non-React layout while it still exists:

# Get the old site instead of the React site
twitter_ua="NOT Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

grab-site --ua "$twitter_ua" --1 [... -i URLS_FILE or some starting URLs ...]

No --delay should be needed either.
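
For anyone who wants to try this end to end, here is a minimal sketch that puts the pieces above together for a single URL. It is only an illustration: the file name urls.txt and the DRY_RUN variable are made-up names for this example, not grab-site features, and it assumes a POSIX shell. By default it only prints the command so you can check the quoting before starting a real crawl.

```shell
#!/bin/sh
# Sketch only: urls.txt and DRY_RUN are illustrative names, not grab-site features.

# User agent from the comment above; per that comment it gets the old non-React site.
twitter_ua="NOT Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# One URL per line; this is the example tweet from the original report.
urls_file="urls.txt"
printf '%s\n' "https://twitter.com/foofighters/status/1329662049639571457" > "$urls_file"

# Store the arguments as positional parameters so the user agent,
# which contains spaces, stays a single argument.
set -- --ua "$twitter_ua" --1 -i "$urls_file"

# DRY_RUN defaults to 1, which only prints the command. Set DRY_RUN=0 to crawl.
if [ "${DRY_RUN:-1}" = "0" ]; then
    grab-site "$@"
else
    echo "would run: grab-site $*"
fi
```

Quoting "$twitter_ua" matters here: without the quotes the shell would split the user agent on spaces and grab-site would see several stray arguments.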

Hope that helps, let me know.

Coloradohusky commented 3 years ago

Works, thanks!