bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

generating WACZ without Docker - wacz not working #86

djhmateer closed this issue 10 months ago

djhmateer commented 10 months ago

I'm getting a proxy connection failure (net::ERR_PROXY_CONNECTION_FAILED) from the wacz_archiver_enricher on all URLs.

This is the first time I've set this up, so it's probably something simple, or maybe I've missed something.

My next step is to set up a local dev version and debug it, but this issue may be useful for others at the same stage as me.

I have the profile set up in secrets/profile.tar.gz, which I created with:

# create a new profile
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/"

Output of the run is:

docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml

2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:111 - FEEDER: gsheet_feeder
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:112 - ENRICHERS: ['hash_enricher', 'wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:113 - ARCHIVERS: ['wacz_archiver_enricher']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:114 - DATABASES: ['gsheet_db']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:115 - STORAGES: ['local_storage']
2023-08-22 10:50:17.450 | INFO     | auto_archiver.core.config:parse:116 - FORMATTER: html_formatter
2023-08-22 10:50:24.319 | INFO     | auto_archiver.feeders.gsheet_feeder:__iter__:48 - Opening worksheet ii=0: wks.title='Sheet1' header=1
2023-08-22 10:50:26.275 | WARNING  | auto_archiver.databases.gsheet_db:started:28 - STARTED Metadata(status='no archiver', metadata={'_processed_at': datetime.datetime(2023, 8, 22, 10, 50, 26, 274503), 'url': 'https://twitter.com/dave_mateer/status/1505876265504546817'}, media=[])
2023-08-22 10:50:26.916 | INFO     | auto_archiver.core.orchestrator:archive:85 - Trying archiver wacz_archiver_enricher for https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:50:26.916 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:50:26.916 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection 5e60e6e9 --id 5e60e6e9 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:50:27.983Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.205Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.263Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.269Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:30.270Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:50:30.206Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:50:28.119Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:50:32.373Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:02.379Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_TIMED_OUT at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:02.380Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.386Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.483Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:02.485Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:02.489 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/5e60e6e9/5e60e6e9.wacz'
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:31 - calculating media hashes for url='https://twitter.com/dave_mateer/status/1505876265504546817' (using SHA3-512)
2023-08-22 10:51:02.490 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:52 - generating WACZ without Docker for url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:02.490 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:96 - Running browsertrix-crawler: crawl --url https://twitter.com/dave_mateer/status/1505876265504546817 --scopeType page --generateWACZ --text --screenshot fullPage --collection c851aa3f --id c851aa3f --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 120 --timeout 120 --profile /app/secrets/profile.tar.gz
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.099Z","context":"general","message":"Browsertrix-Crawler 0.10.3 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"Seeds","details":[{"url":"https://twitter.com/dave_mateer/status/1505876265504546817","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":1000000}]}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.100Z","context":"general","message":"With Browser Profile","details":{"url":"/app/secrets/profile.tar.gz"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Creating 1 workers","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.653Z","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.703Z","context":"browser","message":"Disabling Service Workers for profile","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817"}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:03.710Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":null,"total":null,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2023-08-22T10:51:03.654Z\",\"url\":\"https://twitter.com/dave_mateer/status/1505876265504546817\",\"added\":\"2023-08-22T10:51:03.157Z\",\"depth\":0}"]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.392Z","context":"general","message":"Awaiting page load","details":{"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.398Z","context":"general","message":"Page Load Error, skipping page","details":{"msg":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"error","timestamp":"2023-08-22T10:51:05.399Z","context":"worker","message":"Unknown exception","details":{"type":"exception","message":"net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817","stack":"Error: net::ERR_PROXY_CONNECTION_FAILED at https://twitter.com/dave_mateer/status/1505876265504546817\n    at navigate (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:98:23)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Deferred.race (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/util/Deferred.js:79:20)\n    at async Frame.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Frame.js:64:21)\n    at async CDPPage.goto (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:578:16)\n    at async Crawler.loadPage (file:///app/crawler.js:1062:20)\n    at async Crawler.default [as driver] (file:///app/defaultDriver.js:3:3)\n    at async Crawler.crawlPage (file:///app/crawler.js:451:5)\n    at async PageWorker.timedCrawlPage (file:///app/util/worker.js:151:7)\n    at async PageWorker.runLoop (file:///app/util/worker.js:192:9)","workerid":0}}
{"logLevel":"warn","timestamp":"2023-08-22T10:51:05.399Z","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://twitter.com/dave_mateer/status/1505876265504546817","workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.409Z","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"failed":1,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.540Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.542Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Num WARC Files: 0","details":{}}
{"logLevel":"info","timestamp":"2023-08-22T10:51:05.543Z","context":"general","message":"Crawl status: done","details":{}}
2023-08-22 10:51:05.549 | WARNING  | auto_archiver.enrichers.wacz_enricher:enrich:108 - Unable to locate and upload WACZ  filename='collections/c851aa3f/c851aa3f.wacz'
2023-08-22 10:51:05.549 | DEBUG    | auto_archiver.formatters.html_formatter:format:37 - [SKIP] FORMAT there is no media or metadata to format: url='https://twitter.com/dave_mateer/status/1505876265504546817'
2023-08-22 10:51:05.549 | SUCCESS  | auto_archiver.databases.gsheet_db:done:46 - DONE https://twitter.com/dave_mateer/status/1505876265504546817
2023-08-22 10:51:06.365 | SUCCESS  | auto_archiver.feeders.gsheet_feeder:__iter__:79 - Finished worksheet Sheet1

and orchestration.yaml is:

steps:
  # only 1 feeder allowed
  feeder: gsheet_feeder # defaults to cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    # - telegram_archiver
    # - twitter_archiver
    #- twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    # - tiktok_archiver
    # - youtubedl_archiver
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
  enrichers:
    - hash_enricher
    # - metadata_enricher
    # - screenshot_enricher
    # - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_archiver_enricher
    # - pdq_hash_enricher # if you want to calculate hashes for thumbnails, include this after thumbnail_enricher
  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    #- console_db
    # - csv_db
    - gsheet_db
    # - mongo_db

configurations:
  gsheet_feeder:
    sheet: "AA Demo Main"
    header: 1
    service_account: "secrets/service_account.json"
    # allow_worksheets: "only parse this worksheet"
    # block_worksheets: "blocked sheet 1,blocked sheet 2"
    use_sheet_names_in_stored_paths: false
    columns:
      url: link
      status: archive status
      folder: destination folder
      archive: archive location
      date: archive date
      thumbnail: thumbnail
      timestamp: upload timestamp
      title: upload title
      text: textual content
      screenshot: screenshot
      hash: hash
      pdq_hash: perceptual hashes
      wacz: wacz
      replaywebpage: replaywebpage
  instagram_tbot_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
  telethon_archiver:
    api_id: "TELEGRAM_BOT_API_ID"
    api_hash: "TELEGRAM_BOT_API_HASH"
    # session_file: "secrets/anon"
    join_channels: false
    channel_invites: # if you want to archive from private channels
      - invite: https://t.me/+123456789
        id: 0000000001
      - invite: https://t.me/+123456788
        id: 0000000002

  twitter_api_archiver:
    # either bearer_token only
    # bearer_token: "TWITTER_BEARER_TOKEN"

  instagram_archiver:
    username: "INSTAGRAM_USERNAME"
    password: "INSTAGRAM_PASSWORD"
    # session_file: "secrets/instaloader.session"

  vk_archiver:
    username: "or phone number"
    password: "vk pass"
    session_file: "secrets/vk_config.v2.json"

  screenshot_enricher:
    width: 1280
    height: 2300
  wayback_archiver_enricher:
    timeout: 10
    key: "wayback key"
    secret: "wayback secret"
  hash_enricher:
    algorithm: "SHA3-512" # can also be SHA-256
  wacz_archiver_enricher:
    profile: secrets/profile.tar.gz
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat
  s3_storage:
    bucket: your-bucket-name
    region: reg1
    key: S3_KEY
    secret: S3_SECRET
    endpoint_url: "https://{region}.digitaloceanspaces.com"
    cdn_url: "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"
    # if private:true S3 urls will not be readable online
    private: false
    # with 'random' you can generate a random UUID for the URL instead of a predictable path, useful to still have public but unlisted files; the alternative is 'default' or omitting it from the config
    key_path: random
  gdrive_storage:
    path_generator: url
    filename_generator: random
    root_folder_id: folder_id_from_url
    oauth_token: secrets/gd-token.json # needs to be generated with scripts/create_update_gdrive_oauth_token.py
    service_account: "secrets/service_account.json"
  csv_db:
    csv_file: "./local_archive/db.csv"

msramalho commented 10 months ago

Thank you for opening the issue. I recently stumbled upon the same problem.

I'll try to look into it soon, since it effectively blocks WACZ collections when using a Docker deployment.

msramalho commented 10 months ago

After investigating, I found that the bug was introduced in this commit: https://github.com/bellingcat/auto-archiver/commit/987bbcaad083310791dda98687c24b9748089cfe

What happened? We removed the pywb dependency from the Pipfile, thinking it was not being used, but browsertrix-crawler requires it in order to work. pywb is installed in browsertrix-crawler's own Dockerfile, but since we use pipenv instead of the default pip installation, that copy was not being accessed. It therefore needs to be added explicitly to the Pipfile.
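
For anyone hitting the same symptom, a quick way to confirm this kind of problem is to check whether pywb can be imported from the Python environment that actually runs the archiver. The snippet below is only a minimal sketch of such a pre-flight check: the helper name `missing_modules` is hypothetical and not part of auto-archiver or browsertrix-crawler, and it uses only the standard library.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed
    # in the interpreter's current environment.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "pywb" is the dependency named in this thread; the check itself is a
    # hypothetical diagnostic step, not something auto-archiver does.
    missing = missing_modules(["pywb"])
    if missing:
        print("Missing from this environment:", ", ".join(missing))
    else:
        print("pywb is importable in this environment")
```

Running it inside the pipenv environment (rather than with the system Python) matters here, because the issue was that the two environments did not see the same packages.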

djhmateer commented 10 months ago

Thank you @msramalho, it is working for me now!