bellingcat / auto-archiver

Automatically archive links to videos, images, and social media content from Google Sheets (and more).
https://pypi.org/project/auto-archiver/
MIT License

AttributeError: 'HashEnricher' object has no attribute 'algorithm' #71

Closed milesmcc closed 1 year ago

milesmcc commented 1 year ago

I'm running into an error when archiving a simple URL locally on macOS with version 0.4.4. The command is python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land", and it fails with AttributeError: 'HashEnricher' object has no attribute 'algorithm'.

Here is my config:

steps:
  # only 1 feeder allowed
  feeder: cli_feeder
  archivers: # order matters, uncomment to activate
    # - vk_archiver
    # - telethon_archiver
    - telegram_archiver
    - twitter_archiver
    # - twitter_api_archiver
    # - instagram_tbot_archiver
    # - instagram_archiver
    - tiktok_archiver
    - youtubedl_archiver
    # - wayback_archiver_enricher
  enrichers:
    - hash_enricher
    - screenshot_enricher
    - thumbnail_enricher
    # - wayback_archiver_enricher
    - wacz_enricher

  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
    # - s3_storage
    # - gdrive_storage
  databases:
    - console_db
    - csv_db
    # - gsheet_db
    # - mongo_db

configurations:
  screenshot_enricher:
    width: 1280
    height: 2300
  hash_enricher:
    algorithm: "SHA-256" # can also be SHA3-512
  local_storage:
    save_to: "./local_archive"
    save_absolute: true
    filename_generator: static
    path_generator: flat

And here is the full log:

% python3 -m auto_archiver --config orch.yaml --cli_feeder.urls="https://miles.land"
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:108 - FEEDER: cli_feeder
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:109 - ENRICHERS: ['hash_enricher', 'screenshot_enricher', 'thumbnail_enricher', 'wacz_enricher']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:110 - ARCHIVERS: ['telegram_archiver', 'twitter_archiver', 'tiktok_archiver', 'youtubedl_archiver']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:111 - DATABASES: ['console_db', 'csv_db']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:112 - STORAGES: ['local_storage']
2023-03-14 11:52:44.363 | INFO     | auto_archiver.core.config:parse:113 - FORMATTER: html_formatter
2023-03-14 11:52:44.363 | DEBUG    | auto_archiver.feeders.cli_feeder:__iter__:28 - Processing https://miles.land
2023-03-14 11:52:44.364 | DEBUG    | auto_archiver.core.orchestrator:archive:66 - result.rearchivable=True for url='https://miles.land'
2023-03-14 11:52:44.364 | WARNING  | auto_archiver.databases.console_db:started:22 - STARTED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[], rearchivable=True)
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver telegram_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver twitter_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver tiktok_archiver for https://miles.land
2023-03-14 11:52:44.364 | INFO     | auto_archiver.core.orchestrator:archive:87 - Trying archiver youtubedl_archiver for https://miles.land
[generic] Extracting URL: https://miles.land
[generic] miles: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] miles: Extracting information
ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.archivers.youtubedl_archiver:download:37 - No video - Youtube normal control flow: ERROR: Unsupported URL: https://miles.land
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.hash_enricher:enrich:30 - calculating media hashes for url='https://miles.land' (using SHA-256)
2023-03-14 11:52:45.051 | DEBUG    | auto_archiver.enrichers.screenshot_enricher:enrich:27 - Enriching screenshot for url='https://miles.land'
2023-03-14 11:52:53.272 | DEBUG    | auto_archiver.enrichers.thumbnail_enricher:enrich:23 - generating thumbnails
2023-03-14 11:52:53.273 | DEBUG    | auto_archiver.enrichers.wacz_enricher:enrich:35 - generating WACZ for url='https://miles.land'
2023-03-14 11:52:53.273 | INFO     | auto_archiver.enrichers.wacz_enricher:enrich:61 - Running browsertrix-crawler: docker run --rm -v /Users/miles/Desktop/tmpjdhyhj5x:/crawls/ webrecorder/browsertrix-crawler crawl --url https://miles.land --scopeType page --generateWACZ --text --collection dd5fef44 --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 90 --timeout 90
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.832Z","context":"general","message":"Page context being used with 1 worker","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Set netIdleWait to 15 seconds","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:53.833Z","context":"general","message":"Seeds","details":[{"url":"https://miles.land/","include":[],"exclude":[],"scopeType":"page","sitemap":false,"allowHash":false,"maxExtraHops":0,"maxDepth":99999}]}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.094Z","context":"state","message":"Storing state in memory","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.416Z","context":"general","message":"Text Extraction: Enabled","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:54.515Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"limit":{"max":0,"hit":false},"pendingPages":["{\"url\":\"https://miles.land/\",\"seedId\":0,\"depth\":0,\"started\":\"2023-03-14T18:52:54.448Z\"}"]}}
{"logLevel":"error","timestamp":"2023-03-14T18:52:58.314Z","context":"general","message":"Invalid Seed \"mailto:hey@miles.land\" - URL must start with http:// or https://","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.338Z","context":"behavior","message":"Behaviors started","details":{"behaviorTimeout":90,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.339Z","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.340Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://miles.land/","page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.341Z","context":"pageStatus","message":"Page finished","details":{"page":"https://miles.land/"}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":1,"pending":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.391Z","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.396Z","context":"general","message":"Generating WACZ","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.398Z","context":"general","message":"Num WARC Files: 8","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.700Z","context":"general","message":"Validating passed pages.jsonl file\nReading and Indexing All WARCs\nWriting archives...\nWriting logs...\nGenerating page index from passed pages...\nHeader detected in the passed pages.jsonl file\nGenerating datapackage.json\nGenerating datapackage-digest.json\n","details":{}}
{"logLevel":"info","timestamp":"2023-03-14T18:52:58.737Z","context":"general","message":"Crawl status: done","details":{}}
2023-03-14 11:52:58.886 | ERROR    | auto_archiver.core.orchestrator:feed_item:44 - Got unexpected error on item Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True): 'HashEnricher' object has no attribute 'algorithm'
Traceback (most recent call last):
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 37, in feed_item
    return self.archive(item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/core/orchestrator.py", line 110, in archive
    s.store(m, result)  # modifies media
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 46, in store
    self.set_key(media, item)
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/storages/storage.py", line 78, in set_key
    he = HashEnricher({"algorithm": "SHA-256", "chunksize": 1.6e7})
  File "/Users/miles/.asdf/installs/python/3.10.1/lib/python3.10/site-packages/auto_archiver/enrichers/hash_enricher.py", line 18, in __init__
    assert self.algorithm in algo_choices, f"Invalid hash algorithm selected, must be one of {algo_choices} (you selected {self.algorithm})."
AttributeError: 'HashEnricher' object has no attribute 'algorithm'

2023-03-14 11:52:58.887 | ERROR    | auto_archiver.databases.console_db:failed:25 - FAILED Metadata(status='no archiver', _processed_at=datetime.datetime(2023, 3, 14, 18, 52, 44, 364021), metadata={'url': 'https://miles.land', 'folder': 'cli', 'tmp_dir': './tmpjdhyhj5x'}, media=[Media(filename='./tmpjdhyhj5x/screenshot_8df3af2d.png', key=None, urls=[], _mimetype='image/png', properties={'id': 'screenshot'}), Media(filename='/Users/miles/Desktop/tmpjdhyhj5x/collections/dd5fef44/dd5fef44.wacz', key=None, urls=[], _mimetype=None, properties={'id': 'browsertrix'})], rearchivable=True)
2023-03-14 11:52:58.887 | SUCCESS  | auto_archiver.feeders.cli_feeder:__iter__:30 - Processed 1 URL(s)

I can try to investigate and submit a PR, but figured I'd open the issue so it's on record.
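
For anyone picking this up: the traceback shows storage.py constructing HashEnricher({"algorithm": "SHA-256", "chunksize": 1.6e7}) directly, and the assert in __init__ then reading self.algorithm before it exists. Below is a minimal sketch of one way that could happen. It assumes a base class that only copies settings found under a key matching the step's name. The Step class, its lookup logic, and the try_build helper are hypothetical stand-ins, not the actual auto_archiver code, so treat this as a guess at the mechanism rather than a diagnosis.

```python
class Step:
    """Hypothetical base class: copies settings from config[self.name] onto the instance."""

    name = None

    def __init__(self, config: dict) -> None:
        # Only settings nested under the step's own name are picked up.
        for key, value in config.get(self.name, {}).items():
            setattr(self, key, value)


class HashEnricher(Step):
    """Stand-in for the real enricher; only the constructor check is modelled."""

    name = "hash_enricher"
    ALGO_CHOICES = ("SHA-256", "SHA3-512")

    def __init__(self, config: dict) -> None:
        super().__init__(config)
        # If no 'algorithm' attribute was set above, this attribute access
        # raises AttributeError before the assert can report a bad value.
        assert self.algorithm in self.ALGO_CHOICES, (
            f"Invalid hash algorithm selected, must be one of {self.ALGO_CHOICES}"
        )


def try_build(config: dict) -> str:
    """Hypothetical helper: return the algorithm, or a description of the failure."""
    try:
        return HashEnricher(config).algorithm
    except AttributeError as exc:
        return f"AttributeError: {exc}"


# Flat dict, shaped like the call in the traceback: nothing is copied over.
flat = try_build({"algorithm": "SHA-256", "chunksize": 1.6e7})

# Same settings nested under the step name: the attribute gets set.
nested = try_build({"hash_enricher": {"algorithm": "SHA-256", "chunksize": 1.6e7}})

print(flat)
print(nested)
```

If the real base class works along these lines, nesting the settings under the step name at the call site in storage.py (or having the constructor fall back to defaults) would avoid the error. That needs to be checked against the actual source.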