ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
https://archivebox.io
MIT License
20.88k stars, 1.11k forks

Bug: Attempting to remove failed "Archive again" result relating to a pre-0.8.3 snapshot results in ArchiveBox attempting to delete EVERY entry in the database?! #1510

Open jessienab opened 1 week ago

jessienab commented 1 week ago

Describe the bug

Following up to: #1509

It seems ArchiveBox did eventually generate the "Archive again" entries for pre-0.8.3 snapshots; however, it didn't archive them properly. When I attempted to delete these, the following happened:

  1. The server.py/daphne was killed?
    daphne.server Application instance <Task pending name='Task-311' coro=<ProtocolTypeRouter.__call__() running at /usr/local/lib/python3.11/site-packages/channels/routing.py:62> wait_for=<Task cancelling name='Task-314' coro=<ASGIHandler.handle.<locals>.process_request() running at /usr/local/lib/python3.11/site-packages/django/core/handlers/asgi.py:185> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.11/asyncio/futures.py:387, Task.task_wakeup()]> cb=[Task.task_wakeup()]>> for connection <WebRequest at 0x772983088090 method=POST uri=/admin/core/snapshot/ clientproto=HTTP/1.1> took too long to shut down and was killed.
    daphne.server Application instance <Task cancelling name='Task-311' coro=<ProtocolTypeRouter.__call__() running at /usr/local/lib/python3.11/site-packages/channels/routing.py:62> wait_for=<_GatheringFuture pending cb=[Task.task_wakeup()]>> for connection <WebRequest at 0x772983088090 method=POST uri=/admin/core/snapshot/ clientproto=HTTP/1.1> took too long to shut down and was killed.
  2. ArchiveBox then reports the following:

[i] Found 10958 matching URLs to remove.
10958 Links will be de-listed from the main index, and their archived content folders will be deleted from disk.
(9829 data folders with 70489 archived files will be deleted!)

I immediately killed ArchiveBox to prevent further damage, but at this point I'll have to restore from an older backup + manually re-grab a possibly large number of URLs for sites that weren't archived in that backup... :face_exhaling:

My fault! :woman_facepalming:

Steps to reproduce

  1. Attempt to re-snapshot a pre-0.8.3 snapshot
  2. It should fail with a 500 error and a specific error message
  3. The snapshots should eventually appear within the snapshot listings, but will not have been archived at all
  4. Attempt to delete those entries
  5. Depending on number of entries, ArchiveBox will then report through logs that it will delete effectively every entry...

Screenshots or log output

See above

ArchiveBox version

# archivebox version
0.8.3
ArchiveBox v0.8.3 COMMIT_HASH=31576e2 BUILD_TIME=2024-09-06 13:14:49 1725628489
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.6.47-1-lts-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=0:0 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v5.1.1          valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.8.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v8.9.1          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v20.17.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.54         valid     /app/node_modules/single-file-cli/single-file                               
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor               
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js                                  
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2024.8.6       valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v128.0.6613     valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           34 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   

[i] Data locations:
 √  OUTPUT_DIR            9 files @       valid     /data                                                                       
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             168.7 MB        valid     ./index.sqlite3                                                             
 √  ARCHIVE_DIR           4995 files      valid     ./archive                                                                   
 √  SOURCES_DIR           1712 files      valid     ./sources                                                                   
 X  PERSONAS_DIR          missing         invalid   ./personas                                                                  
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 X  CACHE_DIR             missing         invalid   ./cache                                                                     
 X  CUSTOM_TEMPLATES_DIR  missing         invalid   ./templates

pirate commented 1 week ago

Ah shit, looks like some bug in the form parsing for the submit action selected all the snapshots?!

I'll investigate immediately, sorry about messing up your archive. I have several integration tests around the CLI commands that should prevent this type of thing, but this shows I need to improve them to cover more of the UI button actions.

jessienab commented 1 week ago

> Ah shit, looks like some bug in the form parsing for the submit action selected all the snapshots?!
>
> I'll investigate immediately, sorry about messing up your archive. I have several integration tests around the CLI commands that should prevent this type of thing, but this shows I need to improve them to cover more of the UI button actions.

No worries!! My fault for not having functioning backups :) I managed to grab an older DB (3 months out of date), compiled all the URLs from sources/ up to now, and am just re-grabbing them. It seems no website data was deleted, so in the worst case, if a website is now missing from the archive index, the older archived data is still present on disk (I can grep around to find it :+1: )

Thanks again, and I guess the lesson for me is to make a backup (as you indicated and I did not read, hehe) before running betas!!!
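
For anyone in the same spot, the "compile all the URLs from sources/" step could look roughly like the sketch below. It is only an illustration, not what was actually run in this thread: the function name `collect_urls` and the loose URL regex are assumptions, and it simply treats every file under sources/ as text.

```python
import re
from pathlib import Path

# Loose URL pattern (an assumption): stops at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def collect_urls(sources_dir: Path) -> list[str]:
    """Return every distinct URL found in files under sources_dir, in first-seen order."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops duplicates
    for path in sorted(sources_dir.rglob("*")):
        if not path.is_file():
            continue
        # Import files may be in mixed encodings; replace undecodable bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for url in URL_RE.findall(text):
            # Trim punctuation that commonly trails a URL in prose.
            seen.setdefault(url.rstrip(".,;)"), None)
    return list(seen)


if __name__ == "__main__":
    for url in collect_urls(Path("sources")):
        print(url)
```

The printed list can be saved to a file and reviewed before feeding it back in with archivebox add.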

pirate commented 6 days ago

If the older data is still present on disk, running archivebox init should also re-import it, as it scans the archive/ folder for snapshot entries not in the DB and re-creates them from the archive/<id>/index.json file saved with each snapshot output.
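
To preview which folders such a scan would pick up, a read-only check along these lines may help. This is a rough sketch and not ArchiveBox's own code: `find_unindexed_snapshots` is a hypothetical helper, it assumes each index.json is a JSON object with a top-level "url" key, and the set of already-indexed URLs has to be supplied by the caller (for example, exported from the DB beforehand).

```python
import json
from pathlib import Path


def find_unindexed_snapshots(archive_dir: Path, indexed_urls: set[str]) -> list[tuple[str, str]]:
    """Return (folder name, url) for snapshot folders whose URL is not in indexed_urls."""
    missing = []
    for index_file in sorted(archive_dir.glob("*/index.json")):
        try:
            entry = json.loads(index_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupt index.json: skip this folder
        # Assumption: the snapshot URL is stored under a top-level "url" key.
        url = entry.get("url") if isinstance(entry, dict) else None
        if url and url not in indexed_urls:
            missing.append((index_file.parent.name, url))
    return missing


if __name__ == "__main__":
    already_indexed: set[str] = set()  # fill in with the URLs currently in the DB
    for folder, url in find_unindexed_snapshots(Path("archive"), already_indexed):
        print(folder, url)
```

Nothing is written or deleted here; it only lists candidates, so it is safe to run before archivebox init.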

jessienab commented 5 days ago

> If the older data is still present on disk, running archivebox init should also re-import it, as it scans the archive/ folder for snapshot entries not in the DB and re-creates them from the archive/<id>/index.json file saved with each snapshot output.

As luck would have it, I had set up rsnapshot, and I found the backup it made the day before I nuked ArchiveBox; everything is restored! yay :D