bitdruid / python-wayback-machine-downloader

Query and download archive.org as simple as possible.
MIT License

[BUG] JSONDecodeError #20

Closed ekinimo closed 3 months ago

ekinimo commented 3 months ago

Encountered a bug while trying to download a website.

Command to reproduce

waybackup -u http://www.radikal.com.tr --workers 4 --retry 10 -f --auto --start 2012

Terminal output

-----> 31206 snapshots injected
-------------------------
!-- Exception: UNCAUGHT EXCEPTION
!-- File: ../../../../../../../usr/lib/python3.11/json/decoder.py
!-- Function: raw_decode
!-- Line: 355
!-- Segment: raise JSONDecodeError("Expecting value", s, err.value) from None
!-- Description: Expecting value: line 31208 column 1 (char 4789934)

bitdruid commented 3 months ago

Thanks for your bug report!

After running several CDX queries, I ran into the following scenarios:

  1. The query returned about 7,000,000 snapshots and a CDX file of ~1 GB, which crashed the system when it was processed.
  2. requests.get did not receive the full JSON response (1 GB!), so the truncated response was not valid JSON.

For 1., one solution would be to limit the number of snapshots returned by the server (no problem, since the CDX server supports this kind of limiting). Alternatively, waybackup itself could wait for user input if the number of snapshots exceeds a threshold, e.g. 1,000,000.
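As a rough illustration of the server-side limit, here is a minimal sketch that builds a CDX query URL with the `limit`, `from` and `to` parameters of the public CDX server. The helper names `build_cdx_url` and `needs_confirmation` are hypothetical and not part of waybackup; the threshold simply mirrors the 1,000,000 figure mentioned above.

```python
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, start=None, end=None, limit=None):
    """Build a CDX query URL (hypothetical helper, not waybackup's own code).

    `limit` caps the number of rows the server returns; `start` and `end`
    map to the CDX `from` and `to` timestamp filters.
    """
    params = {"url": url, "output": "json"}
    if start is not None:
        params["from"] = str(start)
    if end is not None:
        params["to"] = str(end)
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def needs_confirmation(snapshot_count, threshold=1_000_000):
    """Return True when the result is large enough to ask the user first."""
    return snapshot_count > threshold


print(build_cdx_url("radikal.com.tr", start=2012, limit=100000))
```

Capping the result on the server keeps the response small enough to download and parse, at the cost of dropping every snapshot beyond the limit.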

For 2., a solution would be either to avoid the problem with a limit (see 1.) or to convert the partial result into valid JSON and then use it.
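One way to turn a partial result into valid JSON could look like the sketch below. It assumes the CDX JSON output is a single array with one row per line (which matches the line-based position in the reported error) and that the response was cut off somewhere after the header row. The function name `salvage_truncated_rows` is hypothetical; this is not how waybackup currently handles the response.

```python
import json


def salvage_truncated_rows(text):
    """Recover the complete rows from a truncated CDX JSON array.

    Assumes one row per line. Trailing lines are dropped one at a time
    until the remainder, closed with "]", parses as JSON.
    """
    try:
        return json.loads(text)  # response was complete after all
    except json.JSONDecodeError:
        pass

    lines = text.splitlines()
    while lines:
        # Remove the dangling comma after the last kept row, then close the array.
        candidate = "\n".join(lines).rstrip().rstrip(",")
        try:
            return json.loads(candidate + "]")
        except json.JSONDecodeError:
            lines.pop()  # last line is incomplete, drop it and retry
    return []


truncated = (
    '[["urlkey","timestamp"],\n'
    '["com,example)/","20120101000000"],\n'
    '["com,example)/a","2012'
)
rows = salvage_truncated_rows(truncated)
print(len(rows))
```

For a response of around 1 GB, re-joining the lines on every retry is wasteful, so a real implementation would more likely stream the response and parse it row by row. The sketch only shows the repair step.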

These solutions will always result in an incomplete download. The only way around this is to set a shorter range and split the query into several smaller jobs.
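Splitting a long range into smaller jobs can be sketched as follows. The helper `split_years` is hypothetical, and how each range is handed to waybackup is left open: the report above only shows `--start`, so a matching end bound is an assumption here.

```python
def split_years(start, end, step=1):
    """Return inclusive (first_year, last_year) pairs covering start..end."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return [
        (year, min(year + step - 1, end))
        for year in range(start, end + 1, step)
    ]


# Each pair would become one smaller query or download job.
for first, last in split_years(2012, 2016, step=2):
    print(first, last)
```

Running one job per pair keeps each CDX response small, and the jobs together still cover the full range.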

Conclusion: I will try to implement the best trade-off between these ideas. Meanwhile, for your bug, just heavily reduce the range and run waybackup several times over smaller ranges instead.