HaveIBeenPwned / PwnedPasswordsDownloader

A tool to download all Pwned Passwords hash ranges and save them offline so they can be used without a dependency on the k-anonymity API
BSD 3-Clause "New" or "Revised" License

Downloader reports success, but the result appears to be incomplete #16

Closed: benblank closed this issue 1 year ago

benblank commented 1 year ago

I'm using haveibeenpwned-downloader for the first time and it does not appear to be behaving correctly.

When I run the downloader with only the destination file and the overwrite flag (see log below), it starts off as expected. After fluctuating a bit, the estimated time remaining settles in the 90-120 minute range. For several minutes, everything seems normal: the percentage ticks up, the time remaining ticks down, and while there are occasional failed attempts, I've never seen one with an attempt number other than 1.

However, the process must run into a problem at some point, because when I come back to it later it reports that it has "finished downloading all hash ranges" and the estimated time remaining is zero, yet the percentage complete is in the 10-15% range, the time taken is 20-25 minutes, and the output file is much smaller than expected (~5 GB rather than 35-40 GB) and appears to be truncated.

I have been through this process a few times and the result is always similar. The terminal output from my latest attempt is below.

| | Version | Note |
| --- | --- | --- |
| Windows | 10.0.19045.2546 | "Windows 10 22H2" |
| dotnet | 6.0.405 | |
| haveibeenpwned-downloader | 0.2.7 | installed by running `dotnet tool install --global haveibeenpwned-downloader` |
Terminal output of haveibeenpwned-downloader session:

```text
PS G:\> haveibeenpwned-downloader .\hibp-pwned-passwords-sha1 --overwrite
Failed attempt #1 fetching https://api.pwnedpasswords.com/range/01496. Response contained HTTP Status code ServiceUnavailable.
Failed attempt #1 fetching https://api.pwnedpasswords.com/range/0461D. Response contained HTTP Status code ServiceUnavailable.
Failed attempt #1 fetching https://api.pwnedpasswords.com/range/073D2. Response contained HTTP Status code ServiceUnavailable.
Failed attempt #1 fetching https://api.pwnedpasswords.com/range/119FF. Response contained HTTP Status code ServiceUnavailable.
Failed attempt #1 fetching https://api.pwnedpasswords.com/range/13AB1. Response contained HTTP Status code ServiceUnavailable.
Hash ranges downloaded ---------------------------------------- 13% 00:00:00
Finished downloading all hash ranges in 1,358,036ms (103.12 hashes per second).
We made 140,168 Cloudflare requests (avg response time: 100.99ms). Of those, Cloudflare had already cached 139,539 requests, and made 629 requests to the Have I Been Pwned origin server.
PS G:\> Get-ChildItem *.txt

    Directory: G:\

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         1/23/2023  9:40 AM     5012458890 hibp-pwned-passwords-sha1.txt
-a----         12/2/2021  1:04 AM    37342268646 pwned-passwords-sha1-ordered-by-hash-v8.txt

PS G:\> Get-Content .\hibp-pwned-passwords-sha1.txt -Tail 10
22306FBAF0C21F1D6137F13286E895A0116351E4:1
22306FC0DAD8C4EE00D26E26551ABB8336D6E76F:6
22306FC3ED89262785AAAB6126496AD390234D18:1
22306FCCB66A3CDD8DCF6BD81294FAF4427684EC:3
22306FCDB739B36ECA4FAE9F55134123EB4C84B4:10
22306FEF43B51A26138BD7A23D96289E35C3754D:4
22306FF4A933CE31819AAD29AA32E443C33C3CF5:1
22306FF689DB0ED32BB0F43E3E9CB37F9E3F78E2:3
22306FFA61DE34B22C1AE78DDBC94AFF238D6EEB:2
22306FFCE5F392AB8778DAD45DAF07311525D28D:2
PS G:\> cmd /c ver

Microsoft Windows [Version 10.0.19045.2546]
PS G:\> dotnet --version
6.0.405
PS G:\> dotnet tool list --global
Package Id                      Version      Commands
---------------------------------------------------------------------
haveibeenpwned-downloader       0.2.7        haveibeenpwned-downloader
```
henricj commented 1 year ago

A https://api.pwnedpasswords.com/SHA512 containing hashes for all those files sure would be nice. One could then confirm that the download completed successfully (and that bitrot hadn't corrupted anything after the fact). That same SHA512 file could even be used to verify the single file, with some extra tooling.

One might also consider adding a SHA512.gpg file on top of that.
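
To sketch the idea in code (the manifest, its `<hex digest>  <filename>` line format, and the per-range file names below are all made up; no such file exists today):

```csharp
// Hypothetical verifier for a proposed SHA512 manifest of all range files.
using System;
using System.IO;
using System.Security.Cryptography;

class ManifestVerifier
{
    static void Main(string[] args)
    {
        var manifestPath = args[0]; // e.g. a downloaded SHA512SUMS-style file
        var downloadDir = args[1];  // directory holding the downloaded range files

        using var sha512 = SHA512.Create();
        foreach (var line in File.ReadLines(manifestPath))
        {
            // Assumed manifest format: "<hex sha512>  <filename>"
            var parts = line.Split("  ", 2);
            if (parts.Length != 2)
                continue;

            using var stream = File.OpenRead(Path.Combine(downloadDir, parts[1]));
            var actual = Convert.ToHexString(sha512.ComputeHash(stream));

            if (!actual.Equals(parts[0], StringComparison.OrdinalIgnoreCase))
                Console.WriteLine($"MISMATCH: {parts[1]}");
        }
    }
}
```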

benblank commented 1 year ago

> A https://api.pwnedpasswords.com/SHA512 containing hashes for all those files

That's a really good idea. When I was creating this issue, I was thinking about how to resume halted downloads or (potentially) update without a full re-download. A hash file would fit both of those use cases nicely!

troyhunt commented 1 year ago

We deliberately moved away from that model as creating a single monolithic file comprised of 1M+ constantly changing parts was a nightmare to maintain. The downloader here gets around that by pulling directly from the API which is also heavily cached at Cloudflare. That said, the original issue raised by @benblank needs to be addressed, copying @stebet on that one 🙂

stebet commented 1 year ago

I'll take a look at this. Thanks for the detailed report.

tomudding commented 1 year ago

I had the same issue with v0.2.8 on my first run:

[screenshot: "PwnedPasswords (In)Complete Download"]

Second run did not have this problem.

henricj commented 1 year ago

> We deliberately moved away from that model as creating a single monolithic file comprised of 1M+ constantly changing parts was a nightmare to maintain. The downloader here gets around that by pulling directly from the API which is also heavily cached at Cloudflare. That said, the original issue raised by @benblank needs to be addressed, copying @stebet on that one 🙂

Is there some way for the downloader to know that it has actually gotten a valid chunk? Reading them all, waiting a while, and then reading each chunk again to compare (repeating the process until two sequential downloads of a given chunk match) is the only thing I could think of, and that is slow and wasteful of resources. IIRC, CloudFlare passes through custom headers, so a per-chunk x-hibp-digest: sha256=XXXX= should be possible without too much fuss.
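
Something like this on the client side, for instance (x-hibp-digest is the header I'm proposing, not something the API sends today):

```csharp
// Sketch: verify a chunk against a hypothetical "x-hibp-digest: sha256=<base64>" header.
using System;
using System.Net.Http;
using System.Security.Cryptography;
using System.Threading.Tasks;

class ChunkDigestCheck
{
    static async Task Main()
    {
        using var http = new HttpClient();
        using var response = await http.GetAsync("https://api.pwnedpasswords.com/range/01496");
        response.EnsureSuccessStatusCode();

        var body = await response.Content.ReadAsByteArrayAsync();

        if (response.Headers.TryGetValues("x-hibp-digest", out var values))
        {
            foreach (var value in values)
            {
                if (!value.StartsWith("sha256=")) continue;

                var expected = Convert.FromBase64String(value.Substring("sha256=".Length));
                if (!SHA256.HashData(body).AsSpan().SequenceEqual(expected))
                    throw new InvalidOperationException("Chunk digest mismatch; retry this range.");
            }
        }
    }
}
```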

Also, I made some changes here that get HTTP/2 working (the servers didn't seem to want to talk HTTP/3), stop it from creating a string for each parsed line, and distribute the chunks across multiple directories (when not downloading to a single file). I see there are some recent changes that are not yet incorporated, and I'm not done poking at it...
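
The directory distribution is roughly the following idea; the two-character bucket here is illustrative rather than the exact layout in my branch:

```csharp
// Sketch: bucket each 5-hex-digit range file by its first two characters,
// so no single directory has to hold all ~1M files.
using System.IO;

static class ChunkLayout
{
    public static string PathFor(string baseDir, string range)
    {
        // e.g. range "01496" -> <baseDir>/01/01496.txt
        var dir = Path.Combine(baseDir, range.Substring(0, 2));
        Directory.CreateDirectory(dir); // no-op if it already exists
        return Path.Combine(dir, range + ".txt");
    }
}
```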

benblank commented 1 year ago

> a per-chunk x-hibp-digest: sha256=XXXX=

With a little bookkeeping, a new header isn't even necessary. CloudFlare appears to fully support the Last-Modified response header and the If-Modified-Since request header, meaning that if you track when each chunk was last modified (or, for individual files, simply trust the filesystem's modification time), you can get CF to respond with 304s rather than the full contents.

A quick demonstration using curl:

```text
$ curl --head https://api.pwnedpasswords.com/range/01496
HTTP/2 200
date: Fri, 10 Feb 2023 17:25:27 GMT
content-type: text/plain
cf-ray: 79768b519f4fcef1-SJC
access-control-allow-origin: *
age: 13763
cache-control: public, max-age=2678400
expires: Mon, 13 Mar 2023 17:25:27 GMT
last-modified: Sun, 28 Aug 2022 00:44:15 GMT
strict-transport-security: max-age=31536000; includeSubDomains; preload
vary: Accept-Encoding
cf-cache-status: HIT
arr-disable-session-affinity: True
request-context: appId=cid-v1:639b3d62-d78b-45f0-8442-2b7f52b50c2e
x-content-type-options: nosniff
set-cookie: __cf_bm=0de8q6EO5brZGw3zfj2g6GAtMdFOOuR73WQ7WzSoVyM-1676049927-0-AZvPQFoqNhPzf1x5nPlWJR7qsrpg0rui7xKSIWKbu1rmHbQ4YDcpztnqvpSn9LqvIvKmRXnXcKnlqVB0iJLFaWs=; path=/; expires=Fri, 10-Feb-23 17:55:27 GMT; domain=.pwnedpasswords.com; HttpOnly; Secure; SameSite=None
server: cloudflare

$ curl --head --header "If-Modified-Since: Sun, 28 Aug 2022 00:44:15 GMT" https://api.pwnedpasswords.com/range/01496
HTTP/2 304
date: Fri, 10 Feb 2023 17:26:16 GMT
cf-ray: 79768c7f3854f9ea-SJC
access-control-allow-origin: *
age: 13812
cache-control: public, max-age=2678400
expires: Mon, 13 Mar 2023 17:26:16 GMT
last-modified: Sun, 28 Aug 2022 00:44:15 GMT
strict-transport-security: max-age=31536000; includeSubDomains; preload
vary: Accept-Encoding
cf-cache-status: HIT
arr-disable-session-affinity: True
request-context: appId=cid-v1:639b3d62-d78b-45f0-8442-2b7f52b50c2e
x-content-type-options: nosniff
set-cookie: __cf_bm=UP4ozngI7A9TU30I217KyRUVWMH6cwfwgVs6ZOCQPVg-1676049976-0-AReEQ/lIrun8MQ/f9PuzC8lGohUNYXHimV3FFb20iEJtQlvilKUYrUqUq+Pp/CkwJmMoyRo9o8107oSgedSWbPM=; path=/; expires=Fri, 10-Feb-23 17:56:16 GMT; domain=.pwnedpasswords.com; HttpOnly; Secure; SameSite=None
server: cloudflare
```
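
And a sketch of how the downloader might use this, assuming it kept one file per range and mirrored the server's Last-Modified into the file's mtime (the file layout here is made up):

```csharp
// Sketch: conditional re-download of one range using If-Modified-Since.
using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ConditionalRangeFetch
{
    static async Task Main()
    {
        const string range = "01496"; // example range from above
        var path = $"{range}.txt";    // assumed local layout: one file per range
        using var http = new HttpClient();

        var request = new HttpRequestMessage(HttpMethod.Get,
            $"https://api.pwnedpasswords.com/range/{range}");
        if (File.Exists(path))
            request.Headers.IfModifiedSince = File.GetLastWriteTimeUtc(path);

        using var response = await http.SendAsync(request);
        if (response.StatusCode == HttpStatusCode.NotModified)
            return; // local copy is still current

        response.EnsureSuccessStatusCode();
        await File.WriteAllTextAsync(path, await response.Content.ReadAsStringAsync());

        // Keep the file's mtime in sync with Last-Modified so the next run
        // sends the right If-Modified-Since value.
        if (response.Content.Headers.LastModified is { } lastModified)
            File.SetLastWriteTimeUtc(path, lastModified.UtcDateTime);
    }
}
```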
henricj commented 1 year ago

> a per-chunk x-hibp-digest: sha256=XXXX=
>
> With a little bookkeeping, a new header isn't even necessary. CloudFlare appears to fully support the Last-Modified response header and the If-Modified-Since request header, meaning that if you track when each chunk was last modified (or, for individual files, simply trust the filesystem's modification time), you can get CF to respond with 304s rather than the full contents.

That would allow for incremental updates, but with possible corporate proxies, inconvenient power outages, and who-knows-what nonsense between the client and the server, how can one even confirm that the whole chunk was downloaded? The individual pieces are small, but together they add up, and when dealing with non-trivial data sets, expecting deterministic behavior from complex systems is optimistic (not that a couple of gigabytes is huge). There's a good reason for using ECC memory and for ZFS's checksums on everything. Providing a header with a known checksum algorithm (be that a custom one, an ETag, or what-not), or at the very least a length, would make a download client a whole lot more robust, particularly when restarting an interrupted transfer. To put it another way: if there is only a one-in-a-billion chance of something wonky happening during a chunk transfer, that becomes roughly a one-in-a-thousand chance when transferring a million chunks.
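
Spelled out, assuming independent per-chunk failures with probability $p$ across $N$ chunks:

$$P(\text{at least one failure}) = 1 - (1 - p)^N \approx Np = 10^{-9} \cdot 10^{6} = 10^{-3}$$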

BTW, I think I ran into the interrupted-download problem when mucking around with the client code, and I think it was caused by the retries only applying to the start of the transfer, not to reading the ResponseMessage.Content. The HttpCompletionOption.ResponseHeadersRead passed to HttpClient.SendAsync() lets that call complete before the content is read. (So why not just use HttpClient.GetAsync() instead?) IIRC, when I saw it, the problem popped up when the server closed the connection between .SendAsync() completing and the content read beginning.
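
In other words, something shaped like this, where the retry wraps the body read as well (names and the backoff policy are illustrative, not the tool's actual code):

```csharp
// Sketch: retry the whole transfer, not just the initial request, so a
// connection dropped mid-body also triggers a retry.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class RetryWholeTransfer
{
    static async Task<string> FetchRangeAsync(HttpClient http, string range, int maxAttempts = 5)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                // GetAsync defaults to HttpCompletionOption.ResponseContentRead,
                // so the whole body is buffered before it returns; a mid-body
                // failure surfaces here, inside the try block.
                using var response = await http.GetAsync(
                    $"https://api.pwnedpasswords.com/range/{range}");
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                await Task.Delay(TimeSpan.FromSeconds(attempt)); // simple linear backoff
            }
        }
    }
}
```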

modrobert commented 1 year ago

I'm not sure if this qualifies as a workaround, but the downloader seems to work more reliably (fewer incomplete downloads) when using the parallelism flag, e.g.:

```text
haveibeenpwned-downloader.exe pwnedpasswords -p 8
```

DM-Francis commented 1 year ago

I ran into the same issue; the download only ran up to 32% (first run) and 73% (second run). I tried the suggestion a bit higher up about removing the HttpCompletionOption.ResponseHeadersRead option and replacing the call with HttpClient.GetAsync() in a local copy of the source. That seemed to work, and it fully downloaded all the hashes. It appears that one of those content errors was caught along the way: [screenshot of the caught content error]

I'll create a PR with the change.

modrobert commented 1 year ago

> I'll create a PR with the change.

Your patch seems to work, thanks.

stebet commented 1 year ago

Good catch @DM-Francis, I'm going through bugs and issues now and working on stabilizing things. I'll be merging your PR to fix this issue. Let's see if we can then close this :)