Hey @muzso, that is an awesome approach and I'm kinda wondering why we never did that in the first place! I've added a reference to the readme, well done!
@troyhunt
Thanks for the mention in the readme.
makes use of a regular expression
It's not a regular expression. Curl calls it "globbing", which is a simple method for specifying string patterns.
See:
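For illustration, curl globbing uses numeric/alphabetic ranges and brace lists rather than regex character classes; the example below is adapted from the curl manual (the URL is purely illustrative), and #1, #2, #3 in -o refer back to the globs in the URL:
# Ranges go in [], lists in {}; this expands to 4*4*3 = 48 URLs.
curl -o "archive#1_vol#2_part#3.html" \
     "https://example.com/archive[1996-1999]/vol[1-4]/part{a,b,c}.html"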
Fixed!
Thanks.
For some reason this skipped a single file, A1FBF, for me. After downloading that missing file manually I had the complete set of 1048576 files.
@dextercd Thanks for the feedback. I've added some explanation to this issue's description to aid debugging any problems with the download.
I've also added the --retry 10 and --retry-all-errors options, and my tests showed that this combination provides a complete download of all files. For some reason just the --retry 10 option seemed not to be enough.
I've tested this (with the --retry 10 and --retry-all-errors options) on a pretty different platform to double-check that it works as advertised: I ran it on my Google Pixel 6a phone (Android 15) via Termux and sshd, using the WiFi of my residential internet access (merely 150 Mbps).
It ran for 34 minutes (likely due to the more modest network parameters) and finished successfully: all 1048576 files were downloaded (with OK-looking file sizes). :)
Nice work. In my experience awk is slow. So I tried with sed, which is usually faster. However, sed can't open multiple files at once and be aware of their filename. So xargs with -I{} will call sed once per file, which ends up being slightly slower than your awk.
However, I managed to get a 25-30% speedup by moving the FILENAME manipulation and the removal of the \r from awk:
oliver@alpha:~/hibp/data$ time find . -type f -printf '%f\n' | egrep -ia '^[0-9a-f]{5}$' | head -n10000 | xargs awk -F: '{ print FILENAME $1 ":" $2 }' | tr -d '\r' > hibp_all2.txt
real 0m13.353s
user 0m13.062s
sys 0m1.846s
oliver@alpha:~/hibp/data$ time find . -type f -print | egrep -ia '/[0-9a-f]{5}$' | head -n10000 | xargs -r -d '\n' awk -F: '{ sub(/\r$/,""); print substr(FILENAME, length(FILENAME)-4, 5) $1 ":" $2 }' > hibp_all.txt
real 0m18.645s
user 0m17.648s
sys 0m0.917s
oliver@alpha:~/hibp/data$ diff hibp_all.txt hibp_all2.txt
[ no differences ]
The above is shown for 10000 files. So the suggested "improved" command line is:
find . -type f -printf '%f\n' | egrep -ia '^[0-9a-f]{5}$' | xargs awk -F: '{ print FILENAME $1 ":" $2 }' | tr -d '\r' > hibp_all.txt
@oschonrock
In my experience awk is slow. So I tried with sed which is usually faster. However, sed can't open multiple files at once and be aware of their filename. So xargs with -I{} will call sed once per file, which ends up being slightly slower than your awk.
That's the exact reason (1 million invocations of sed) why I didn't even give sed a try.
However I managed to get 25-30% speedup by moving the FILENAME manipulation and the removal of the \r from awk
Great! :)
You've motivated me to spend some time/effort on performance optimization too. :)
As a first step, I've collected the input filenames into a file so that the measurements are influenced as little as possible by other factors.
find . -type f -printf "%f\\n" | egrep -a "^[0-9A-F]{5}\$" | LC_ALL=C sort > ../filenames.txt
I've also created a ramdisk and copied the input files to it, so my SSD won't be a factor either. In practice this didn't make much difference though, as you'll see from the measurements.
I've included a run for the same code and both using my SSD and using the ramdisk, but the difference is within the margin of error (of measurement accuracy). For the other/later tests I only used the ramdisk.
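The ramdisk was presumably a tmpfs mount along these lines; the mount point and size below are illustrative (the size has to exceed the data set you copy onto it), not the exact commands I used:
# Hypothetical tmpfs ramdisk setup.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=40G tmpfs /mnt/ramdisk
cp -r /path/to/hibp/data /mnt/ramdisk/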
I've done quite a few variations.
# My original variant.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" awk -F: "{ sub(/\\r\$/,\"\"); print substr(FILENAME, length(FILENAME)-4, 5) \$1 \":\" \$2 }" > ../hibp_all_01.txt'
# run #1 (ssd)
real 835.48
user 773.24
sys 37.69
# run #2 (ramdisk)
real 824.02
user 760.14
sys 34.56
# This is your (@oschonrock) version.
# Removed the cutting of the file paths since they already contain only the filenames, and moved the handling of carriage returns out of Awk and into a `tr` command.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" awk -F: "{ print FILENAME \$1 \":\" \$2 }" | tr -d "\\r" > ../hibp_all_02.txt'
# run #1 (ramdisk)
real 534.57
user 548.57
sys 54.11
# run #2 (ramdisk)
real 547.37
user 548.78
sys 54.11
# Removed the splitting of the lines (into fields) since we don't actually change the content. Originally I used fields so I could swap them if necessary.
# This resulted in a significant speed improvement.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" awk -F_ "{ print FILENAME \$0 }" | tr -d "\\r" > ../hibp_all_03.txt'
# run #1 (ramdisk)
real 144.48
user 148.94
sys 51.33
# run #2 (ramdisk)
real 176.83
user 152.15
sys 52.85
# Replaced the invocation of `tr` with Awk code.
# Note: the last line of the input files has no (carriage return) line ending which requires a conditional check.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" awk -F_ "{ n = length(\$0); c = substr(\$0, n, 1); if (c == \"\\r\") { print substr(FILENAME, length(FILENAME)-4, 5) substr(\$0, 1, n-1) } else { print substr(FILENAME, length(FILENAME)-4, 5) \$0 } }" > ../hibp_all_04.txt'
# run #1 (ramdisk)
real 360.04
user 325.25
sys 31.98
# run #2 (ramdisk)
real 372.78
user 333.47
sys 32.89
# Removed cutting of the file paths.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" awk -F_ "{ n = length(\$0); c = substr(\$0, n, 1); if (c == \"\\r\") { print FILENAME substr(\$0, 1, n-1) } else { print FILENAME \$0 } }" > ../hibp_all_05.txt'
# run #1 (ramdisk)
real 310.47
user 277.24
sys 31.71
# run #2 (ramdisk)
real 322.22
user 281.16
sys 32.67
# `sed` performed an order of magnitude worse for me (than any other approach).
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -I{} sed -r "s#(.*[^\\r])\\r?\$#{}\\1#;\$a\\" "{}" > ../hibp_all_06.txt'
# run #1 (ramdisk)
real 5718.04
user 4959.86
sys 732.44
# Re-added cutting of the file paths to the 3rd variant to see how much slower this makes the processing.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" awk -F_ "{ print substr(FILENAME, length(FILENAME)-4, 5) \$0 }" | tr -d "\\r" > ../hibp_all_07.txt'
# run #1 (ramdisk)
real 225.38
user 213.45
sys 53.76
# run #2 (ramdisk)
real 225.76
user 211.46
sys 53.21
# Using the GNU Parallel tool (see: https://www.gnu.org/software/parallel/) to split the input processing between multiple instances of the Awk script.
# I had to use a file for the Awk script, because I couldn't make it work with Parallel otherwise.
echo "{ print FILENAME \$0 }" > ../merge.awk
/usr/bin/time -p sh -c 'cat ../filenames.txt | parallel -k -m awk -F_ -f ../merge.awk | tr -d "\\r" > ../hibp_all_08.txt'
# run #1 (ramdisk)
real 257.27
user 259.53
sys 93.95
# run #2 (ramdisk)
real 188.13
user 260.53
sys 94.89
# Removed the "-k" option from `parallel` to see whether it makes a difference if we don't care about output order.
/usr/bin/time -p sh -c 'cat ../filenames.txt | parallel -m awk -F_ -f ../merge.awk | tr -d "\\r" > ../hibp_all_09.txt'
# run #1 (ramdisk)
real 171.47
user 257.66
sys 94.26
# A very simple Python implementation. I'm not a Python expert though, so it is possible that it can be significantly improved.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" ../merge.py > ../hibp_all_10.txt'
# run #1 (ramdisk)
real 352.35
user 321.56
sys 25.73
# Running multiple instances of the Python script in parallel.
/usr/bin/time -p sh -c 'cat ../filenames.txt | parallel -k -m ../merge.py > ../hibp_all_11.txt'
# run #1 (ramdisk)
real 204.11
user 785.95
sys 92.09
# A simple Perl implementation, but I'm not a Perl expert either.
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" ../merge.pl > ../hibp_all_12.txt'
# run #1 (ramdisk)
real 372.66
user 345.37
sys 25.23
# Multiple parallel instances of the Perl implementation.
/usr/bin/time -p sh -c 'cat ../filenames.txt | parallel -k -m ../merge.pl > ../hibp_all_13.txt'
# run #1 (ramdisk)
real 218.11
user 877.60
sys 86.22
# A simple Golang implementation.
# CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o merge_go merge.go
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" ../merge_go > ../hibp_all_14.txt'
# run #1 (ramdisk)
real 835.03
user 402.81
sys 513.66
# Multiple parallel instances of the Golang implementation.
/usr/bin/time -p sh -c 'cat ../filenames.txt | parallel -k -m ../merge_go > ../hibp_all_15.txt'
# run #1 (ramdisk)
real 404.63
user 811.26
sys 1594.51
Here's the code of merge.py:
#!/usr/bin/env python3

import sys
import os

# Check if any files were provided as arguments
if len(sys.argv) < 2:
    print("Usage: {} file1 file2 ...".format(sys.argv[0]))
    sys.exit(1)

# Iterate over each file provided in the command line arguments
for file_path in sys.argv[1:]:
    try:
        # Open the file for reading
        with open(file_path, 'r', encoding='ascii') as file:
            # Extract just the filename from the file path
            file_name = os.path.basename(file_path)
            # Process each line in the file
            for line in file:
                # Strip the CR+LF (or any new line character)
                line = line.rstrip('\r\n')
                # Print the filename and the line with only LF at the end
                print(f"{file_name}{line}")
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
    except IOError as e:
        print(f"Error reading file '{file_path}': {e}")
Here's the code of merge.pl:
#!/usr/bin/env perl

use strict;
use warnings;

# Check if any files were provided as arguments
if (@ARGV == 0) {
    die "Usage: $0 file1 file2 ...\n";
}

# Iterate over each file provided in the command line arguments
foreach my $file_path (@ARGV) {
    # Open the file for reading
    open(my $fh, '<', $file_path) or die "Could not open file '$file_path': $!\n";

    # Extract just the filename from the file path
    my ($file_name) = $file_path =~ m|([^/\\]+)$|;

    # Process each line in the file
    while (my $line = <$fh>) {
        # Remove the CR+LF at the end of the line
        chomp $line;
        $line =~ s/\r$//; # Removes the carriage return if present

        # Print the filename and the line with only LF at the end
        print "$file_name$line\n";
    }

    # Close the file handle
    close($fh);
}
And here's the code of merge.go:
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Check if any files were provided as arguments
	if len(os.Args) < 2 {
		fmt.Printf("Usage: %s file1 file2 ...\n", filepath.Base(os.Args[0]))
		os.Exit(1)
	}

	// Iterate over each file provided in the command line arguments
	for _, filePath := range os.Args[1:] {
		// Open the file
		file, err := os.Open(filePath)
		if err != nil {
			fmt.Printf("Error: Could not open file '%s': %v\n", filePath, err)
			continue
		}
		defer file.Close()

		// Extract the filename from the file path
		fileName := filepath.Base(filePath)

		// Use bufio.Scanner to read the file line by line
		scanner := bufio.NewScanner(file)
		for scanner.Scan() {
			// Get the line and remove any CR+LF (Windows-style newlines)
			line := scanner.Text()
			line = strings.TrimRight(line, "\r")

			// Print the filename and the line with only LF at the end
			fmt.Printf("%s%s\n", fileName, line)
		}

		// Check if there were any errors during file reading
		if err := scanner.Err(); err != nil {
			fmt.Printf("Error reading file '%s': %v\n", filePath, err)
		}
	}
}
It seems that the best performance came from this:
cat ../filenames.txt | xargs -r -d "\\n" awk -F_ "{ print FILENAME \$0 }" | tr -d "\\r" > ../hibp_all_03.txt
So the best commandline seems to be this (this one must be executed in the directory where the HIBP files were downloaded):
find . -type f -printf "%f\\n" | egrep -a "^[0-9A-F]{5}\$" | LC_ALL=C sort | xargs -r -d "\\n" awk -F_ "{ print FILENAME \$0 }" | tr -d "\\r" > hibp_all.txt
Great stuff... I particularly like the sort of filenames before merge, that's essential!
You may also be interested in a little project I have been working on which is all about working with this dataset.
https://github.com/oschonrock/hibp
The key is... it uses "binary storage", which means the on disk size is halved. And binary means that each "record" is the same size, 24 bytes, and that means we can use "binary search". So it does that and provides a "local http server" you can connect to for queries.
The hibp_download tool uses libcurl with curl_multi under the hood, which is what curl --parallel uses, so it benefits from libcurl's great implementation of HTTP2 multiplexing, which makes this great speed possible. I get about a 13min download from a cheap Virtual Machine on Fasthosts. It buffers up to 300 (by default, but tunable) text file contents in memory, so that it can write straight to a continuous file on disk, while keeping the order correct! hibp_download uses a second thread to convert the text into the binary format and write it to disk, so it keeps the network busy and the bloaty text versions never see the disk. The 13min includes download, ordering, conversion to binary, and writing to disk as a single file.
hibp_server is a very lean server with only a ~5MB resident memory footprint, and it can serve ~1k req/s on a single, old CPU, assuming a common consumer-grade SSD.
All the tools are very lean on memory and disk space. All you need is enough disk space for one copy of the binary DB (current ~ 21GB) and a tiny amount of memory (about ~50MB during download, and ~5MB during serve). My goal was to make this feasible on a "small" size Virtual Machine, which you can get for a few $ per month, say 50GB SSD, 2 cores, and 2GB memory.
I have build instructions for Linux, FreeBSD and Windows, and it is tested on those. If this is popular I am happy to maintain packages.
Feedback very welcome!
Great project! :)
My goal was to make this feasible on a "small" size Virtual Machine, which you can get for a few $ per month, say 50GB SSD, 2 cores, and 2GB memory.
You can get a VM with 4 vcpus, 24 GB RAM and 50 GB storage (don't remember whether it's HDD or SSD) for free from Oracle Cloud. :) I have been using this for over a year now. They don't even charge you for egress traffic (Google does even in its "always-free" plan) ... or at least not at the amount of traffic I'm generating.
I will check that out. Interesting that disk size is the limiting factor. This DB is so large that even just one (text) copy of it almost fills that VM. And your curl + find + awk combiner temporarily requires 2 copies on disk, which blows the storage budget.
If you had time to build my project and try it out on your VM, I would really appreciate it, and the feedback will help improve it!
Thanks!
# A simple Golang implementation.
# CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o merge_go merge.go
/usr/bin/time -p sh -c 'cat ../filenames.txt | xargs -r -d "\\n" ../merge_go > ../hibp_all_14.txt'
# run #1 (ramdisk)
real 835.03
user 402.81
sys 513.66
I am amazed the go version didn't do better. Much slower than python, perl and find+awk. Something not right there...?
In any case, what I realised when coding my project is that there are two slow things here: network and disk. We want to use the disk as little as possible and write to disk while we are downloading, so that time is essentially free. My code also does the conversion to binary, and that is free too.
I am amazed the go version didn't do better. Much slower than python, perl and find+awk. Something not right there...?
I've found this interesting too. No idea why it came out this way. Didn't spend any time on optimizing the Python/Perl/Go implementations for performance.
The key is... it uses "binary storage", which means the on disk size is halved. And binary means that each "record" is the same size, 24 bytes, and that means we can use "binary search".
Storing the data in a proprietary binary format is not necessarily an advantage. It consumes less space and can be more efficient to search, but requires custom code/tooling to process, while a text-based format can be more easily processed with a number of tools. There're pros and cons on both sides.
Storing the data in a proprietary binary format is not necessarily an advantage. It consumes less space and can be more efficient to search, but requires custom code/tooling to process, while a text-based format can be more easily processed with a number of tools. There're pros and cons on both sides.
That's true of course. I can easily add a switch to spit out the text version. It would still be faster and use half the disk space (because no need for temp copy). Let me know if that is of interest?
What code/tooling exists to do something useful with the text file?
That's what I never understood. I started using this dataset many years ago, and used to insert it into a MySQL DB with the SHA1 as the primary index, and I used to convert it into binary on the way into the DB. That process of importing was super slow and almost started breaking as the data grew, and the queries were like avg 45ms, quite slow. That's why I started with this custom tool, for querying, and later (because of your curl idea) for the download as well.
What do you do with 38GB of text?
Any script, language, etc. can make an HTTP request; that's why I chose that. But a "PHP extension" or a Python/JS module with a C backend could also be good.
If you need the hashes for continuous use (e.g. integrated into an authentication system), then I agree that storing it in a performance-optimized format makes sense.
If you need it just for a one-off examination, research, etc., then dealing with a custom storage format might be an unnecessary overhead.
What do you do with 38GB of text?
As demonstrated: you can grep it or write a shell/awk script to process it, etc. Usually I can do some light processing (statistics, etc.) faster in shell scripting or Awk than by writing program code (Python, Go, etc.) for the same task. Of course, if I want something for a longer period of use, a performance-optimized solution is better.
It depends on your use case.
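For example, a one-off lookup in the merged text file needs nothing more than sha1sum and grep (using the hibp_all.txt produced by the pipeline above; the password is just an illustration):
# Look up a single password in the merged dump; lines are <uppercase SHA-1>:<count>.
hash=$(printf '%s' 'correct horse battery staple' | sha1sum | awk '{ print toupper($1) }')
grep -m 1 "^${hash}:" hibp_all.txt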
As demonstrated: you can grep it or write a shell/awk script to process it, etc.
I agree, of course. It's the UNIX way. I just literally can't think of a use case for this data, other than "look up a password and get the count". The SHA1 is non-intelligible by design.
Your use case above is literally "download the data and combine it", i.e. "just getting the data in the first place". I can't think of another one.
But I guess you can never imagine all use cases... I will add a "text export" switch, both during download and also if you already have the binary version. That way you never have to "store the 38GB of text" unless you specifically want to, and can just pipe it to the next process.
Just working on "resuming downloads"... some users have reported that their internet is so flaky they can't get through the 38GB without a failure (even with retries), so "resume" does seem useful and is not so hard. Will add --text-export after that.
Troy has made such a fabulous resource available here, but unless you are happy with his online API (security questions...), using it is actually quite hard: the .NET downloader, the file size, what to query it with... That's what I was trying to solve.
@muzso
FYI: hibp_download --text-out and hibp_download --resume now both work. Although not together... i.e. you can't resume a text download.
Hi!
Thanks for HIBP and this downloader. At first I was considering using it, but the API of HIBP passwords is so easy that I wrote a small shell script for it. It was slow as hell because it had no parallelism at all, and far too slow for my taste, so I was thinking about adding parallelism. And that's when I stumbled on curl's URL globbing feature.
curl is the Swiss Army knife of HTTP downloading: it supports patterns/globbing, massive parallelism, and pretty much every aspect of HTTP downloads (proxies, HTTP/1/2/3, all SSL/TLS versions, etc.).
Here's a single curl commandline that downloads the entire HIBP password hash database into the current working directory:
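A sketch of what such a command can look like, built from the options mentioned elsewhere in this thread (URL globbing, --remote-name-all, --parallel, --retry 10, --retry-all-errors); the --parallel-max value is a guess, not necessarily what was used originally:
# Reconstructed sketch, not the exact original command.
# One hex digit expressed as a curl globbing list (curl globbing has no [0-9A-F] character class):
H='{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}'
curl -s --remote-name-all --parallel --parallel-max 150 \
     --retry 10 --retry-all-errors \
     "https://api.pwnedpasswords.com/range/${H}${H}${H}${H}${H}"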
If you want to be able to check the curl output/log, remove the -s option and redirect the output to a file (see the variant below).
To debug issues, you can add the -v (or the --trace-ascii -) option to increase verbosity.
By default curl does not retry any requests, hence the --retry 10 and --retry-all-errors options.
On an always-free Oracle Cloud VM this finished (for me) in 13.5 minutes. ;-)
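A non-silent variant with the curl log captured in a file might look like this (the log file name is illustrative; the rest mirrors the sketch above):
# Same download, but without -s, and with curl's progress/log output saved to a file.
H='{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}'
curl --remote-name-all --parallel --retry 10 --retry-all-errors \
     "https://api.pwnedpasswords.com/range/${H}${H}${H}${H}${H}" > curl.log 2>&1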
The URL globbing and the --remote-name-all options have been around in curl for ages (i.e. for over a decade), and the --parallel* options were added in Sep 2019 (v7.66.0). So pretty much all "recent" Linux distros already contain a curl version that fully supports this commandline.
Curl is cross-platform, e.g. you can download a Windows version too.
If you don't have the necessary 30-40 GB free space for the entire hash dump, you can get away with less by downloading in smaller batches and instantly compressing them.
Here's a command that you can fire and forget (i.e. disconnect / log out) on any Linux PC/server and it'll do the job in batches:
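Something along these lines would work; the batching by the first hex digit, the archive names and the log file name are all illustrative, not the original command:
# Hypothetical fire-and-forget batch download: 16 batches of 65536 files each,
# each batch compressed into a tar.gz and its text files removed before the next batch starts.
nohup sh -c '
  H="{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}"
  for p in 0 1 2 3 4 5 6 7 8 9 A B C D E F; do
    curl -s --remote-name-all --parallel --retry 10 --retry-all-errors \
         "https://api.pwnedpasswords.com/range/${p}${H}${H}${H}${H}"
    tar -czf "hibp_${p}.tar.gz" ${p}???? && rm -f ${p}????
  done
' > hibp_batches.log 2>&1 &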
This doesn't support suspend and resume of the download job (other HIBP downloaders do), but since it finishes pretty quickly (if you have a good enough internet connection), I don't see any reason for this feature.
You can easily assemble the server responses into a single ~38 GB file with the following commandline (on Linux):
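Presumably this is the merge pipeline worked out earlier in this thread, run in the directory containing the downloaded range files, e.g.:
# Merge all range files into one, prefixing each line with the file name (the hash prefix)
# and dropping carriage returns; output lines look like <40-char SHA-1>:<count>.
find . -type f -printf '%f\n' | egrep -a '^[0-9A-F]{5}$' | LC_ALL=C sort \
  | xargs -r -d '\n' awk -F_ '{ print FILENAME $0 }' | tr -d '\r' > hibp_all.txt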
You can sort it easily based on the second field to get the most "popular" hashes:
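For example, with GNU sort (the count is the second colon-separated field; the -S and --parallel flags are optional tuning, not required):
# Sort by the occurrence count (field 2), highest first.
LC_ALL=C sort -t: -k2,2nr -S 2G --parallel=4 hibp_all.txt > hibp_all_by_count.txt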
To get just the most popular hashes:
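For example, the top 100 (the cut-off is arbitrary):
# Top 100 most common password hashes with their counts.
sort -t: -k2,2nr hibp_all.txt | head -n 100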
You can feed these into a pre-computed lookup table (like what crackstation.net provides) to get the most popular breached passwords. :)