andrewlock / PwnedPasswords

An ASP.NET Core Identity validator that checks for PwnedPasswords
MIT License

Provide a utility to extract the downloadable password file #10

Open SeanFarrow opened 5 years ago

SeanFarrow commented 5 years ago

We need to provide a utility to extract the downloadable file containing the list of pwned passwords and counts.

This should ideally be written in .NET Core 3 to take advantage of C# 8's IAsyncEnumerable feature.

We should produce a single file per password prefix.

It should also be runnable in docker with a configurable output folder to allow placing the created files on a file system outside the docker container.
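The splitting step could be sketched like this. This is an illustrative Python sketch, not the proposed .NET Core tool, and it assumes the hash-ordered variant of the dump (rather than the ordered-by-count file linked below), so that all lines sharing a prefix are contiguous and only one output file needs to be open at a time:

```python
import os

def split_sorted_by_prefix(source_path: str, output_dir: str, prefix_len: int = 5) -> int:
    """Split a hash-ordered "HASH:COUNT" dump into one file per hash prefix.

    Writes each line's suffix (plus count) to a file named after its prefix,
    mirroring the 5-character prefix the k-anonymity API uses.
    Returns the number of prefix files created.
    """
    os.makedirs(output_dir, exist_ok=True)
    current_prefix, out, count = None, None, 0
    with open(source_path, "r", encoding="ascii") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue
            prefix, rest = line[:prefix_len], line[prefix_len:]
            if prefix != current_prefix:
                # Input is ordered by hash, so a new prefix means the
                # previous prefix's file is complete and can be closed.
                if out:
                    out.close()
                out = open(os.path.join(output_dir, prefix + ".txt"), "w")
                current_prefix, count = prefix, count + 1
            out.write(rest + "\n")
    if out:
        out.close()
    return count
```

Because the output folder is just a plain directory, pointing it at a Docker volume mount would satisfy the "configurable output folder" requirement.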

andrewlock commented 5 years ago

I'm not sure I understand this tbh! 🤔

We need to provide a utility to extract the downloadable file containing the list of pwned passwords and counts.

Isn't that what HaveIBeenPwned provides, letting you download the files? If we want to provide a file-based approach instead of the API approach, I think we could achieve that using a Bloom Filter, like I was looking at in a branch. That can massively compress the amount of space required at the expense of occasional false positives (i.e. a password deemed pwned when it actually hasn't been).
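For context, the Bloom filter idea fits in a few lines. This is a toy sketch, not the code from the branch: k hash functions set k bits per item, so lookups can report false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item over a fixed bit array."""

    def __init__(self, size_bits: int, num_hashes: int):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k independent positions by salting SHA-1 with an index.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always definitive.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

A password checker would add every pwned hash once at startup, then answer each validation request from memory with no file I/O at all.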

SeanFarrow commented 5 years ago

If we are going to provide a locally available API to allow people to check whether a password is pwned, we need to provide a way to extract/split the passwords available in the downloadable zip file at: https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-count-v4.7z

Iā€™m proposing we write a utility to do this as per the original issue.

Does that make more sense?


andrewlock commented 5 years ago

Ah ok, I see what you mean now. I'm just not sure this is the way to go!

Even in initial tests with small files, loading from a file on every request was far too slow. Fragmenting the file will give you faster random access, but I'm sceptical it will be sufficient.

I feel like the bloom filter approach is more practical as it reduces the data usage by orders of magnitude, so you can have the whole thing in memory. There are trade-offs for that, but it still seems like the preferable approach. I already wrote the tool for loading the data into a bloom filter, but never finished it off, due to strange behaviour at some false positive rates.
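The "orders of magnitude" claim can be sanity-checked with the standard Bloom filter sizing formula (the ~550M item count below is an approximation for the v4-era dataset, not an exact figure):

```python
import math

def bloom_size_bytes(n_items: int, fp_rate: float) -> int:
    """Optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    return int(bits // 8) + 1

# Roughly 550M pwned SHA-1 hashes:
print(bloom_size_bytes(550_000_000, 0.01) / 1e9)   # ~0.66 GB at 1% FP
print(bloom_size_bytes(550_000_000, 0.001) / 1e9)  # ~0.99 GB at 0.1% FP
```

Compare that with tens of gigabytes for the raw hash:count text file, and an in-memory filter looks feasible on ordinary hardware, at the cost of the chosen false-positive rate.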

That said, obviously feel free to give it a go! 🙂

SeanFarrow commented 5 years ago

Is there a branch with your bloom filter code on? I know Troy has gone down the file route, see: here, particularly the heading Table Storage Versus Blob Storage, so I initially figured I'd replicate that.

If the bloom filter approach is more performant, then I'm happy to go with that, but I'd like some definitive numbers/benchmarks either way.

We may also want to think about caching the file contents using standard http cache constructs.

andrewlock commented 5 years ago

Is there a branch with your bloom filter code on?

https://github.com/andrewlock/PwnedPasswords/tree/multi-bloom-filter

It's very rough and ready; it was just me playing around, not for-public-consumption stuff!

I know Troy has gone down the file route, see: here, particularly the heading Table Storage Versus Blob Storage, so I initially figured I'd replicate that.

There's a big difference between the local file system and Blob storage though, and the cost of storing all of this in Blob storage may make it prohibitive. Not sure... Obviously it's an option; if you/others will find it useful then who am I to say no! 😃

If the bloom filter approach is more performant, then I'm happy to go with that, but I'd like some definitive numbers/benchmarks either way.

I can say categorically it will be more performant, but I can't give you numbers. Also, it's not like-for-like: Bloom filters have a false-positive rate, and for efficiency you'd probably only add a subset of the passwords, e.g. those that appear at least twice.

We may also want to think about caching the file contents using standard http cache constructs.

I don't think caching will get you far here unless you're working at Troy's scale. You'll likely (by design) end up with a very low cache hit ratio, so just adding complexity and layers for the very rare case you get a hit. But hard to say without measuring!