andrewlock / PwnedPasswords

An ASP.NET Core Identity validator that checks for PwnedPasswords
MIT License
103 stars 12 forks source link

Loading from files is far too slow #1

Closed andrewlock closed 5 years ago

andrewlock commented 6 years ago

If you have the full 12GB file, then searching through the file takes far too long to be acceptable in a web environment.

Could look at parallelism, but I suspect that won't help the problem much. May have to enforce loading into memory (big memory issue there) or add a DB service (I like this idea as an option anyway).

SeanFarrow commented 6 years ago

Hey Andrew,

Working on the database as that's what we're using, can't use files as we're writing a distributed application. The only issue I have is do we mae ef core a part ofthis or split the lib in to 3 projects (contracts, core and ef)? Regarding the reading of files I have an idea so will try to spike something tomorrow. Do you want me to fork, or are you happy to add me as a collaborator?

andrewlock commented 6 years ago

@SeanFarrow Just added you as a collaborator :)

I'd be interested to hear the exact use cases you have for this. In my mind I was thinking of the following:

  1. The user just wants to call the HaveIBeenPwned API directly - no need to mess with the data, but limited by the 1500ms per call. Probably ok for most web apps
  2. User wants to deploy the password files with the app and access directly, to avoid calling the API. Problem is the size of the files makes them prob untenable for loading into memory. Interested to hear your thoughts on that one!
  3. Load the data into a DB and query directly using a custom IHaveIBeenPwnedService implementation.
  4. Call a "mirror" of the HaveIBeenPwned API from the validator.

From my point of view, I'm most interested in the last point. I would stand up a simple ASP.NET Core web app that mirrors the semantics of Troy's public API, but is just reading from a DB etc. This would work best from the distributed app point of view, and means the Validator service doesn't need knowledge of any particular API etc.

I think point 4. is prob also what you're after, though you're maybe describing 3.?

In terms of architecture, I think if you want to have EF specific code in there, it will have to be a separate package. A bit annoying as we end up with many different projects, but prob the only way to avoid escalating dependencies.

I'll try and put together the example mirror API in the next couple of days. I'll prob make it a simple middleware so that you can easily just drop it into any app, wire up an IHaveIBeenPwnedService and have everything just work.

andrewlock commented 6 years ago

Also, RE your comment about the strategy pattern - that's pretty much just being achieved using DI atm. Just register the IHaveIBeenPwnedService implementation (strategy) that you want, and it'll get injected into the Validator.

SeanFarrow commented 6 years ago

Currently looking at point 3, although 4 could be interesting! I’ve got several api’s so it’s another microservice to deploy, plus given we’re using identity we already have a database.

From: Andrew Lock [mailto:notifications@github.com] Sent: Monday, September 18, 2017 14:40 To: andrewlock/HaveIBeenPwnedValidator HaveIBeenPwnedValidator@noreply.github.com Cc: Sean Farrow sean.farrow@seanfarrow.co.uk; Mention mention@noreply.github.com Subject: Re: [andrewlock/HaveIBeenPwnedValidator] Loading from files is far too slow (#1)

@SeanFarrowhttps://github.com/seanfarrow Just added you as a collaborator :)

I'd be interested to hear the exact use cases you have for this. In my mind I was thinking of the following:

  1. The user just wants to call the HaveIBeenPwned API directly - no need to mess with the data, but limited by the 1500ms per call. Probably ok for most web apps
  2. User wants to deploy the password files with the app and access directly, to avoid calling the API. Problem is the size of the files makes them prob untenable for loading into memory. Interested to hear your thoughts on that one!
  3. Load the data into a DB and query directly using a custom IHaveIBeenPwnedService implementation.
  4. Call a "mirror" of the HaveIBeenPwned API from the validator.

From my point of view, I'm most interested in the last point. I would stand up a simple ASP.NET Core web app that mirrors the semantics of Troy's public API, but is just reading from a DB etc. This would work best from the distributed app point of view, and means the Validator service doesn't need knowledge of any particular API etc.

I think point 4. is prob also what you're after, though you're maybe describing 3.?

In terms of architecture, I think if you want to have EF specific code in there, it will have to be a separate package. A bit annoying as we end up with many different projects, but prob the only way to avoid escalating dependencies.

I'll try and put together the example mirror API in the next couple of days. I'll prob make it a simple middleware so that you can easily just drop it into any app, wire up an IHaveIBeenPwnedService and have everything just work.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/andrewlock/HaveIBeenPwnedValidator/issues/1#issuecomment-330224279, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABY1ftSsY0VnV_i1TRn7Ym0qc-_O9wsGks5sjnKogaJpZM4PaT1e.

SeanFarrow commented 6 years ago

Also, could you not use File.ReadLines and concat all lines in all files using linq, surely this would be usable?

andrewlock commented 6 years ago

The problem is that the files are 12GB - you can't really read that all that into memory for a web app (I physically can't on my dev machine, only 16GB memory!). Having said that, the file approach clearly doesn't work with files that size either!

SeanFarrow commented 6 years ago

Ok, should we can that and just have an api with a database behind it? Either that or have a service as we have now with a strategy (which OOTB is a database). That way if people really want to implement the file based approach, with all its caveats, they can.

From: Andrew Lock [mailto:notifications@github.com] Sent: Monday, September 18, 2017 18:59 To: andrewlock/HaveIBeenPwnedValidator HaveIBeenPwnedValidator@noreply.github.com Cc: Sean Farrow sean.farrow@seanfarrow.co.uk; Mention mention@noreply.github.com Subject: Re: [andrewlock/HaveIBeenPwnedValidator] Loading from files is far too slow (#1)

The problem is that the files are 12GB - you can't really read that all that into memory for a web app (I physically can't on my dev machine, only 16GB memory!). Having said that, the file approach clearly doesn't work with files that size either!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/andrewlock/HaveIBeenPwnedValidator/issues/1#issuecomment-330306140, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABY1flsH78kfgvfdUnq9NOyeGDYetP6pks5sjq90gaJpZM4PaT1e.

andrewlock commented 6 years ago

Yeah, I think having the Validator use the API as the default option is best for the basic package. A separate package that implements an IHaveBeenPwnedApiService with a database on top of that covers use case 3, and then a separate package for creating your own HaveIBeenPwned API (that can be backed by a DB too).

SeanFarrow commented 6 years ago

Ok, I’ll start work on the database. I’ll let you start work on the api.

From: Andrew Lock [mailto:notifications@github.com] Sent: Tuesday, September 19, 2017 19:30 To: andrewlock/HaveIBeenPwnedValidator HaveIBeenPwnedValidator@noreply.github.com Cc: Sean Farrow sean.farrow@seanfarrow.co.uk; Mention mention@noreply.github.com Subject: Re: [andrewlock/HaveIBeenPwnedValidator] Loading from files is far too slow (#1)

Yeah, I think having the Validator use the API as the default option is best for the basic package. A separate package that implements an IHaveBeenPwnedApiService with a database on top of that covers use case 3, and then a separate package for creating your own HaveIBeenPwned API (that can be backed by a DB too).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/andrewlock/HaveIBeenPwnedValidator/issues/1#issuecomment-330629736, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABY1flfSPe2j9NCyQVn7uYBjoRv23LPtks5skAgXgaJpZM4PaT1e.

andrewlock commented 6 years ago

@SeanFarrow I've created a PR #4 to add an API server that mirrors the HaveIBeenPwnedAPI. I had to rearrange things a bit. Did you want to take a look before I merge?

SeanFarrow commented 6 years ago

Hi,

Yes, please, give me a couple of days, just reformatting a machine!

From: Andrew Lock [mailto:notifications@github.com] Sent: Sunday, October 01, 2017 19:38 To: andrewlock/HaveIBeenPwnedValidator HaveIBeenPwnedValidator@noreply.github.com Cc: Sean Farrow sean.farrow@seanfarrow.co.uk; Mention mention@noreply.github.com Subject: Re: [andrewlock/HaveIBeenPwnedValidator] Loading from files is far too slow (#1)

@SeanFarrowhttps://github.com/seanfarrow I've created a PR #4https://github.com/andrewlock/HaveIBeenPwnedValidator/pull/4 to add an API server that mirrors the HaveIBeenPwnedAPI. I had to rearrange things a bit. Did you want to take a look before I merge?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/andrewlock/HaveIBeenPwnedValidator/issues/1#issuecomment-333396973, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABY1fg854lNpC9Xb3Qmi94zq1fpsi15bks5sn9vwgaJpZM4PaT1e.

andrewlock commented 6 years ago

Hi @SeanFarrow, it's been a long while since we looked at this!

Not sure if you've kept up to speed with Troy's changes and the anonymous endpoint? If not, it means passwords are never sent to the API server, there's no rate limit, and the speed is incredibly quick (<20ms).

For me, that completely removes the need for the file/database APIs, and I'm thinking that will be true for most people. Frankly, I can't see the speed of a file-based service ever being sufficient, and the complexity of a database-based service being worth the (questionable) anonymity, performance or reliability gains.

I'm wondering how you feel about it? The project's been languishing here for a while, and updating to just wrap the API would definitely simplify things, and remove some of the issues. I'd like to push up a release to NuGet this weekend if possible

SeanFarrow commented 6 years ago

Hi,

I definitely agree. Also, given that 2.1 is out and the benefits that brings with polly and retries, it makes sense to just create an API client to wrap the api. I’ve been meaning to do this for a while, but am currently with a client and heading to Norway for NDC, so time is limited. I did look at writing some code to splitthe the file as troy does, but it took an age to run. Hope that clarifies things.

From: Andrew Lock notifications@github.com Sent: 03 June 2018 06:47 To: andrewlock/HaveIBeenPwnedValidator HaveIBeenPwnedValidator@noreply.github.com Cc: Sean Farrow sean.farrow@seanfarrow.co.uk; Mention mention@noreply.github.com Subject: Re: [andrewlock/HaveIBeenPwnedValidator] Loading from files is far too slow (#1)

Hi @SeanFarrowhttps://github.com/SeanFarrow, it's been a long while since we looked at this!

Not sure if you've kept up to speed with Troy's changes and the anonymous endpoint? If not, it means passwords are never sent to the API server, there's no rate limit, and the speed is incredibly quick (<20ms).

For me, that completely removes the need for the file/database APIs, and I'm thinking that will be true for most people. Frankly, I can't see the speed of a file-based service ever being sufficient, and the complexity of a database-based service being worth the (questionable) anonymity, performance or reliability gains.

I'm wondering how you feel about it? The project's been languishing here for a while, and updating to just wrap the API would definitely simplify things, and remove some of the issues. I'd like to push up a release to NuGet this weekend if possible

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/andrewlock/HaveIBeenPwnedValidator/issues/1#issuecomment-394138910, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABY1flznd49LOMXfmkwqeZGPCc1Bi5Thks5t43htgaJpZM4PaT1e.

andrewlock commented 6 years ago

Cool, that was exactly my thoughts too 👍 I'll put something together in the next couple of days. Hope you enjoy NDC, I'm sure it'll be excellent!

andrewlock commented 6 years ago

Put together a typed client for the PwnedPaswords api, and stripped out all the file/db side: #6

SeanFarrow commented 6 years ago

Ok, I’ll review tonight, we can then release tomorrow, hope this is OK?

From: Andrew Lock notifications@github.com Sent: 04 June 2018 15:14 To: andrewlock/HaveIBeenPwnedValidator HaveIBeenPwnedValidator@noreply.github.com Cc: Sean Farrow sean.farrow@seanfarrow.co.uk; Mention mention@noreply.github.com Subject: Re: [andrewlock/HaveIBeenPwnedValidator] Loading from files is far too slow (#1)

Put together a typed client for the PwnedPaswords api, and stripped out all the file/db side: #6https://github.com/andrewlock/HaveIBeenPwnedValidator/pull/6

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/andrewlock/HaveIBeenPwnedValidator/issues/1#issuecomment-394369294, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABY1frHHc1u92Zkx75ZRoo5MCm1tHbUfks5t5UCXgaJpZM4PaT1e.

andrewlock commented 6 years ago

Yeah sure, no rush 🙂

SeanFarrow commented 6 years ago

@AndrewLock, All looks good, so feel free to merge/release. I can't seem to find where this is used in the sample app though. There are a few spelling mistakes in comments, but they can be fixed after release--yes, I'm being picky.

It would be good to show the use of Polly in the sample app, again this can be done after the release.

SeanFarrow commented 5 years ago

Given that we query the API, should we close this?

andrewlock commented 5 years ago

Yeah, agreed, I'll close 👍