Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Create a design proposal for the raw doc fetcher #29

Closed: mcsaucy closed this 3 years ago

mcsaucy commented 4 years ago

TL;DR

  1. store raw docs in a (content-addressed, maybe?) service
  2. define a manifest that expresses the relationship between those docs
  3. sign the manifest with a trustworthy key to anchor our data lineage
  4. store the manifest in a database or something (not really the focus of the doc)
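
To make this concrete, here's a minimal sketch of what a content-addressed manifest could look like. The field names and shape here are hypothetical, not part of the proposal:

```python
import hashlib
import json

# Illustrative only: field names and layout are placeholders. Documents are
# referenced by the SHA-256 of their raw bytes (content addressing), so the
# manifest pins down exactly which bytes it describes.
docs = {
    "roster.pdf": b"raw bytes fetched from the agency site",
    "roster.csv": b"bytes extracted from the PDF",
}

manifest = {
    "fetcher": {
        "repo": "Police-Data-Accessibility-Project/scrapers",
        "commit": "<commit sha>",
    },
    "documents": [
        {"name": name, "sha256": hashlib.sha256(data).hexdigest()}
        for name, data in docs.items()
    ],
}

# Serialize deterministically so the same manifest always signs the same way.
manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
```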

Why are we doing this?

So we have an agreed-upon path for some future work, and so it's easier for folks to ramp up and contribute to this. That said, if this is approved by everyone and we later change our minds, there's nothing stopping us from throwing this out or updating it.

Things to look out for

  1. does this dovetail well with ongoing work?
  2. will this cause any problems now or down the road?
  3. will this do what we need it to do?
  4. will this scale?

Meta

I figure a PR is better than a Google Doc for archaeology purposes (so we can look back and see exactly who correctly said "I told you so").

I'm going to throw some reviewers onto this. If you're a reviewer, please suggest changes as needed (or approve if you're content). Once everyone has approved and the discussion has stabilized to the point where everyone seems happy, we can merge this in.

If you aren't a reviewer and happen to see this, please feel free to chime in all the same!

nfrostdev commented 4 years ago

This is great! I'm trying to get more eyes on it to see if anyone has anything to nitpick.

danmelles commented 4 years ago

Overall thought: I think that creating a cataloguing system is the most important element here, because much of the challenge we're trying to address is organizing data, not so much fetching it. In theory, people could grab data manually and upload it to a data lake. Some of these issues are definitely addressed here. The reason I bring this up is that I think it's an important enough issue to warrant its own design doc, separate from anything related to scraping. Thoughts?

GitGerby commented 4 years ago

Asserting the authenticity of the data is the challenge here; processing and cataloging can be done as needed to present the underlying records in a useful manner. Ensuring that the data is correct and free from manipulation is paramount: compromised source material would fundamentally undermine the entire project.

On Sat, Jun 27, 2020, 15:57 Josh McSavaney notifications@github.com wrote:

@mcsaucy commented on this pull request.

In doc/design/record_fetcher_proposal.md https://github.com/Police-Data-Accessibility-Project/Scrapers/pull/29#discussion_r446561685 :

-Since this design requires documents to be uploaded before writing out a manifest, it's possible that we'll upload docs
-and then fail (or neglect) to upload a manifest. This results in orphaned documents, which are just wasted storage.
-To mitigate this, we'll construct a GC batch job that runs on a schedule and removes documents that are orphaned and
-have existed for longer than X time.
+Signed manifests and all related documents will be uploaded together as [MIME Multipart messages](
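
A rough sketch of how such a bundle might be assembled (placeholder names and data; not the settled upload format):

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart

# Illustrative sketch only: bundle the signed manifest and its documents into
# a single MIME multipart payload, so neither can be uploaded (or orphaned)
# without the other.
manifest_bytes = b'{"documents": []}'  # placeholder signed manifest
docs = {"roster.pdf": b"raw fetched bytes"}  # placeholder documents

bundle = MIMEMultipart()
manifest_part = MIMEApplication(manifest_bytes, _subtype="json")
manifest_part.add_header("Content-Disposition", "attachment",
                         filename="manifest.json")
bundle.attach(manifest_part)

for name, data in docs.items():
    doc_part = MIMEApplication(data)  # defaults to application/octet-stream
    doc_part.add_header("Content-Disposition", "attachment", filename=name)
    bundle.attach(doc_part)

payload = bundle.as_bytes()  # one upload unit: manifest + documents
```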

> The idea of not trusting PDAP-controlled storage is a bit concerning to me. Hopefully, you mean you don't trust what is stored there, which is more reasonable.

Yeah, sorry about that. "Not trusting storage" has implications I didn't mean to convey. From an insider risk (or just misconfiguration) angle, I'm hesitant to assume that nothing in our storage has been tampered with. I think what I'm mostly shooting for is a model where we largely trust our storage and its contents, but are able to scalably verify that they are trustworthy.

> EDIT: Just as an additional note, security is not really my wheelhouse.

My present work is security adjacent, but this is still largely not my wheelhouse either. 😄 I may pop a link to this in the infosec chat, since the discussion surrounding this has drifted into that territory. On the topic of things that aren't wheelhouses, I'm not well-versed in what the actual landscape looks like here (e.g., is there something in AWS that makes some aspect of this moot?), so there could be some assumptions I'm incorrectly making or not making.

> If we can't trust what is running within our infrastructure, how can we trust the infrastructure managing JWT tokens and their authentication?

We have to root our trust somewhere, but we can confine it to something that is extremely locked down and audited (such as the key infra). The strawman in the appendix addresses some of this, but let's say:

  1. the fetcher generates a keypair (the private key stays in memory, so it's lost on process exit)
  2. the fetcher registers the public key and provenance info (repo, entrypoint, commit, etc.) with our infra, which logs that data in a database (we'll call this abstract thing the keydb) and returns a JWT bearer token for auth
  3. manifests are signed with the generated key
  4. the upload endpoint ensures the JWT is legitimate, then validates the manifest's signature against the public key in the bearer token
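
Roughly, in code (a sketch only: `register_key` and the keydb are stand-ins for infra that doesn't exist yet, and Ed25519 is just one reasonable choice of signature scheme):

```python
import json
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# 1. Generate a keypair. The private key lives only in process memory.
private_key = Ed25519PrivateKey.generate()
public_bytes = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw,
    format=serialization.PublicFormat.Raw,
)

# 2. Register the public key + provenance with the keydb, get a JWT back.
#    register_key() is hypothetical; it stands in for the keydb endpoint.
provenance = {"repo": "...", "entrypoint": "...", "commit": "..."}
# bearer_token = register_key(public_bytes, provenance)

# 3. Sign the (deterministically serialized) manifest with the generated key.
manifest_bytes = json.dumps({"documents": []}, sort_keys=True).encode()
signature = private_key.sign(manifest_bytes)

# 4. Upload manifest + signature with the bearer token; the endpoint checks
#    the JWT, then verifies the signature against the registered public key.
```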

> I'd love to learn more and understand how this makes it safer, or if the goal is more being able to track exactly where data came from and what processed it.

I'm shooting for both.

When it comes to safety, the following scenarios can play out if we get owned:

  • manifest/doc storage compromised:
    • action: attacker manages to get write access to our manifest/document store and changes a bunch of things
    • response: we'd verify the signatures of all manifests against what we have in the keydb and find anything that was inserted or tampered with (no solution for deletions)
  • fetcher environment compromised:
    • action: attacker can run code in the fetcher environment to upload bad record data
    • response: we can identify the key used to upload the bad data and revoke the key in the keydb, causing verifications to fail
  • fetcher's key material leaks:
    • action: a fetcher's private key is exfil'd and it's no longer actually private
    • response: revoke the key in the keydb, causing verifications to fail
  • keydb is compromised:
    • action: attacker can write arbitrarily to the keydb
    • response: if you can identify the tampering, revoke the relevant entries. If not, blow it away, restore from more trusted backups, re-fetch what you need to re-fetch.
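
The verification sweep for that first scenario might look something like this (the manifest and keydb record shapes here are hypothetical):

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def audit(manifests, keydb):
    """Sketch of the 'storage compromised' response: re-verify every stored
    manifest against the keydb, yielding anything that doesn't check out."""
    for m in manifests:
        entry = keydb.get(m.key_id)
        if entry is None or entry.revoked:
            yield m, "unknown or revoked signing key"
            continue
        public_key = Ed25519PublicKey.from_public_bytes(entry.public_bytes)
        try:
            public_key.verify(m.signature, m.manifest_bytes)
        except InvalidSignature:
            yield m, "signature mismatch: inserted or tampered with"
```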

As for tracking origin, this would allow us to confidently say "this data came from this scraper at this revision and this time", and then other things can build off of that (with processes which are ideally also verifiable).

Thoughts?
