freelawproject / recap

This repository is for filing issues on any RECAP-related effort.
https://free.law/recap/

Make API and begin counting usage stats of the other reports on PACER #246

Open mlissner opened 6 years ago

mlissner commented 6 years ago

Right now, we can handle it if people use the docket report or the docket history report, but we pretty much ignore all the other reports. We should just set up dumb APIs for each of them so we can start figuring out how much people use them. The API doesn't have to do anything — just log that it was hit.
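A minimal sketch of what such a "dumb" endpoint could record, assuming a Python backend. The `PACER_REPORTS` mapping, the report labels, and the `record_hit` helper are hypothetical names for illustration, and the two script filenames are only examples of PACER `/cgi-bin/` paths, not a complete list:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping from PACER CGI script names to report labels.
# Illustrative entries only; the real set of reports is much larger.
PACER_REPORTS = {
    "DktRpt.pl": "docket_report",
    "HistDocQry.pl": "docket_history_report",
}

# In-memory tally; a real service would write to a log or database.
hits: Counter = Counter()


def record_hit(url: str, counter: Counter = hits) -> str:
    """Log that a PACER report page was visited; do nothing else."""
    # The last path segment is the script name; the query string is ignored.
    script = urlparse(url).path.rsplit("/", 1)[-1]
    report = PACER_REPORTS.get(script, f"other:{script}")
    counter[report] += 1
    return report
```

Anything that isn't in the mapping is tallied under an `other:` label, which is the part that would show which unsupported reports people actually use.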

We could probably even do this, if we really wanted to, by using Piwik (our web analytics tool). It has an API for things like this. If it had another site for uscourts.gov, I bet we could just embed its script on these pages and get the info we need. Or we could roll something simple ourselves.

This would be nice to have though, for sure.

mlissner commented 1 year ago

Well, we're using Plausible now, and I kind of feel like doing this isn't a horrible idea. It looks like all we'd have to do is set up a new site in Plausible, then include their tracking snippet via the extension: https://plausible.io/docs/plausible-script
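As a rough sketch of the same idea without the script snippet, Plausible also documents an events API that accepts a JSON pageview via `POST /api/event`. The example below only builds the request and does not send it. The `SITE_DOMAIN` value and both helper names are hypothetical, and dropping the query string is an assumption made here to keep case identifiers out of the reported URL, not something the snippet itself does:

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

PLAUSIBLE_ENDPOINT = "https://plausible.io/api/event"
SITE_DOMAIN = "uscourts.gov"  # hypothetical site name configured in Plausible


def build_pageview(url: str) -> dict:
    """Build a pageview payload, dropping the query string and fragment."""
    parts = urlsplit(url)
    clean_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return {"name": "pageview", "url": clean_url, "domain": SITE_DOMAIN}


def build_request(url: str, user_agent: str) -> Request:
    """Prepare (but do not send) the POST request for one pageview."""
    body = json.dumps(build_pageview(url)).encode("utf-8")
    headers = {"Content-Type": "application/json", "User-Agent": user_agent}
    return Request(PLAUSIBLE_ENDPOINT, data=body, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` would actually send the event; it is left out here so the sketch has no side effects.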

We'd want to be careful about how this affects our privacy policy and whether people would find this upsetting. Nobody likes being tracked, but the nice thing about Plausible is that it is really just a stats system. We'd see the URLs people went to on PACER, and their usage stats, but not who went to those URLs or even where that person went next. It doesn't even use cookies.

Here's the full list of what it collects:

mlissner commented 1 year ago

I posted on Twitter about this here: https://twitter.com/FreeLawProject/status/1587560343240249345. We'll see if it gets much response.

dkg commented 1 year ago

i agree with the overall goal of wanting to collect this kind of information for the RECAP project. Finding "hot spots" that are currently not covered but should be prioritized is useful. Maybe it would be helpful, though, to talk a bit more concretely about what kind of risks this collection might pose to the users of the RECAP extension.

One way of approaching this question is a thought experiment: if Plausible were hacked and the data it has access to were leaked to a malicious adversary, what could be learned about the RECAP users? For example, if Alice knows Bob's lawyer's IP address, and she gets access to the data available to Plausible, could she learn something about Bob's legal strategy in an upcoming case that Bob's lawyer should have been protecting?

I'm a bit surprised to hear that RECAP is using a third-party cloud analytics platform in the first place, though -- if you've got a piwik instance up why not just send data there, so that RECAP retains full control over the data?

The list of things collected doesn't look quite right -- maybe you mean things retained? i'd assume that the things "collected" by Plausible are the full HTTP headers and body, and the IP metadata of the connection (e.g. client IP address). Then Plausible probably converts the collected IP address via some geoip thing to Country, Region, City; and it then converts the User-Agent header string to "Browser, Operating System and Device Type"

It's great that no cookies are necessary, but it's probably also worth noting that if they wanted to, Plausible could presumably also group submitted URLs by TLS session, so even without cookies they have a potential line into some amount of cross-URL linkability, as long as the user's browser is reusing TLS sessions (e.g. via session tickets), which is pretty common for efficiency in both computation and latency.

mlissner commented 1 year ago

> I'm a bit surprised to hear that RECAP is using a third-party cloud analytics platform in the first place, though -- if you've got a piwik instance up why not just send data there, so that RECAP retains full control over the data?

Yeah, we used ~Piwik~ Matomo for about ten years, but we stopped using it earlier this year. At our scale, it looked like it would cost around $500/month to run it, just in AWS fees, and then it was also a constant hassle to keep it up and running. We figured Plausible would be more secure since it'd be updated by professionals, it'd be cheaper, and it'd be better for users, since Plausible collects so much less data. More details here: https://github.com/freelawproject/courtlistener/issues/1962

Anyhow, for your hypothetical:

> For example, if Alice knows Bob's lawyer's IP address,

I suppose this is feasible using an image in an email sent to Bob's lawyer, that Bob's lawyer's email client downloaded, so I'm with you so far...

> and she gets access to the data available to Plausible

This seems extremely unlikely, but I'll grant it for the thought experiment...

> Could she learn something about Bob's legal strategy in an upcoming case...

Yes, I think so, but even still, not really? Alice could learn the case documents that Bob's lawyer was downloading and maybe the dockets too, but I don't think that's really enough to give much of an edge to Alice.

> ...that Bob's lawyer should have been protecting?

I'm no expert on the data-protection duties lawyers have, but I'm pretty sure that using online tools is allowed. I'm not sure how this is different from using Gmail, which lawyers do, and having Gmail get hacked. That'd reveal a ton, but I don't think that people would say to Bob's lawyer that he "should have been protecting" something differently.

> The list of things collected doesn't look quite right -- maybe you mean things retained?

Yeah, "collected" is their word from the policy linked above. They should probably be more clear, but I agree, they surely get the IP addresses and HTTP headers, it's just that they throw them away.

They say:

> The raw data IP address and User-Agent are never stored in our logs, databases or anywhere on disk at all.

So if you're Alice and you hacked their server and wanted to do bad things, you could watch the TCP/IP traffic fly by, but at least old stuff isn't around. So you could monitor that, at least until Plausible fixed the hack (which could be a while, not denying that).

> It's great that no cookies are necessary, but it's probably also worth noting that if they wanted to, Plausible could presumably also group submitted URLs by TLS session,

True. I think the point here is that they don't and that their whole brand is focused on not doing things like this. I think the answer here is that we are inviting a trusted third party into the conversation, and we have to trust that they can keep things secure and that they won't ruin their brand by doing invasive things. Of course, if we don't trust that, then we shouldn't do this.

But I also think that generally, short of Plausible being hacked, the risk is really low here. Even if it is hacked by a litigant, all historical data is anonymous, so even in that extreme case, the only risk is Alice discovering Bob's lawyer's URLs until she gets kicked out of Plausible's servers. Seems like an acceptable risk to me.

Thanks for the thought experiments. I've got my rebuttal hat on, but happy to continue the conversation.

dkg commented 1 year ago

Thanks for the thoughtful considerations! I didn't mean to put you into "rebuttal" mode, as i'm not arguing for or against the feature. There are obviously benefits to RECAP to try to collect this kind of data, i'm just trying to understand what the tradeoffs are. Your responses are useful in sorting that out.

fwiw, "Plausible being hacked" is a circumstance that seems unlikely in general for any given user of RECAP, but the more interesting and sensitive data that Plausible has access to, and the wider range of adversaries that we consider (including all adversaries of every lawyer with the RECAP extension installed), the higher the probability that someone with Plausible who has sysadmin keys has an incentive to misbehave :/ Same is true for AWS, for that matter, if that's where your Matomo instance was hosted.

I agree that this does seem comparable to a lawyer using some untrustworthy e-mail provider, and that many lawyers do that, but if the bar is just "don't be any worse than the sloppiest lawyer" then it is a low bar indeed -- and a reason for every other player to never fix their setup ("why bother improving? even RECAP leaks this kind of information!")

As for the use of cookies or the TLS session re-establishment cookie-like feature for linkability, that ought to be something that could be controlled (or at least detected) on the client side, in the browser extension. If Plausible says that they don't use that kind of tracking, then maybe there's a way for the extension to confirm or enforce that claim? (e.g. directing the browser to use "Incognito"/"Private Browsing" mode for connections to Plausible) Maybe that's something worth asking Plausible if they have a standard way to do it?

> The raw data IP address and User-Agent are never stored in our logs, databases or anywhere on disk at all.

This is great to hear!

My one remaining question is: if running Matomo (or the equivalent) cost RECAP ~$500/month, and Plausible's servers are at least comparably expensive, and RECAP's use of Plausible costs less than $500/month, where is the difference coming from? If Plausible is operating at a loss, then it might not be as stable or reliable as you would like. If Plausible has some sort of operating efficiency that covers the entire difference, it would be good to know what that is. And if Plausible is instead monetizing the data of their customers (or their customers' users), then it would be good to understand what that practice is.

mlissner commented 1 year ago

> My one remaining question is: if running Matomo (or the equivalent) cost RECAP ~$500/month, and Plausible's servers are at least comparably expensive, and RECAP's use of Plausible costs less than $500/month, where is the difference coming from?

Yeah, I don't know, but Matomo uses a lot more resources than I'd expect. I've never understood why. It's complicated and supports a lot of features that Plausible doesn't, but I could still never understand why it took so much oomph.

> the higher the probability that someone with Plausible who has sysadmin keys has an incentive to misbehave

Yeah, sure, that's true, but again, you really can't get much this way. It's a bit like breaking into an empty vault. Big negative consequences if you're caught, not much inside.

> That ought to be something that could be controlled (or at least detected) on the client side, in the browser extension

An interesting idea. I looked around a bit in the webextension docs, but couldn't find anything like this.

mlissner commented 1 year ago

I let this one simmer for a couple months and I'm re-reading it. I think I'm less enthused about it than I was, but if we enable it, we should have an opt-in config for it in the settings. I think some people might opt into that and we might get some useful info that way.
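A minimal sketch of what an opt-in gate could look like, written in Python for illustration only (the extension itself would do this in its own settings code). The `Settings` class, the `share_usage_stats` option name, and `maybe_report` are all hypothetical:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Settings:
    # Hypothetical option name; off unless the user turns it on.
    share_usage_stats: bool = False


def maybe_report(url: str, settings: Settings, send: Callable[[str], object]) -> bool:
    """Call `send(url)` only when the user has opted in."""
    if not settings.share_usage_stats:
        return False
    send(url)
    return True
```

Defaulting the flag to `False` means nothing leaves the browser until someone explicitly enables it.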

For now though, I think I'm pretty happy with what RECAP supports, so I don't feel a ton of need to do this. I'm going to put it on the back burner again.