freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Sensitive data in opinions #742

Closed saizai closed 1 year ago

saizai commented 7 years ago

I ran Google's Data Loss Prevention (DLP) API on the opinions table.

The results are in a table which Mike has access to.

Here're the overall stats, sifted down to exclude hits for states and exclude obvious false positive fields like dates and sha1. "prob" is the DLP API's likelihood rating. "ops" is # unique opinions. "n/op" is # separate hits per opinion (which can double count eg between the text and various html fields; I've not tried to dedupe that).
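To make the "ops" and "n/op" columns concrete, here is a minimal sketch of how they could be computed from raw hit rows. The row layout (opinion id, field, info type, likelihood, matched quote) and the sample values are assumptions for illustration, not the actual schema of the results table. The optional `dedupe` flag shows one possible way to avoid double counting the same match across the text and HTML fields.

```python
from collections import defaultdict

# Hypothetical hit rows: (opinion_id, field, info_type, likelihood, quote).
# The quotes are obviously fake placeholders, not real identifiers.
SAMPLE_HITS = [
    (1, "plain_text", "US_SOCIAL_SECURITY_NUMBER", 5, "000-00-0001"),
    (1, "html", "US_SOCIAL_SECURITY_NUMBER", 5, "000-00-0001"),
    (2, "plain_text", "US_SOCIAL_SECURITY_NUMBER", 5, "000-00-0002"),
]


def summarize(hits, dedupe=False):
    """Return (ops, n_per_op, likelihood, info_type) rows, one per type/likelihood pair."""
    # (info_type, likelihood) -> opinion_id -> list of matched quotes
    groups = defaultdict(lambda: defaultdict(list))
    for opinion_id, _field, info_type, likelihood, quote in hits:
        groups[(info_type, likelihood)][opinion_id].append(quote)

    rows = []
    for (info_type, likelihood), by_opinion in sorted(groups.items()):
        ops = len(by_opinion)  # number of unique opinions with at least one hit
        if dedupe:
            # Count each distinct quote once per opinion, regardless of field.
            total = sum(len(set(quotes)) for quotes in by_opinion.values())
        else:
            # Count every hit row, so text/HTML duplicates inflate the number.
            total = sum(len(quotes) for quotes in by_opinion.values())
        rows.append((ops, round(total / ops, 2), likelihood, info_type))
    return rows
```

With the sample rows, the undeduplicated n/op is 1.5 and the deduplicated one is 1.0, which is the kind of inflation the caveat about double counting describes.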

From spot checking, it seems like about half of these hits are legit. It'll need manual verification.

A lot of this seems to me to be stuff that really ought to be redacted — preferably at the source (i.e. the courts). I'm not sure how to go about getting them to do that. Maybe talk to the Judicial Conference or AOUSC? Or the various circuits' administrative arms?

Also, the people affected should probably be contacted (and surveyed) before this is publicly disclosed.

ops n/op    prob    type
13  8.31    5   AMERICAN_BANKERS_CUSIP_ID
6   3.33    4   AMERICAN_BANKERS_CUSIP_ID
4846    2.4 3   AMERICAN_BANKERS_CUSIP_ID
405 2.13    2   AMERICAN_BANKERS_CUSIP_ID
91  1.95    3   AUSTRALIA_MEDICARE_NUMBER
542 2.09    2   AUSTRALIA_MEDICARE_NUMBER
253 2.16    1   AUSTRALIA_MEDICARE_NUMBER
549 3.89    3   AUSTRALIA_TAX_FILE_NUMBER
3268    5.49    2   AUSTRALIA_TAX_FILE_NUMBER
703 2.22    1   AUSTRALIA_TAX_FILE_NUMBER
216 7.4 2   BRAZIL_CPF_NUMBER
74  2.61    4   CANADA_BC_PHN
4448    2.81    3   CANADA_BC_PHN
591 2.6 2   CANADA_BC_PHN
2   2   3   CANADA_OHIP
363 2.01    2   CANADA_OHIP
615 2.75    1   CANADA_OHIP
1   2   4   CANADA_PASSPORT
12941   3.42    1   CANADA_PASSPORT
876 7.94    4   CANADA_QUEBEC_HIN
1   4   2   CANADA_QUEBEC_HIN
1936    2.75    3   CANADA_SOCIAL_INSURANCE_NUMBER
338 2.87    2   CANADA_SOCIAL_INSURANCE_NUMBER
612 2.69    1   CANADA_SOCIAL_INSURANCE_NUMBER
1   3   2   CHINA_PASSPORT
22  2.77    1   CHINA_PASSPORT
47  4.98    5   CREDIT_CARD_NUMBER
188 3.35    4   CREDIT_CARD_NUMBER
443 3.03    3   CREDIT_CARD_NUMBER
248 1.5 2   CREDIT_CARD_NUMBER
153 2.8 1   CREDIT_CARD_NUMBER
14800   2.32    5   EMAIL_ADDRESS
74  1.77    4   EMAIL_ADDRESS
162 1.64    3   EMAIL_ADDRESS
1   4   4   FRANCE_PASSPORT
6069    2.83    2   FRANCE_PASSPORT
762 2.41    1   FRANCE_PASSPORT
1   2   5   IBAN_CODE
5   2   3   IBAN_CODE
348 2.35    2   IBAN_CODE
4   1.5 1   IBAN_CODE
1   2   5   IMEI_HARDWARE_ID
63  2.02    4   IMEI_HARDWARE_ID
737 2.5 3   IMEI_HARDWARE_ID
117 2.75    2   IMEI_HARDWARE_ID
15  1.67    4   INDIA_PAN_INDIVIDUAL
210 4.18    5   IP_ADDRESS
2441    6.89    4   IP_ADDRESS
10  4.3 3   IP_ADDRESS
3   4   2   IP_ADDRESS
73  4.81    1   JAPAN_INDIVIDUAL_NUMBER
2   2   4   JAPAN_PASSPORT
50  2.16    2   JAPAN_PASSPORT
1089    3.59    1   JAPAN_PASSPORT
1   2   4   KOREA_PASSPORT
51  2.16    2   KOREA_PASSPORT
1116    3.59    1   KOREA_PASSPORT
1   1   3   KOREA_RRN
30  2.03    2   KOREA_RRN
2   2   1   KOREA_RRN
1   6   5   MAC_ADDRESS
23  2.43    2   MAC_ADDRESS
1   2   1   MAC_ADDRESS
17  2.06    2   MAC_ADDRESS_LOCAL
21459   7.24    1   MEXICO_PASSPORT
2   2   5   NETHERLANDS_BSN_NUMBER
1   2   4   NETHERLANDS_BSN_NUMBER
2126    2.53    2   NETHERLANDS_BSN_NUMBER
14300   2.66    1   NETHERLANDS_BSN_NUMBER
2837    3.38    5   PHONE_NUMBER
18906   4.59    4   PHONE_NUMBER
181491  6.01    3   PHONE_NUMBER
512566  5.29    2   PHONE_NUMBER
2113    2.87    1   PHONE_NUMBER
18  2.5 3   SPAIN_NIF_NUMBER
1   2   3   SPAIN_PASSPORT
11808   3.49    2   SPAIN_PASSPORT
12817   3.43    1   SPAIN_PASSPORT
32  2.06    5   SWIFT_CODE
4088    2.4 4   SWIFT_CODE
1   2   4   UK_DRIVERS_LICENSE_NUMBER
83  4.75    4   UK_NATIONAL_HEALTH_SERVICE_NUMBER
92  5.08    3   UK_NATIONAL_HEALTH_SERVICE_NUMBER
39  2.15    2   UK_NATIONAL_HEALTH_SERVICE_NUMBER
302 2.79    4   UK_NATIONAL_INSURANCE_NUMBER
83567   2.84    3   UK_NATIONAL_INSURANCE_NUMBER
1172    3.27    2   UK_NATIONAL_INSURANCE_NUMBER
4   1.5 1   UK_NATIONAL_INSURANCE_NUMBER
5   3.6 4   UK_PASSPORT
5158    2.95    1   UK_PASSPORT
4861    2.84    1   UK_TAXPAYER_REFERENCE
25  2.44    5   US_BANK_ROUTING_MICR
867 2.21    3   US_BANK_ROUTING_MICR
53  3.68    2   US_BANK_ROUTING_MICR
7   2.57    5   US_DEA_NUMBER
23  3.48    4   US_DEA_NUMBER
499 2.48    5   US_DRIVERS_LICENSE_NUMBER
8431    1.6 4   US_DRIVERS_LICENSE_NUMBER
81866   3.85    2   US_DRIVERS_LICENSE_NUMBER
1224    3.69    1   US_DRIVERS_LICENSE_NUMBER
853 2.26    4   US_HEALTHCARE_NPI
294 7.07    4   US_PASSPORT
5082    2.58    1   US_PASSPORT
40  2.35    5   US_SOCIAL_SECURITY_NUMBER
11  2.09    4   US_SOCIAL_SECURITY_NUMBER
123 2.28    3   US_SOCIAL_SECURITY_NUMBER
3355    2.53    2   US_SOCIAL_SECURITY_NUMBER
1110    3.76    1   US_SOCIAL_SECURITY_NUMBER
235 2.72    5   US_TOLLFREE_PHONE_NUMBER
2070    3.17    4   US_TOLLFREE_PHONE_NUMBER
18  2.5 3   US_TOLLFREE_PHONE_NUMBER
91  3.91    2   US_TOLLFREE_PHONE_NUMBER
681 3.89    5   US_VEHICLE_IDENTIFICATION_NUMBER
70  2.57    4   US_VEHICLE_IDENTIFICATION_NUMBER
mlissner commented 7 years ago

I'm not sure how to proceed here. I took a brief look at some of these.

Here's a query that should show SSN numbers (requires login/access): https://bigquery.cloud.google.com/results/make-transcriptions:bquijob_3efd5db1_15f25ffb3b3?pli=1

And a description of the fields: https://cloud.google.com/dlp/docs/infotypes-reference

Here's the first result it found that supposedly has an SSN:

https://www.courtlistener.com/opinion/2942992/a/

Which contains the text:

Appellant was convicted of murder in Cause No. 322040411 in the 411th District

That's the SSN that it found. Obviously a false positive. Here's the next result:

https://www.courtlistener.com/opinion/464738/a/

The "SSN" in that result is:

Account No. 021030004 from the U.S. Treasury Department, Federal Reserve Bank, New York City

Again, that's not an SSN. It's a bank account number and I don't think those are private (they're written on the bottom of checks).

I briefly checked a few others too, and came up similarly blank.

At the same time, it looks like you have hundreds of thousands of hits here — way more than we could review without funding. I think if we're going to make progress on this problem we need to:

  1. Triage which of the above codes matter. Do we care if there are phone numbers in the opinions, for example? It's not great...but not a huge deal either.

  2. Is there anything we can do to eliminate these false positives? Does Google have a "strict matches only" mode or something? I imagine if we find the pattern \d\d\d-\d\d-\d\d\d\d that will be much more productive than simply looking for any nine digits (which is what it apparently does now).
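As a rough illustration of the second point, the sketch below contrasts a strict dashed pattern with a loose "any nine digits" pattern. Both regexes and the function names are assumptions made for this example; how the DLP API actually matches is not documented in this thread.

```python
import re

# Strict: three digits, two digits, four digits, separated by dashes,
# and not embedded in a longer run of digits or dashes.
STRICT_SSN = re.compile(r"(?<![\d-])\d{3}-\d{2}-\d{4}(?![\d-])")

# Loose: any run of exactly nine digits (an assumed stand-in for the
# behavior that produced the false positives quoted above).
LOOSE_NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")


def find_strict(text):
    """Return dashed SSN-shaped strings found in text."""
    return STRICT_SSN.findall(text)


def find_loose(text):
    """Return every standalone nine-digit run found in text."""
    return LOOSE_NINE_DIGITS.findall(text)
```

On the two quoted snippets, the loose pattern flags the cause number and the account number, while the strict pattern flags neither. The trade-off is that the strict pattern would miss a real SSN written without dashes.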

anseljh commented 7 years ago

Super interesting data!

Some of these probably would not need to be redacted, or at least I doubt there's a reasonable way to automate it:

This doesn't actually narrow it down as much as I'd originally thought, but there are some values that can't be SSNs: https://www.codeproject.com/Articles/651609/Validating-Social-Security-Numbers-through-Regular.
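A minimal sketch of that kind of check, using the well-known structural rules for SSNs (area 000, 666, and 900 to 999 are never assigned; group 00 and serial 0000 are never assigned). The function name is hypothetical, and this is not the code from the linked article.

```python
import re

# Accept nine digits with or without the two dashes.
_SSN_SHAPE = re.compile(r"^(\d{3})-?(\d{2})-?(\d{4})$")


def could_be_ssn(candidate):
    """Return False only for values that cannot be an SSN on structural grounds."""
    match = _SSN_SHAPE.match(candidate.strip())
    if match is None:
        return False
    area, group, serial = match.groups()
    if area in ("000", "666") or int(area) >= 900:
        return False  # area numbers that are never assigned
    if group == "00" or serial == "0000":
        return False  # all-zero group or serial is never assigned
    return True
```

Both false positives quoted earlier in the thread (322040411 and 021030004) pass this check, which illustrates why it does not narrow things down much.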

johnhawkinson commented 7 years ago

I have mixed opinions about all of this, but I tend to think the "problem" of confidential data in court opinions is not a real problem in the way that it is in non-opinion filings where counsel have neglected their obligations under Rule 5.2 and their ethical duty to their clients. Each time a bit of potentially private data is included in an opinion, it's because a judge thought it was there to tell the story.

I tend to think that, absent a showing to the contrary, mass redaction of this type of data is not appropriate -- it's in the opinion and it's public. Getting it out of CourtListener wouldn't get it out of the Court's website and PACER and FDSYS and WestLaw, and so has limited value.

As Ansel suggests, some of this isn't necessarily private, either functionally or realistically so. Email addresses just aren't inherently private. But on the other hand, I did recently take pause when I became a party in a case and realized consequently my email, mailing address, and phone number all became public (they're already public; but for some people they might not be). There's a difference between having them "public" in PACER behind a paywall and having them public for free and indexed by Google, et al. There is an argument that says maybe party information should be indexed with some care, maybe especially for non-incarcerated pro se litigants (who don't make the same choice as attorneys do; in some courts pro se litigants' addresses are redacted from opinions). I don't think this argument carries the day, but it's worth thinking about it.

Mike writes:

Again, that's not an SSN. It's a bank account number and I don't think those are private (they're written on the bottom of checks).

Umm. Bank account numbers are definitely sensitive, and the fact that they're at the bottom of checks is not much guidance to say they are not private. In particular, if anyone knows your bank account number and routing number (easy to determine from the bank name, if not already known), they can write checks against your account or execute ACH transfers against the account. Theoretically checks would be signature-verified by the bank, but not so with ACH. So there is some sensitivity associated with that information.

But in general, when we see much of this information in opinions, where it is part of the narrative (and not the contact information for counsel of parties), it's part of the story and presumptively invalid by the time the opinion comes out. So opinions about cell phone location information may include the cell phone number. It's probably long deactivated, usually years by the time the opinion rolls out. Similarly true for IP addresses, although in that case it's probably been recycled and is being used by someone (or multiple people, more likely). Also true for email addresses -- we get opinions that include an email address when there's talk of how a search of an email account led to evidence. Even in the relatively rare case where the opinion leads to acquittal, the person has probably stopped using the address in the meantime, and almost certainly so for convictions. (Most of this is in criminal litigation, not civil).

I don't think there's any appreciable risk from disclosing MAC addresses. It's pretty hard to get on the same local area network as a particular MAC address, and even if you could, advance knowledge of the MAC address doesn't give you an appreciable leg up.

US 800 numbers? They might as well be service marks. They seem fine.

I'm not sure of the exposure of things like VINs. I tend to think they are likely irrelevant by the time the opinion comes down.

Etc., etc.

mlissner commented 7 years ago

I'm not sure of the exposure of things like VINs. I tend to think they are likely irrelevant by the time the opinion comes down.

I generally agree, but just to be contrarian (and because it's ridiculous), good old Nissan was using the VIN as the API password or some similarly stupid thing not too long ago: https://www.troyhunt.com/controlling-vehicle-features-of-nissan/

I find your arguments compelling on the whole. Sounds like your stance boils down to, "Do nothing."

I think there are probably some instances that are bad (actual SSNs in opinions, for example), but I don't know how we'd find them beyond what we're already doing. (We already look for these, X them out, and make sure that they're not searchable.)

Taking a more careful look, I think these can be broken down into a few categories:

Exceedingly likely to be false positives (mostly international stuff)

13  8.31    5   AMERICAN_BANKERS_CUSIP_ID
253 2.16    1   AUSTRALIA_MEDICARE_NUMBER
549 3.89    3   AUSTRALIA_TAX_FILE_NUMBER
216 7.4 2   BRAZIL_CPF_NUMBER
74  2.61    4   CANADA_BC_PHN
615 2.75    1   CANADA_OHIP
12941   3.42    1   CANADA_PASSPORT
1   4   2   CANADA_QUEBEC_HIN
1936    2.75    3   CANADA_SOCIAL_INSURANCE_NUMBER
1   3   2   CHINA_PASSPORT
1   4   4   FRANCE_PASSPORT
1   2   5   IMEI_HARDWARE_ID
15  1.67    4   INDIA_PAN_INDIVIDUAL
73  4.81    1   JAPAN_INDIVIDUAL_NUMBER
1089    3.59    1   JAPAN_PASSPORT
1116    3.59    1   KOREA_PASSPORT
2   2   1   KOREA_RRN
21459   7.24    1   MEXICO_PASSPORT
2   2   5   NETHERLANDS_BSN_NUMBER
18  2.5 3   SPAIN_NIF_NUMBER
1   2   3   SPAIN_PASSPORT
32  2.06    5   SWIFT_CODE
1   2   4   UK_DRIVERS_LICENSE_NUMBER
83  4.75    4   UK_NATIONAL_HEALTH_SERVICE_NUMBER
302 2.79    4   UK_NATIONAL_INSURANCE_NUMBER
5   3.6 4   UK_PASSPORT
4861    2.84    1   UK_TAXPAYER_REFERENCE

Meh (not sure it matters)

47  4.98    5   CREDIT_CARD_NUMBER
14800   2.32    5   EMAIL_ADDRESS
210 4.18    5   IP_ADDRESS
1   6   5   MAC_ADDRESS
17  2.06    2   MAC_ADDRESS_LOCAL
2113    2.87    1   PHONE_NUMBER
91  3.91    2   US_TOLLFREE_PHONE_NUMBER
681 3.89    5   US_VEHICLE_IDENTIFICATION_NUMBER

Somewhat problematic/unsure

1   2   5   IBAN_CODE
53  3.68    2   US_BANK_ROUTING_MICR
23  3.48    4   US_DEA_NUMBER
499 2.48    5   US_DRIVERS_LICENSE_NUMBER
853 2.26    4   US_HEALTHCARE_NPI
294 7.07    4   US_PASSPORT
40  2.35    5   US_SOCIAL_SECURITY_NUMBER
anseljh commented 7 years ago

I like this breakdown! And 40 opinions with possible SSNs is not that terrible to review manually.

If the rate is that low, then maybe this could be run on incoming opinions, and alerts sent to us for hits on US_SOCIAL_SECURITY_NUMBER.

johnhawkinson commented 7 years ago

I like this breakdown! And 40 opinions with possible SSNs is not that terrible to review manually.

Yeah. So…

https://bigquery.cloud.google.com/results/make-transcriptions:bquijob_3efd5db1_15f25ffb3b3?pli=1 40 2.35 5 US_SOCIAL_SECURITY_NUMBER

Doesn't work for me (access restriction). But email me the list and I'll look through ten of the 40 and summarize them.

saizai commented 7 years ago

@mlissner I don't have access to that BQ job.

A couple notes:

  1. use opinions_dlp_slimmed — in that table I dropped US_STATE and all hits in path/url, sha1, and date fields.

  2. I'm not sure how the numeric probability (1…5) maps to the values at https://cloud.google.com/dlp/docs/likelihood — it doesn't seem to be documented. However, my guess is that 5 is most likely and 1 is least, just based on the distributions — there ought to be more less-likely hits than more-likely ones.

  3. I've reported a few issues w/ google about the DLP API: https://issuetracker.google.com/issues/67835464 https://issuetracker.google.com/issues/67837023 https://issuetracker.google.com/issues/67837081

@mlissner

Again, that's not an SSN. It's a bank account number and I don't think those are private (they're written on the bottom of checks).

It is. FRCP 5.2(a) "financial-account number".

Checks are incredibly stupid from a security POV. Routing # + account # + some contact info is all you need to authorize an electronic check in many situations. But that's how it is.

Does Google have a "strict matches only" mode or something?

That's what the likelihood (prob) number is. 1…5.

@anseljh

EMAIL_ADDRESS: Almost any pleading will have this on the caption page for the attorneys' phone numbers. PHONE_NUMBER: Same; but for some documents it'd make sense to redact. You'd have to evaluate on a case-by-case basis, which doesn't seem reasonable.

This is on the opinions table, which does not have pleadings or docket info. These hits are all from the actual published opinion. Attorney contact info should not be in the published opinion.

IP_ADDRESS: Sometimes redacted, sometimes definitely not. Comes up a lot in subpoenas and discovery motions about them, in my own experience.

The hits are not redacted.

US_VEHICLE_IDENTIFICATION_NUMBER: I'd guess these come up a lot in asset forfeiture cases, to identify seized vehicles. Not sure.

I believe that's accurate.

This doesn't actually narrow it down as much as I'd originally thought, but there are some values that can't be SSNs: https://www.codeproject.com/Articles/651609/Validating-Social-Security-Numbers-through-Regular.

I believe all SSN hits reported survive that test.

@johnhawkinson

Getting it out of CourtListener wouldn't get it out of the Court's website and PACER and FDSYS and WestLaw, and so has limited value.

I agree with that part. I don't think there's much value in redacting CL but not the other sources. That's why I talked about going for redaction at the source.

This is also part of why I think bulk notification and survey would be appropriate — both to the parties on the docket, and to whoever can be identified from the data itself. If someone's info is disclosed and they don't want it to be, they can (a) tell us so and (b) file a motion with the court to seal / redact.

Email addresses just aren't inherently private.

Tell that to the customers of any company that suffers a data breach disclosing emails. :-P

Also, as above, this is not the docket info. It's the opinion itself. Yes, your contact info needs to be available on the docket so that people can contact you about the case — but why should it be in the published opinion?

We don't currently know how many of the emails / phones / etc in these hits belong to parties, rather than to someone else. That's something that can be checked to some extent by joining against the RECAP docket metadata.

But it seems rather unusual to me that a judge would mention that info in the text of an opinion. It wouldn't happen just in the ordinary course of a case.

I don't think there's any appreciable risk from disclosing MAC addresses.

It's pretty useful for impersonating you on a network, and other various blackhat uses.

I'm not sure of the exposure of things like VINs. I tend to think they are likely irrelevant by the time the opinion comes down.

For one, note that VINs are unchangeable. They're permanently part of the vehicle, in multiple places, both sticker and metal-engraved. Not like a phone number or email you can change, at least not if you want to keep the car.

I don't know how vulnerable VINs are as an ID theft tool, however. I would guess that a VIN and some additional cleverness could get you into someone's DMV account, but I'm not sure. It could be used to help forge a bill of sale or the like.

It can be used for various database searches to dig up info on someone, if you have the right access. (I personally did so when I was a judicial extern — a party's assets were relevant to a case I was reviewing, so part of my research involved e.g. cross-checking their home & vehicle ownership registry info. Lawful access, using Westlaw, can give a remarkable amount of information.)

@mlissner

One way to validate your false-positives intuition would be to do a random sampling from each category and likelihood number (e.g. 10 per) and see how many are true vs false positives. I'd be hesitant to just guess. If we can translate category/likelihood tuples into hit %, then we have something that's much more workable for triage.
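The sampling idea could be sketched roughly as follows. The dictionary keys `info_type` and `likelihood`, the function names, and the fixed seed are all assumptions for illustration; the real hits table may be shaped differently.

```python
import random
from collections import defaultdict


def sample_per_stratum(hits, per_stratum=10, seed=0):
    """Draw up to per_stratum random hits for each (info_type, likelihood) pair."""
    rng = random.Random(seed)  # fixed seed so reviewers can reproduce the sample
    strata = defaultdict(list)
    for hit in hits:
        strata[(hit["info_type"], hit["likelihood"])].append(hit)
    return {
        key: rng.sample(rows, min(per_stratum, len(rows)))
        for key, rows in strata.items()
    }


def hit_rate_by_stratum(labels):
    """labels maps a stratum to manual review verdicts (True = real positive)."""
    return {
        key: sum(verdicts) / len(verdicts)
        for key, verdicts in labels.items()
        if verdicts
    }
```

After a reviewer labels each sampled hit as true or false, `hit_rate_by_stratum` turns the verdicts into the per-category, per-likelihood hit percentage that triage would rely on.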

Also, false-positive rates should not be conflated with disclosure severity. A UK taxpayer ID is no less sensitive than a US taxpayer ID.

mlissner commented 7 years ago

The topics here break down to "Does this disclosure matter for this field type?" and "What can/should we do?"

For the "does this matter" question, I propose:

As for what can/should we do, I like the idea of doing a bit of sampling. That'll give us a feel for the issue and we can judge from there. @johnhawkinson, I'll give you access to this data set, and you should be able to sample and test the data. I'm trying to keep this a bit locked down since this is (potentially) an ID thief's dream, but I'm pretty sure you're clean! I think you'll get a "You have access" email in a minute.

As for contacting the courts, Carl Malamud did something like this ages ago with PACER data. Here's the letter he sent: https://public.resource.org/scribd/7512579.pdf

mlissner commented 7 years ago

OK, looks like I can't give permission to the dataset. Sai, I'll email you John and Ansel's info. (Yes, I'm avoiding putting your email addresses on Github....)

johnhawkinson commented 7 years ago

Bank numbers don't matter. You freely give them away and banks don't treat this as private info.

Agree-to-disagree on this point and both sub-points. We don't give them away freely (we give them away very restrictively) and banks absolutely treat them as private.

As for contacting the courts, Carl Malamud did something like this ages ago with PACER data. Here's the letter he sent: https://public.resource.org/scribd/7512579.pdf

I'm not sure Carl was as successful as we would hope for.

saizai commented 7 years ago

@mlissner Bank numbers are protected by law and the FRCP. If they're in an opinion, that's highly suspect.

You're making rather big assumptions about emails and phones. There are unlisted phone #s, and nothing's to say that what's listed in the opinion is what someone intended to release. VINs matter just as much as account numbers; they're immutable, and while technically yes they are written on the car and you could read it if you look through the windshield, in practice their obscurity is relied on — same as account #s.

I'm running another couple passes at the DLP API call (to see if it'll complete), and then a big join to get the content and context from the opinions table. Should make it much easier to peruse & validate.

johnhawkinson commented 7 years ago

I'm running another couple passes at the DLP API call (to see if it'll complete), and then a big join to get the content and context from the opinions table. Should make it much easier to peruse & validate.

For what it's worth, I'd rather just have links to the opinion (PDFs). [but maybe i am naive.]

saizai commented 7 years ago

Searching for the hit in the full text manually is a pain. Easier to have it highlighted.

But the table has the CL IDs, so should be pretty easy to pull up the PDF if you want to.

saizai commented 7 years ago

Per https://issuetracker.google.com/issues/67835464 here are the enum conversion values:

enum Likelihood {
  LIKELIHOOD_UNSPECIFIED = 0;
  VERY_UNLIKELY = 1;
  UNLIKELY = 2;
  POSSIBLE = 3;
  LIKELY = 4;
  VERY_LIKELY = 5;
}
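For anyone working with the stats table, the same mapping can be written as a Python `IntEnum` and used to keep only rows at or above a chosen likelihood. This is a sketch; the `at_least` helper and the row layout (ops, n/op, prob, type) are assumptions based on the table in this thread.

```python
from enum import IntEnum


class Likelihood(IntEnum):
    """Mirror of the enum values quoted above."""
    LIKELIHOOD_UNSPECIFIED = 0
    VERY_UNLIKELY = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    VERY_LIKELY = 5


def at_least(rows, minimum=Likelihood.LIKELY):
    """Keep (ops, n_per_op, prob, info_type) rows whose prob is >= minimum."""
    return [row for row in rows if row[2] >= minimum]


# SSN rows copied from the stats table at the top of the thread.
SSN_ROWS = [
    (40, 2.35, 5, "US_SOCIAL_SECURITY_NUMBER"),
    (11, 2.09, 4, "US_SOCIAL_SECURITY_NUMBER"),
    (123, 2.28, 3, "US_SOCIAL_SECURITY_NUMBER"),
    (3355, 2.53, 2, "US_SOCIAL_SECURITY_NUMBER"),
    (1110, 3.76, 1, "US_SOCIAL_SECURITY_NUMBER"),
]
```

Filtering `SSN_ROWS` with the default minimum leaves only the LIKELY and VERY_LIKELY rows, which is the closest thing to a "strict matches only" mode discussed here.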

johnhawkinson commented 7 years ago

Also, as above, this is not the docket info. It's the opinion itself. Yes, your contact info needs to be available on the docket so that people can contact you about the case — but why should it be in the published opinion?

One of the common patterns in N.D. Cal. is a proposed order which is filed by attorneys (so therefore has their email and phone numbers on the front page, because that's how they roll in California) which is converted to an actual order by the judge (as intended) with the consequence that phone and emails get preserved. E.g. https://www.courtlistener.com/docket/4178089/1270/apple-inc-v-samsung-electronics-co-ltd/


There are probably other ways in which this occurs, too.

You're making rather big assumptions about emails and phones. There are unlisted phone #s, and nothing's to say that what's listed in the opinion is what someone intended to release.

I guess I would disagree. Opinions are kind of tautologically intended for release. By at least some reasonable measure, the authoring Judge intended to release them.

VINs matter just as much as account numbers; they're immutable, and while technically yes they are written on the car and you could read it if you look through the windshield, in practice their obscurity is relied on — same as account #s.

This just isn't true. Equating their mutability isn't sufficient to determine they matter "as much as." You can steal money with account numbers. It's much harder to steal money with a VIN. But in both cases, if they're being listed in an opinion, they're probably irrelevant and no longer at real risk.

But, contrariwise, ecf.mad's daily report gives me:

. . . .
1:17-cv-10762-PBS United States of America v. ONE RED 2006 LAND ROVER RANGE ROVER SPORT, BEARING VEHICLE IDENTIFICATION NUMBER SALMF13486A216081, SEIZED FROM BYRON JONES ON JANUARY 24, 2012 2017-10-16 12:54:11 Chief Judge Patti B. Saris: ELECTRONIC ORDER entered Denied as stated in open court on 9/19/2017 re [4] Motion to Dismiss for Failure to State a Claim (Geraldino-Karasek, Clarilde)

 

which is to say, yeah, some of this stuff shows up, not only in opinions but in case captions. So that's a whole 'nother set of issues (is that an FRCP issue or a Local Rule issue or a Local Practice issue? I'm not really competent enough with respect to forfeiture to say).

anseljh commented 7 years ago

I've had a judge once tell me to take the attorney junk off a proposed order caption page, but yeah, 99% of the time a stipulation or proposed order that turns into an order will look exactly like that. It's not unique to the N.D. California. There will be zillions of these.

I am not greatly worried about VINs. It's worth spot checking to see if there's anything unexpected, but I expect we'll see mostly in rem cases like the lovely red Land Rover one above, and asset forfeitures. (In rem cases have the best case names.)

saizai commented 7 years ago

Please remember this is the opinions database, which doesn't include orders.

mlissner commented 7 years ago

I think we could make a prioritized list of which fields seem most problematic, but it seems that we're largely agreed it's the SSNs at the top (or near the top) of the list. For the sake of moving forward, I suggest we let John sample the SSNs, as he's offered to do, and then circle back to the other fields, depending on what John learns.

saizai commented 7 years ago

Agreed on that.

johnhawkinson commented 7 years ago

I suggest we let John sample the SSNs,

Just a note, John is still waiting for the list of opinions (offline) to sample :)

Please remember this is the opinions database, which doesn't include orders.

Does it not? The line between opinion and order is awfully fuzzy. Certainly if the metric is "free on PACER," then N.D.Cal. and C.D.Cal. routinely mark orders as opinions and thus they are free on PACER (which is great! More jurisdictions should do so!).

mlissner commented 7 years ago

This isn't the opinions that we got from PACER, but you're absolutely right that the line is fuzzy and we definitely have orders as well.

mlissner commented 1 year ago

So the last comment here was from five years ago, and I just stumbled upon it again. It seems like there's no momentum to get this done, so I'm closing. This isn't to say there aren't privacy issues in opinions (or filings), but rather to say that if we're going to tackle that problem, we should probably start a new thread (and I suspect this bigtable data is gone or obsolete).

Thanks for the discussion though. It was good and useful.