freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

RFC: Which cases should be available in public search engines? #691

Closed mlissner closed 3 years ago

mlissner commented 7 years ago

We've been getting a lot of removal requests lately, and it seems worth it to take a slightly less open approach to publishing.

Current proposals for removal from public search are:

  1. Non-precedential cases. This currently is about 500,000 cases.
  2. Cases from certain jurisdictions or courts that are more likely to have sensitive cases.
  3. Any case mentioning certain words ("asylum", for example).

We can make these rules as complex as we can dream up, but the idea here is to strike a better line between "publish everything in the name of openness" and "people have legit privacy rights".

For example, looking at the above, I don't see a lot of benefit to publishing non-precedential cases — they're likely to be small-time, nobody is likely to look directly for them, and the people that these pages do attract to CourtListener probably didn't want to land there anyway. OTOH, some of the people in these cases are legitimately bad people, like fraudulent accountants and child molesters. OTOH again, some of these people are trying to move on with their lives and have served their time. Are we more moral to hide these or to show them? It's not altogether clear.

Currently, we only hide cases that:

  1. Have a social security number (which we also X-out), or
  2. Have an alien ID number, or
  3. Belong to a party who emailed us to have the case blocked.

There are currently about 2000 cases that we've blocked through manual or automated means.
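The automated side of this is regex-driven. A minimal sketch of what such detection might look like, with illustrative patterns that are assumptions rather than the project's actual regexes:

```python
import re

# Illustrative patterns; the project's real regexes may differ.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # e.g. 123-45-6789
ALIEN_ID_RE = re.compile(r"\bA\d{8,9}\b")       # e.g. A12345678

def needs_blocking(text: str) -> bool:
    """True if the opinion text contains an SSN or alien ID number."""
    return bool(SSN_RE.search(text) or ALIEN_ID_RE.search(text))

def redact_ssns(text: str) -> str:
    """X-out any SSNs, mirroring the practice described above."""
    return SSN_RE.sub("XXX-XX-XXXX", text)
```

A matching case would both be X'd out and flagged for blocking; everything else passes through untouched.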

I'd be very interested in a discussion of how we might improve our approach to this problem.

JoshData commented 7 years ago

It might help to spell out what you think the privacy rights here actually are. e.g. Does it matter if an individual was on the losing side of a case (less privacy warranted) or the winning side (i.e. they should not be penalized for being falsely tried)? Could distinguish humans from other parties....

mlissner commented 7 years ago

I think any of these things could be relevant with the caveat that whatever we decide needs to be automatic. For example, we don't know who won a suit (and it's not always binary anyway), and we don't know if a party is a human, so those would be out (though still interesting to hear).

Colinstarger commented 7 years ago

Since you said to chime in on GitHub, I'm writing here rather than on Slack. Two initial thoughts.

(1) I would not want to take down "non-precedential" cases automatically. They can be very useful and I've used them many times. The federal rules were relatively recently changed to overrule a prior prohibition, in effect in some jurisdictions, on citing such opinions. See FRAP 32.1. Fact is, these cases can be very useful. So I'd only delete them IF they met the other criteria we identify. Standing alone, imho, non-precedential is not a reason to delete.

(2) In terms of thinking what to delete, what generalizations can you make from the requests that you've received and honored so far? Is that data set big enough to make rules from?

brianwc commented 7 years ago

Nothing's being deleted. You don't know me very well if you think I'd ever get rid of anything willingly! The question is merely about what we let the search engines put in their index. No litigant that has ever contacted us has ever much cared that the case was available to researchers on CL directly. They always are interested in just getting their name out of Google. That's the question.


mlissner commented 7 years ago

@Colinstarger Do you want to retract your comment after seeing @brianwc's clarification, or is that what you meant?

Colinstarger commented 7 years ago

Thanks for tagging me Mike. Obviously, the difference between de-indexing and deleting is huge. But I still stick to the basic points I raised: (1) so-called "non-precedential" cases should be indexed as a general matter; and (2) thinking about what to de-index should proceed based on evaluation of the 2000 currently blocked cases. So, to answer your question directly - I don't retract because I meant what I meant!

jraller commented 7 years ago

There is value in all of the following:

It looks like there are at least three methods by which a resource could be prevented from showing up in search results:

Resources include:

It appears that any actions by search engines are out of our control. We can respond to any request for removal, short of a crawler requesting that all resources be removed. It is the last category, preemptive hiding, that is the curious one.

I like the idea of keeping track of how many resources are not exposed to search engines and reporting that under "The Numbers". This serves the purpose of both disclosing to the public at large, without outing the particular resources that are hidden, and reminding us of what portion of our resources we are choosing to hide.

I think @Colinstarger makes a good case for keeping non-precedential cases unhidden.

I could see some family court, bankruptcy, asylum and other case types being considered for blanket hiding. I do think our policy needs to be explicit about these decisions.

Given that someone could use RECAP to make records more public than the court makes them (by removing the paywall and exposing their content to search engines), I'm leaning toward rules that hide portions of those resources automatically. This could be based on keywords or, if the metadata supports it, case types.

I'm less concerned at the moment about the audio of oral arguments, but that could just be because I'm not aware of what Google is capable of in that area.

I'm leaning toward keeping the Judge profiles as unhidden as possible.

I'd default to keeping visualizations unhidden unless a request was received.

PlainSite commented 7 years ago

PlainSite offers similar functionality to CourtListener, and we receive around ten takedown requests a day. It's admittedly quite difficult to handle them all, and it offers a sense of what it must be like to be an actual judge, constantly on the receiving end of a request firehose.

We think it's actually important to have multiple, competing legal research services precisely because different services will have different approaches not just to functionality, but also to the availability of data. Our Privacy Policy (https://www.plainsite.org/legal/privacy.html) spells out exactly how we handle requests. The one line summary is that we are strongly biased in favor of transparency and against any form of censorship (de-indexing, deletion, etc.), unless we see a valid court order, because:

  1. That approach is the only approach consistent with court precedent. Every takedown response we decline includes the following text (with links that don't carry over here): "There is a presumption of public access to court records and you have not identified a compelling privacy interest. "'Historically, courts have recognized a ‘general right to inspect and copy public records and documents, including judicial records and documents.’' Kamakana v. City & County of Honolulu, 447 F.3d 1172, 1178 (9th Cir. 2006) (quoting Nixon v. Warner Communications, Inc., 435 U.S. 589, 597 & n.7 (1978)). 'This right is justified by the interest of citizens in ‘keep[ing] a watchful eye on the workings of public agencies.’' Id. (quoting Nixon, 435 U.S. at 598). Access to judicial records, however, is not absolute. Id. A party seeking to seal a pleading or a dispositive motion (as well as any attached exhibits) must show that there are 'compelling reasons' to do so and that outweigh the public's interest in disclosure." See also Doe v. Lee, Docket No. 23, and Eng v. Pacermonitor.com et al."
  2. We aren't judges, elected or appointed. If the courts have put something in the public domain, we treat it as such.
  3. We and our principals have been in court before, repeatedly, against powerful defendants. There is always an incentive for the powerful and corrupt to hide information, and they are well aware of that incentive. When information cannot be obtained, fraud thrives. See "Donald Trump".
  4. Our experience handling takedown requests for five years suggests that the ones who shout the loudest and most viciously about being harmed are, without exception, the most dangerous individuals who merit the most transparency. Note the remarkable overlap in our tag for "Individuals Who Have Threatened PlainSite" (https://www.plainsite.org/tags/individuals-who-have-threatened-plainsite/) and robots.txt for Justia (https://dockets.justia.com/robots.txt).

One important difference, to the best of my knowledge, between our handling of takedown requests and CourtListener's is that our system is now almost completely automated. We get the odd request that requires a custom e-mail response, but most fit into our pre-determined categories for approval or declination. This greatly enhances request clarity, efficiency and [non-]customer satisfaction.

As for what to leave and what to de-index, it's not really our place to tell our competitors how to do business or run their non-profits. But that's how we run ours, and we run ours in accordance with principles that we believe are best for the country and the courts overall.

mlissner commented 7 years ago

Thanks for all the comments so far. A couple responses:

  1. @Colinstarger: I'm not sure how we'd analyze the 2000 cases so far. Here are a couple breakdowns, but I'd welcome more suggestions of ways to slice/dice the data:

    • 1400 are published (74%), 500 are unpublished (26%). Of the whole corpus, the breakdown is 3,398,756 published (86%), 539,529 unpublished (13%). So we definitely see that unpublished cases are more likely to receive a take down request. I'm uncertain if we should remove these from Google by default.
    • These are the top courts with more than 20 requests:

      (u'texapp', 412),
      (u'ca9', 118),
      (u'ca5', 70),
      (u'ca8', 60),
      (u'ca2', 58),
      (u'ca4', 56),
      (u'ca7', 51),
      (u'ca6', 49),
      (u'ca11', 47),
      (u'calctapp', 43),
      (u'ca10', 42),
      (u'ohsb', 37),
      (u'ca3', 35),
      (u'vactapp', 33),
      (u'connsuperct', 31),
      (u'dcd', 30),
      (u'washctapp', 29),
      (u'ctd', 29),
      (u'haw', 26),
      (u'vaed', 23),
      (u'ca1', 22),
      (u'ohioctapp', 21),

      Some of this probably reflects the number of items we have from each court (I know we have a lot from Texas, for example).

    • 63 (3%) of the blocked cases contain the word "asylum", compared to 29k (0.7%) in the entire collection. So asylum cases are an example of cases that people really want down. It's also worth noting that these people often barely speak English and they still figure this out. Notably, about 55% of asylum cases are unpublished, compared to the 13% in the entire corpus I mentioned above. I've never seen such a high number before for any search query.

      At this point, I'm inclined to take all asylum cases not in scotus out of public search engines.

    • What other ways should we analyze this?
  2. @jraller: I don't think we'd want to say how many cases are blocked, at least not on the homepage. This is available via an API request, but in general I don't think it's something most users would understand or be interested in.

    Also, sorry if this wasn't clear, but for the moment I'm only thinking about opinions, not RECAP, or judges or other object types. Thanks for your thoughts on all this though. It's still good to hear.

    I could see some family court, bankruptcy, asylum and other case types being considered for blanket hiding. I do think our policy needs to be explicit about these decisions.

    I could get behind this too, but I don't think we have any family courts. For bankruptcy, what we've done in RECAP is to say that we hide bankruptcy cases with fewer than X number of docket entries under the assumption those are minor bankruptcies, and bigger (more important ones) will have longer dockets. It'd be nice to do the same thing here. Ultimately though, compared with things like asylum and child molestation, I'm less motivated to hide bankruptcies — they're embarrassing, but not like the other things.

  3. @PlainSite: I'm uncompelled by the argument that these cases should be in search engines because that's what courts do. If we knew courts/lawyers were thinking about this issue in any capacity, I'd agree with you, but my sense is that they're only just barely aware of the issue...sometimes...in some jurisdictions.

    I also think there's a distinction to be made between "In Google" and "In CourtListener". As @brianwc said, we don't remove things from CourtListener without a court order of some kind, and we agree that posting these on CL is incredibly important (obviously!). But, the argument about putting them into Google is less clear to me.

    Our experience handling takedown requests for five years suggests that the ones who shout the loudest and most viciously about being harmed are, without exception, the most dangerous individuals who merit the most transparency.

    I haven't analyzed this, but I do read the cases from some of the requests we get. It's a mixed bag. I'm sympathetic to some, but not to others, but in any case, I'm uncomfortable picking a side in most cases, even if we did have the resources to read and analyze all of them. Feels extra-judicial.

    Thanks for your comments here. Valuable to hear your thoughts and approach.

speedplane commented 7 years ago

I run Docket Alarm, which faces similar issues as other organizations here. We are upfront about our removal policy on our removal page.

I agree with @PlainSite that we are not judges. However, Docket Alarm comes to the opposite conclusion: we assume that if someone asks to suppress a page, it may be causing them harm, and unless we see an obvious public interest in the availability of the case, we suppress the page from search engine results.

We strive to automatically suppress from search engines the categories listed on our removal page using a variety of automated means. Further, we automatically suppress cases from search engines that contain keywords that suggest privacy interests may be at issue (e.g., "divorce", "children", "minor", etc.).

Some other pragmatic issues to consider:

  1. Aggressively publishing sensitive material (even if legal) may hurt efforts to open up court records, which in turn will have a negative impact on our larger and longer term goal to make legal information accessible.
  2. Being responsible stewards of sensitive material (even if public) raises the credibility of each of our organizations individually, as well as our industry group as a whole.
  3. Many take-down requests come from people who were not involved in a lawsuit, but have the same name as someone else who was. We should strive to eliminate confusion and any collateral damage caused by a strict open access policy.
  4. Many take-down requests come from people with limited monetary means, and we should not disproportionately harm those that are less able to protect themselves.
  5. Many take-down requests come from people with limited technical means. The take-down process should be easy-to-use so that those with limited technical abilities are not disproportionately harmed because they are unable to use an interface.
  6. The EU's "right to be forgotten", COPA, and some state laws potentially raise legal issues. Being a responsible legal steward of legal information will reduce potential risk to each of our organizations individually, and to our group as a whole.
  7. Under no foreseeable circumstances should we impose a fee on users to suppress material. While I'm sure someone could make a buck, it's immoral and we should condemn this behavior.
  8. While we suppress certain pages from search engines, paying users see all. The goal is to avoid causing harm as a result of informal Google searches, not to eliminate the record altogether.

In short, our mission should be a positive one on society, which responsibly considers and balances the benefits and harms of making court material readily accessible. We need to make clear that we take this responsibility seriously. Docket Alarm is happy to cooperate with the other organizations here to present a joint policy.

mlissner commented 7 years ago

@speedplane Thanks for sharing all this detail. I appreciate hearing your thoughts. You raise a lot of good points that my experience has borne out as well.

Can you share any more details about your automated system? For example it sounds like you're using keywords. Can you share the ones you're using?

I'd be down for creating a joint policy, but I don't know we have the bandwidth to build that kind of consensus.

brianwc commented 7 years ago

Wow. I'm glad so many people have contributed to this conversation. As a recovering philosopher, my inclination is to have long discussions on the most arcane subjects imaginable, but in this instance I'm less inclined to engage on all the good points being made because I'm compelled in this case by very mundane practical considerations. That is:

1) I have long wanted us to automate this process. If that's what @mlissner intends to do as a result of this discussion, then I'm open to a much wider range of solutions.

2) However, I have the impression that he instead intends to simply automatically de-index from search engines some subset of the corpus in an attempt to bring the request volume down, and to continue to handle the remainder manually. Mike's time is better spent on almost anything else other than manually responding to these, so I want to minimize that effort. Consequently, I'd favor de-indexing automatically:

I want people to be able to find CL in search engines, but excluding these still leaves millions of entries in public search results and would, I hope, reduce the manual workload of responding to these requests. Mike said all non-scotus asylum, but I think maybe the Circuit court level ones are too valuable a resource.

If one day we had better subject-matter tagging of cases, we could consider a non-jurisdictional approach (all employment cases at the trial court level, for example), but until then we have to take a more blunt approach in order to move on to more important tasks.

mlissner commented 7 years ago

@brianwc The approach I'm planning is to write a script that runs once per day. When it runs, it will use search to query for items that should be blocked. Any that come up will be blocked, and the first time it runs it will take a lot of stuff out of Google. If you can imagine a search query, we can put it into effect.

The two suggestions we've had so far are keyword- and jurisdiction-based blockages. So if we can come up with a list of keywords and jurisdictions that should be blocked, we should be in business for both existing content and anything we scrape in the future.
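As a sketch of how such a list could drive the daily script, each rule might be compiled into one search string; the rule table, court codes, and `court_id:(...)` query syntax here are illustrative assumptions, not an agreed policy:

```python
# Hypothetical rule table: each entry becomes one search query run by the
# daily script. Keywords and court codes are examples only.
BLOCK_RULES = [
    {"keywords": ["asylum"], "courts": ["dcd", "ctd"]},
    {"keywords": ["divorce"], "courts": ["texapp", "calctapp"]},
]

def build_queries(rules):
    """Turn each rule into a '(keywords) AND court_id:(...)' search string."""
    queries = []
    for rule in rules:
        kw = " OR ".join(f'"{k}"' for k in rule["keywords"])
        courts = " OR ".join(rule["courts"])
        queries.append(f"({kw}) AND court_id:({courts})")
    return queries
```

Because the rules are plain data, adding a keyword or jurisdiction later is a one-line change rather than new code.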

mlissner commented 7 years ago

Here's a proposal wrapping up the above. I'm throwing this over the fence so we can have something concrete to discuss.

Once per day, a script runs that tells Google not to index the union of:

This will be on top of the regex approach we currently use to suppress SSN and Alien ID numbers.

Thoughts?

PlainSite commented 7 years ago

My two cents: the above approach is overly broad.

This goes to CourtListener's core purpose. If it's only to make readily available (e.g., in Google, Bing, etc.) the least controversial cases that offend the fewest people, you might want to re-evaluate that aim. Obviously I'm not suggesting the opposite, that transparency sites should seek to offend. There has to be a balance. But balance takes a lot of work, not just keywords.

PlainSite definitely would not adopt or participate in any coordinated policy such as the one described above.

brianwc commented 7 years ago

So, as I suspected, it sounds like @mlissner is not automating the removal of whatever gets requested via our form, in which case I still just want to minimize his time spent on this for now. So, whatever policy we choose now, I intend as an interim measure (granted, that might last years) until the request-response process could be automated or mostly automated.

Ignoring the federal circuit courts, "asylum OR divorce OR minor" yields 383,707 precedential opinions today. I think that's probably overbroad. "asylum OR divorce OR (minor AND child)" brings it down to 223,852. Need to go to separate queries to make progress...

I think "divorce" probably only actually deals with a divorce in state court cases, so if you did a separate query that just did "divorce" in state courts that are NOT the court of last resort in that state (after 1967--50 years is a long time) it would bring that # down to 61,255 cases today. (If your divorce makes it to your state's Supreme Court, something interesting happened.)

"minor AND child" in that same set of lower-level state courts (after 1967) yields 8,243 opinions today.

Also, if you separately do "asylum" just in the federal district courts (after 1967), you only get 986 opinions. That gets the total for those queries in the ballpark of 70,000. Much better than the 383k I started with. Doh--None of this includes RECAP. Well, I'll leave that as an exercise for the reader.
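Summing the narrowed per-category counts above confirms the ballpark:

```python
# Counts quoted in the comment above (post-1967 queries).
divorce_lower_state = 61_255  # "divorce" in lower-level state courts
minor_child = 8_243           # "minor AND child" in the same courts
asylum_district = 986         # "asylum" in federal district courts

total = divorce_lower_state + minor_child + asylum_district
print(total)  # 70484, roughly 70,000 versus the 383,707 from the broad query
```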

@mlissner, you should separately check what the dates are of the currently blocked cases. Perhaps 1967 is too early. Do people really complain about cases before 1980? 1990? Is there some cutoff after which people seem not to care as much (that covers >80% of the cases)?

mlissner commented 7 years ago

@PlainSite It's a balance. We'll never have the resources to vet every case, so we need some default suppression policy that minimizes damage, exposes as much bad as possible, and avoids being extra-judicial. I don't think that the exposure we provide really moves the needle for most of the instances you mention above (e.g. nobody is Googling for Trump's bankruptcy, landing on our site, and breaking news). OTOH, I think we're measurably damaging people's reputation and life in some of these other instances. I want to publish as much as possible, but I don't want to ruin anybody's life. (It also doesn't help that getting these out of Google is nearly impossible.)

@brianwc Here's the histogram numbers for currently blocked cases by decade (sorry it's a crappy screenshot of a spreadsheet):

[screenshot from 2017-06-05: histogram of currently blocked cases by decade]

The vast majority of these are from the 90's onwards, with about 5% from the eighties.

Another thought — I've noticed a fair number of people that were jailed for drug offenses. I'm inclined to block these too in the lower courts. My gut is these are largely victims of the drug war. When we've heard complaints, they were from people that were jailed and are now free again, trying to start over.

mlissner commented 7 years ago

OK, final proposal that I think I'm going to move forward with, at least for the time being. We block the union of:

  1. Everything containing "divorce" in a lower state court in the last 30 years (45k cases); AND
  2. Everything containing "minor" and "child" in a lower state court in the last 30 years (54k cases); AND
  3. Everything containing "asylum" in federal district courts in the last 30 years (1k cases); AND
  4. Everything containing "grams of cocaine" or "grams of crack cocaine" or "grams of marijuana" in any non-appellate court (state or federal) in the last 30 years (10k cases); AND
  5. Everything that's non-precedential in the last 30 years (537k cases); AND
  6. Everything in the last 30 years in these courts, as listed by @brianwc (52k cases, 36k are bankruptcy):

    • The specialized military courts (United States Air Force Court of Criminal Appeals, Armed Services Board of Contract Appeals, Court of Appeals for the Armed Forces, Army Court of Criminal Appeals, United States Court of Military Commission Review, Navy-Marine Corps Court of Criminal Appeals, United States Court of Appeals for Veterans Claims, Board of Veterans' Appeals)
    • U.S. Tax Court and all state-level tax courts
    • Merit Systems Protection Board
    • All district court-level bankruptcy opinions
    • All Workers Compensation Commissions, Industrial Claim Appeals Offices, Compensation Review Boards, or Departments of Industrial Accidents

(30 years seems to be enough to satisfy the vast majority of the block requests we've gotten so far.)
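One way to keep the six rules above auditable would be to express them as data that the daily job iterates over. The field names, scope labels, and query strings below are illustrative assumptions, not CourtListener's actual schema:

```python
YEARS_BACK = 30  # every rule is limited to "the last 30 years"

# Illustrative encoding of the six rules; the special-courts scope stands
# in for the military, tax, MSPB, bankruptcy, and workers' comp courts.
FINAL_RULES = [
    {"q": '"divorce"', "scope": "lower-state-courts"},
    {"q": '"minor" AND "child"', "scope": "lower-state-courts"},
    {"q": '"asylum"', "scope": "federal-district-courts"},
    {"q": '"grams of cocaine" OR "grams of crack cocaine" OR "grams of marijuana"',
     "scope": "non-appellate-courts"},
    {"q": "precedential_status:non-precedential", "scope": "all-courts"},
    {"q": "*", "scope": "special-courts"},
]
```

Expressed this way, the blocked-case counts per rule can be re-measured at any time by re-running each query.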

Rules 1-4 are about 120k cases and help minors, asylum seekers, victims of the drug war, and divorcées. I'm pretty OK with these just coming out of Google, though I worry a bit about the drug war rule, that we'll be aiding some pretty bad people.

Rule 5 is a big one, over half a million cases, but the court said these weren't relevant, and we hide them from our own results by default, so I guess I'm OK with these being hidden from Google too.

Rule 6 I don't have a strong opinion on because I mostly don't know what goes down in those courts. I'll defer to @brianwc as a result. I'm a bit hesitant to pull down so much bankruptcy content though. A fair bit of this is probably corporate bankruptcies...do we really want to hide those too? Could we use bankruptcy filing chapters to refine this?

mlissner commented 6 years ago

Another good word to block: "paternity"

mlissner commented 4 years ago

Another one: "Wrongful termination". I think these have a disproportionate impact because if an employer gets the scent that you might not be easy to fire, they will definitely not hire you.

mlissner commented 3 years ago

I'm finally proceeding with the plan, above, to remove copious amounts of cases from Google. I've already removed a couple hundred thousand cases, and more are getting removed as I type.

I'll be adding this to the code on a going-forward basis too.

Last chance to review the categories, and to add some others.

mlissner commented 3 years ago

I used all the links in the comment above, except:

The code I used was:

def block_by_str(s):
    page_size = 500
    main_query = build_main_query_from_query_string(
        s,
        {"rows": page_size, "fl": ["id", "cluster_id", "docket_id"], "caller": "cli"},
        {"highlight": False},
    )
    search = si.query().add_extra(**main_query)
    si.conn.http_connection.close()
    cluster_ids = set()
    docket_ids = set()
    paginator = Paginator(search, page_size)
    for page_number in paginator.page_range:
        print(f"Gathering ids. Doing page: {page_number}. {len(cluster_ids)} clusters; {len(docket_ids)} dockets.")
        page = paginator.page(page_number)
        for item in page:
            cluster_ids.add(item["cluster_id"])
            docket_ids.add(item["docket_id"])

    print(f"Found {len(cluster_ids)} clusters and {len(docket_ids)} dockets.")
    print("Updating clusters by chunk...")
    total_ocs = 0
    for i, chunk in enumerate(chunks(cluster_ids, 1000)):
        print(f"Doing cluster chunk: {i}")
        # Only touch items that aren't blocked yet; the original session
        # filtered on blocked=True, which would have skipped every new match.
        ocs = OpinionCluster.objects.filter(pk__in=chunk, blocked=False)
        total_ocs += ocs.update(blocked=True, date_blocked=now())

    print("Updating dockets by chunk...")
    total_ds = 0
    for i, chunk in enumerate(chunks(docket_ids, 1000)):
        print(f"Doing docket chunk: {i}")
        ds = Docket.objects.filter(id__in=chunk, blocked=False)
        total_ds += ds.update(blocked=True, date_blocked=now())

    print(f"Updated {total_ocs} clusters and {total_ds} dockets.")

And in the end, it was about 400k cases that got nuked. I will try to get this into our ingestion pipeline tomorrow so it's maintained, but in the meantime, I'm glad to have made these less accessible.

mlissner commented 3 years ago

I also added "grams of marijuana".

mlissner commented 3 years ago

This thread should have always been in a forum, not a bug tracker, and accordingly, I've posted it there for further discussion. Sorry for the many directions.

mlissner commented 3 years ago

There are two ways to do this. The first is to do it as we get content; the second is to do it once every week or month or whatever. The code is easier to write as a batch job because it's just a bunch of searches, but the bad part about that is that we have to mark the same cases as blocked over and over again b/c the blocked boolean isn't in the search DB.

So...doing it on an ingestion basis is probably better. It's more efficient anyway, most likely, but we'll have to hook it into every place we create content.
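A minimal sketch of the ingestion-side approach, using a stand-in dataclass in place of the real Django OpinionCluster model and a hypothetical keyword list:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical keyword list drawn from the rules discussed in this thread.
BLOCK_KEYWORDS = ("asylum", "divorce", "paternity", "wrongful termination")

@dataclass
class Cluster:
    """Stand-in for the Django OpinionCluster model."""
    text: str
    blocked: bool = False
    date_blocked: Optional[datetime] = None

def block_on_ingest(cluster: Cluster) -> Cluster:
    """Apply the keyword rules at save time instead of in a batch job."""
    lowered = cluster.text.lower()
    if any(kw in lowered for kw in BLOCK_KEYWORDS):
        cluster.blocked = True
        cluster.date_blocked = datetime.now()
    return cluster
```

Hooking a check like this into every content-creation path avoids re-marking already blocked cases, at the cost of having to find each of those paths.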

mlissner commented 3 years ago

Over 1,000 issues and PRs have been filed since this ticket was created. That's really not great, but I'm happy to say that it's finally fixed as of now. This should be a privacy boost for those who are named in our cases.