johnhawkinson commented 6 years ago

New electronic docketing system was introduced at the United States Supreme Court this week. Unfortunately it didn't seem to change much of their existing public-facing system, but we should scrape the system for filings and index them.

Some structural notes:

There is no way to get a list of recent filings across all cases, such as a global RSS feed.
Each case has its own RSS fee, e.g. https://www.supremecourt.gov/rss/cases/17-74.xml. For some reason the timestamp seems to bump daily, even though there is no visible docket activity. Also, the timestamp is marked as EDT but really is in EST.
Each case has a JSON and XML report: http://www.supremecourt.gov/RSS/Cases/JSON/17-74.json http://www.supremecourt.gov/RSS/Cases/XML/17-74.xml
Although the above don't have proper timestamps for linke PDFs, the name of the link contains a timestamp in "YYYYMMDDhhmmss{milliseconds}" format. e.g.: http://www.supremecourt.gov/DocketPDF/17/17-74/19830/20171113205323142_17-71%2017-74%20Weyerhaeuser%20Co.%20Markle%20Interests%20LLC.pdf

flooie commented 1 year ago

@johnhawkinson I assume this is stale and no longer and issue.

johnhawkinson commented 1 year ago

I assume this is stale and no longer an[] issue.

Err…why would you assume that?

I think everything in the original PR remains true, and with appellate RECAP on the horizon, setting sights on the Supreme Court seems more achievable.

flooie commented 7 months ago

Do we not already collect everything? @johnhawkinson

mlissner commented 7 months ago

This is about filings too, @flooie.

ralexx commented 5 months ago

I have a working implementation that downloads the SCOTUS docket JSON feeds and stores them in Sqlite. It uses a subset of Juriscraper's dependencies. Happy to work on adapting this if you'd like a PR, but I will need design guidance on where to fit this into the package structure and workflows.

mlissner commented 4 months ago

Ooooh, that's fun @ralexx. I think we'd want it to have a directory next to the pacer directory.

With that in mind, do you want to suggest a class hierarchy, and perhaps @flooie or @grossir can help guide your work so it fits with the rest of our style/approach/architecture?

ralexx commented 4 months ago

Sounds good, @mlissner.

Is this issue the best place to ask further questions?

mlissner commented 4 months ago

Yep, it's a great place to discuss things.

ralexx commented 4 months ago

I think we'd want it to have a directory next to the pacer directory.

I will work in juriscraper/juriscraper/scotus_docket and juriscraper/tests/examples/scotus_docket. You've used underscore naming for the oral_args/ directory so I'm following that, but I also noticed that pacerdocket.py doesn't use an underscore so happy to fit whichever style.

My first design question is about how you (@flooie and @grossir) would prefer to handle obtaining the Supreme Court docket numbers with which to scrape dockets. The only two sources of truth I've found are PDF documents:

The Journal of the Supreme Court, which appears to be a comprehensive listing of cases regardless of disposition.
The Supreme Court "Granted & Noted List", which looks like a subset of cases where certiorari has been granted.

The PDFs are pretty clean and I've had reasonable success extracting docket numbers and even some unique metadata from the Granted & Noted List using pypdf and regex. However, I'm not sure if you want to add pypdf as a package dependency. And the more prose-like format of the Journal may be a bit more tricky to parse; I haven't attempted this as yet.

Instead I resorted to brute force and am just trying docket numbers sequentially (I am rate-limiting my requests to 1/sec). It's hacky but it works. The challenge from a resource-usage point of view is that there are discontinuities in the docket numbering, e.g. YY-2000 to YY-4999 seem largely unused but from YY-5000 to YY-12000 about 10% of docket numbers are used. I tried using search algorithms to reduce the search space of docket numbers but it wasn't worth the effort.

Are you aware of any other comprehensive list of Supreme Court dockets?
Absent that, do you prefer the PDF parsing approach or the brute force approach to identifying active dockets?
If you prefer brute force, an easy way to shrink the docket number search space over time is excluding previously identified docket numbers. That would require persistent data storage that doesn't appear to be part of Juriscraper. Is this an approach you would like to investigate?

grossir commented 4 months ago

I have been looking into this and I need some clarification on the scope and integration into CL of this source @mlissner @flooie

What DB tables are we going to fill with this data?

It seems to me that this source contains mainly Docket and DocketEntry objects, and some related objects such as OriginatingCourtInformation, Party and some PDF Documents that belong to the DocketEntry object. About the documents, an example highlighted in blue from here:

However, I think we do not create DocketEntry from any source except RECAP. Should we change this here?

How are users interacting with this data?

I am guessing this is for setting up Docket Alerts, as in RECAP? If that's the case, the scraper should have 2 starting points:

"discovery" of previously unknown dockets (which ties into @ralexx analysis)
re-scraping known dockets (to fulfill alerts)

About the re-scraping, the RSS endpoint data is not static. If you check the example from the original comment, it has docket entries which are more recent than the comment itself.

Anyhow, I think we would need we will need to create a new caller on courtlistener to call the new juriscraper scraper and ingest this data, in the folder courtlistener/cl/scrapers/management/commands/, something like cl_scrape_dockets.py

grossir commented 4 months ago

hi @ralexx thanks for looking into this.

About folder structure

I will work in juriscraper/juriscraper/scotus_docket

I think it should be juriscraper/dockets/united_states/federal_appellate/scotus.py since we may want to support other docket scrapers in the future.

Also, this structure mirrors the one on juriscraper/opinions and juriscraper/oral_args

I will work in ... juriscraper/tests/examples/scotus_docket

similarly, I think it would be better to mirror examples/opinions/united_states and examples/oral_args/united_states and use examples/dockets/united_states/

About getting the docket list...

I don't think we should brute force the search, we try to be as gentle as possible with the server, since we use a user agent that identifies us (except for a couple of sources).

I am going to limit myself to checking the case of "discovery" of new dockets, as mentioned in the previous comment.

... from `Docket Search` page

There is a Docket Search page. The search string is used in a full text search, so if you query for February 2024, it returns dockets that contain such string. For example:

"SET FOR ARGUMENT on Wednesday, February 21, 2024",
"February 14, 2024 United States Court of Appeals for the Eighth Circuit Feb 01 2024 Motion to direct the Clerk"

The search supports exact results operator "February 21, 2024", which returns fewer results. However, the search is limited to the last 5 years, and to 500 results, and has a page size of 5, which could make us do several requests.

... from PDFs

About using the PDFs as source of docket numbers, I think it is a valid idea. About adding the pypdf dependency I don't know if @flooie is against it design wise, since we use doctor to extract text from PDFs. Maybe we could send requests to doctor from juriscraper?

About the Journals, they seem to be published / updated once each year, and the end of the term, so we wouldn't get fresh data from them (which is important if we are implementing this for alerting). They seem great for backscraping old cases.

The Granted and Noted List documents are updated more frequently, and we could indeed get the list from there. However, the "most recent" document October Term 2024 has an updated dated older than the one on the second document October Term 2023 (January 22, 2024 vs February 8, 2024), so we would have to check both

I think that if one of the use cases of this data is to alert users (previous comment), we should use the HTML Docket Search to collect the docket numbers, since by using a date-like query we will get recent and active cases, and do not limit ourselves only to the "Granted and Noted" list. What do you think @flooie?

flooie commented 4 months ago

@grossir have you looked at the json endpoints @johnhawkinson mentions.

ralexx commented 4 months ago

About using the PDFs as source of docket numbers, I think it is a valid idea. About adding the pypdf dependency I don't know if @flooie is against it design wise, since we use doctor to extract text from PDFs. Maybe we could send requests to doctor from juriscraper?

I will work with whatever PDF extractor you prefer; I’m not familiar with doctor so I simply reached for something I have used.

About the Journals, they seem to be published / updated once each year

Actually the documents are refreshed with a lag of 2-4 weeks. If you look at eg. p. 306 of the 2023 journal (the last page as I write this) you can see it is dated Jan. 10, 2024. But still not frequently enough to be useful, as you noted.

flooie commented 4 months ago

I assume we should build this to begin adding dockets directly into CL @mlissner?

@ralexx - doctor is our open source micro service tool we use to process documents for Courtlistener.

ralexx commented 4 months ago

Using full text search

It sounds as if belt-and-suspenders search for docket numbers is the way to go. @grossir you make a good point about the full-text search feature: I didn’t play with it much because it doesn’t expose the underlying database API, so one more layer of abstraction.

The full-text search definitely returns spurious matches for our purpose. We can reject them but we may still have to request the dockets and parse them to validate the docket entries. For example, on result from searching on “Feb|February 16, 2024” is

“Motion to extend the time to file a response is granted and the time is further extended to and including February 16, 2024, for all respondents.”

https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/23-402.html was updated on 2024-02-02 but the search is matching the entry’s text.

The SCOTUS server supports the If-Modified-Since HTML header so we may be able to avoid the parsing step.

I’m on vacation this week, when I get back I will test full text search against a sample of dockets in the Journal. If that search returns a superset of the dockets with new entries on a given date then I will work on filtering out spurious matches before I report back.

flooie commented 4 months ago

@ralexx @grossir

The SCOTUS orders content looks like it could be a comprehensive source for all docket numbers. After reviewing the order list and the miscellaneous orders that are regularly published, I believe we have a solid method to identify all docket numbers along with their corresponding JSON endpoints.

I put together a simple script and managed to extract 3,199 docket JSON endpoints for the 2023 Term from the SCOTUS Orders page. Unless I am missing something, I dont think we need to mess around with docket search endpoints and can just parse the docket numbers from the orders when they are published.

ralexx commented 4 months ago

@flooie, take for example 23A745, Trump v. US (the immunity appeal from DC circuit). There hasn’t been any order on the case so I don’t think it appears on the Orders: there was an amicus brief filed on 2024-02-20 that I did not find in the Orders PDF for the same date.

Particularly in higher profile cases there are amicus briefs on the docket that may be of general interest; these tend to appear prior to an order (one exception appears to be an order denying amicus curiae leave to file). Relying on orders for docket numbers could leave a substantial lag before such dockets were scraped. I can test this next week.

flooie commented 4 months ago

Got it.

mlissner commented 4 months ago

Sorry to be a bit slow chiming in. I've been utterly swamped with meetings this week. A few things come to my mind as the boss man 'round these parts:

Goals

It sounds like we're on a path to scrape all SCOTUS content. That would be phenomenal, and it's something we should have been doing since the day they stood up their open docket website.

If we are to do this, speed and completeness will be essential, because SCOTUS content constantly goes viral among journalists. It's embarrassing for something with a gazillion re-posts not to be in our system at all.

Architecture

Do we want to use our current tables in CL?

I'm always on the fence about this, and it's a huge question: If we're going to add filings to CL from dozens or hundreds of courts, do we do it in one schema or in dozens? When we did Los Angeles Superior Court, it got its own schema and its own django app, and that seemed good. I think it's generally the right way to go.
How aggressively do we scrape?

Normally we try to be gentle on court server, but this is SCOTUS and from what I've gathered they're the one court in the land that actually has some sort of scalable architecture. We should still be smart and kind, but I think we can crawl aggressively if needed to hit completeness and speed goals. I would advocate for the least aggressive approach that is guaranteed to work within a few minutes of content being posted.
Do we store state in Juriscraper?

No. Juriscraper is supposed to be a library that other systems can build on top of. It is usually a wrapper for downloading and parsing particular pages on court web pages. For example you tell it:
- search for this query
- download this docket
- grab that PDF
And it provides you with JSON, response codes, and PDF binaries. If you need to do something like crawl the docket number space (except for certain numbers you already know about), the storage for that should be up a level in CourtListener or whatever other system is calling Juriscraper.
What about folder structure in Juriscraper?

I'm sorry to ask for another tweak here, but I would suggest one tweak with what Gianfranco said above. He suggested juriscraper/dockets/united_states/federal_appellate/scotus.py. I'd suggest going one step further with something like: juriscraper/dockets/united_states/federal_appellate/scotus/dockets.py. That will give you a module to add files for lots of different SCOTUS-related tools.
What about crawling/parsing PDFs?

This is usually my last resort. I'd rather download 100 URLs around the clock in a brute-force manner than rely on some slow, fragile, and unreliable PDF to do the job. But I haven't looked at the possibilities here. If it's the only way, so be it. I leave choice of library up to y'all, but as Bill knows, I usually prefer PyMuPDF. I think the API is a bit worse for scraping in particular, but the speed is (much) better and it's well maintained.

Putting this all together...

What all this means when you put it together is that we'll need the following components:

Juriscraper to scrape content and make it digestible
A daemon of some kind to call Juriscraper and scrape content around the clock.
Some database models, probably new ones, to store what we scrape.
HTML and views to display the content.
Integration into our APIs, bulk data, and search engines (somehow!).

I think that's it for my comments. Sorry this is a lot, but adding a new court is a fairly big task. I think it's worth doing though and I'm excited to have some momentum here.

ralexx commented 4 months ago

How do you (all) feel about the SCOTUS scraper having access to persistent state, whether an underlying docket database or its own database?

As far as I have seen, this doesn't happen elsewhere in Juriscraper: scraped data is presented to the caller for it to handle. I presume this statelessness is a design feature, but I'd like to clarify that.

Based on what I've found so far, there appear to be efficiency gains in downloading dockets when it's possible to discriminate between docket numbers that are known good and the rest. I will elaborate once I have some data to show you, but if you want SCOTUS scrapers to be stateless as a design principle then I won't work on/include that part of my prototype code.

mlissner commented 4 months ago

Yeah, I think it's best to keep Juriscraper stateless, but we'll certainly need that analysis when it comes to the calling code, so it can do that.

ralexx commented 4 months ago

I've looked into the different sources and types of information available at to guide my planning. Please let me know if you see that I'm missing something.

Published docket information

These are parts of supremecourt.gov most likely to be useful here. There are some additional sources such as the Orders of the Court by Circuit that I have omitted because they are merely different presentation of information found elsewhere.

Click for information sources summary table

| Source | Format | Completeness | Timeliness | |---|---|---|---| | [Journal of the SCOTUS](https://www.supremecourt.gov/orders/journal.aspx) | PDF | Supposedly, every disposition at the Court. Includes Bar admissions and other cruft. | "New Journal entries are posted on this website about two weeks after the event." That is being optimistic: as of this writing on March 1, 2024, the last Journal entry for the 2023 term is dated January 10, 2024. | | [Orders of the Court](https://www.supremecourt.gov/orders/ordersofthecourt/) | PDF | Unsigned orders, i.e. not including Opinions | "Regularly scheduled lists of orders are issued on each Monday that the Court sits, but 'miscellaneous' orders may be issued in individual cases at any time. Scheduled order lists are posted on this Website on the day of their issuance, while miscellaneous orders are posted on the day of issuance or the next day." | | [Granted/Noted Cases List](https://www.supremecourt.gov/orders/grantednotedlists.aspx) | PDF | Mostly grants of certiorari, including decided cases (for which there are opinions). Very limited subset of cases. | Less than a week of lag; as of March 1, 2024 the document is dated as of February 28, 2024. | | [Opinions Relating to Orders](https://www.supremecourt.gov/opinions/relatingtoorders/) | PDF | Only opinions that accompany select summary dispositions (typically dissents). These also appear at the back of the Orders documents. | "Any opinions...will be posted here on the day they are issued. " | | [Opinions of the Court](https://www.supremecourt.gov/opinions/slipopinion/) | PDF | All cases decided by the full court have their opinions published here. | These appear to be posted the day the case is decided. | | [Docket Search](https://www.supremecourt.gov/docket/docket.aspx) | HTML, JSON, XML | As far as I can tell, it's all here. | Reflects the timeliness of the underlying dockets. | | [Calendars and Lists](https://www.supremecourt.gov/oral_arguments/calendarsandlists.aspx): Argument Calendar | PDF | Contains docket numbers of cases scheduled for oral arguments. | Published 2-3 months prior to session (within a term) in which the arguments will be heard. | | [Calendars and Lists](https://www.supremecourt.gov/oral_arguments/calendarsandlists.aspx): Court Calendar | PDF | Waste of time unless you want to fine tune image recognition ML models. Easier to eyeball and transcribe; but can be useful for timing scrapes of Orders Lists, which appear "at 9:30 a.m. on the Monday following a Court conference, usually held three times a month when the Court is sitting". | Has all the key dates for a term. | |[Calendars and Lists](https://www.supremecourt.gov/oral_arguments/calendarsandlists.aspx): Hearing Lists | PDF | Published (late in the) prior week. | Somehow different from the Argument Calendar, I'm not clear on the distinction. Contains docket numbers of the cases to be heard on an upcoming day over the following week. | |[Calendars and Lists](https://www.supremecourt.gov/oral_arguments/calendarsandlists.aspx): Day Call | PDF | Apparently a daily update to the Argument Calendar, published the morning (ET) of the prior day. | Contains docket numbers of the cases to be heard that day. |

Docket pages

Email notification

...is available, but not if you're a 'bot. There is a graphical captcha interstitial to inhibit simply signing up for all dockets and receiving push notifications by email. Becase heaven forbid the unwashed masses might want what's theirs?

HTML pages

As returned by search queries: https://www.supremecourt.gov/search.aspx?filename=/docket/docketfiles/html/public/23-175.html

Also available directly: https://www.supremecourt.gov/docket/docketfiles/html/public/23-175.html

These are the pages linked to by the full-test docket search feature. They further contain links to JSON and XML representations of what I believe is the identical data. Not surprisingly I have not found systematic (or any, so far) discrepancies between HTML and JSON representations of dockets.

These pages are updated intra-day.

JSON pages

Found after navigating to RSS feeds of dockets: http://www.supremecourt.gov/RSS/Cases/JSON/23-175.json

XML representations are found by substituting ../XML/.xml in the above URL.

Contains all docket information. Updated intra-day.

Example: docket 23-939

{"CaseNumber":"23-939 ","bCapitalCase":false,"sJsonCreationDate":"03/01/2024","sJsonTerm":"2023","sJsonCaseNumber":"00939","sJsonCaseType":"Paid","RelatedCaseNumber":[],"PetitionerTitle":"Donald J. Trump, Petitioner","RespondentTitle":"United States","DocketedDate":"February 28, 2024","Links":"Linked with 23A745","LowerCourt":"United States Court of Appeals for the District of Columbia Circuit","LowerCourtCaseNumbers":"(23-3228)","LowerCourtDecision":"February 6, 2024","QPLink":"../qp/23-00939qp.pdf","ProceedingsandOrder":[{"Date":"Feb 12 2024","Text":"Application (23A745) for a stay, submitted to The Chief Justice.","Links":[{"Description":"Main Document","File":"2024-02-12 - US v. Trump - Application to S. Ct. for Stay of D.C. Circuit Mandate - Final With Tables and Appendix.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300410/20240212154110541_2024-02-12%20-%20US%20v.%20Trump%20-%20Application%20to%20S.%20Ct.%20for%20Stay%20of%20D.C.%20Circuit%20Mandate%20-%20Final%20With%20Tables%20and%20Appendix.pdf"},{"Description":"Proof of Service","File":"2024-02-12 - Certificate of Service for Stay Application.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300410/20240212154123465_2024-02-12%20-%20Certificate%20of%20Service%20for%20Stay%20Application.pdf"}]},{"Date":"Feb 12 2024","Text":"Petition for a writ of certiorari filed."},{"Date":"Feb 13 2024","Text":"Response to application (23A745) requested by The Chief Justice, due February 20, 2024, by 4pm (EST)."},{"Date":"Feb 13 2024","Text":"Brief amicus curiae of Jon Danforth, J. Michael Luttig, Carter Phillips, Peter Keisler, Larry Thompson, Stuart Gerson, et al. filed.","Links":[{"Description":"Main Document","File":"2024-2-13 Amici Curiae Brief Opposing Application for Stay.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300472/20240213120356911_2024-2-13%20Amici%20Curiae%20Brief%20Opposing%20Application%20for%20Stay.pdf"},{"Description":"Proof of Service","File":"Proof of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300472/20240213120405955_Proof%20of%20Service.pdf"}]},{"Date":"Feb 13 2024","Text":"Brief amicus curiae of Constitutional Law Scholars filed.","Links":[{"Description":"Main Document","File":"Trump v. US CAC Scholars Brief.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300499/20240213140115053_Trump%20v.%20US%20CAC%20Scholars%20Brief.pdf"},{"Description":"Certificate of Word Count","File":"Trump v. US CAC Cert Compliance.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300499/20240213140124180_Trump%20v.%20US%20CAC%20Cert%20Compliance.pdf"},{"Description":"Proof of Service","File":"Trump v. US CAC Cert of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300499/20240213140133124_Trump%20v.%20US%20CAC%20Cert%20of%20Service.pdf"}]},{"Date":"Feb 14 2024","Text":"Response to application from respondent United States filed.","Links":[{"Description":"Main Document","File":"23A745_Trump v. United States_Gov. stay resp_FINAL.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300627/20240214180323991_23A745_Trump%20v.%20United%20States_Gov.%20stay%20resp_FINAL.pdf"},{"Description":"Proof of Service","File":"23A745 - Trump v USA Certificate.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300627/20240214180338307_23A745%20-%20Trump%20v%20USA%20Certificate.pdf"}]},{"Date":"Feb 14 2024","Text":"Brief amicus curiae of Protect Democracy Project filed.","Links":[{"Description":"Main Document","File":"23A745 Trump v. USA_Amicus Brief.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300593/20240214142544324_23A745%20Trump%20v.%20USA_Amicus%20Brief.pdf"},{"Description":"Proof of Service","File":"23A745 Trump v. USA_Proof of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300593/20240214142520222_23A745%20Trump%20v.%20USA_Proof%20of%20Service.pdf"}]},{"Date":"Feb 15 2024","Text":"Reply of applicant Donald J. Trump filed.","Links":[{"Description":"Reply","File":"2024-02-15 - 23A745 - Reply iso Application to S. Ct. for Stay of D.C. Circuit Mandate.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300749/20240215174027604_2024-02-15%20-%2023A745%20-%20Reply%20iso%20Application%20to%20S.%20Ct.%20for%20Stay%20of%20D.C.%20Circuit%20Mandate.pdf"},{"Description":"Proof of Service","File":"2024-02-15 - Certificate of Service for Reply iso Stay Application.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300749/20240215174038799_2024-02-15%20-%20Certificate%20of%20Service%20for%20Reply%20iso%20Stay%20Application.pdf"}]},{"Date":"Feb 15 2024","Text":"Brief amicus curiae of David Boyle filed.","Links":[{"Description":"Main Document","File":"23A745_tsac_DavidBoyle.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300655/20240215114857700_23A745_tsac_DavidBoyle.pdf"}]},{"Date":"Feb 16 2024","Text":"Brief amicus curiae of Alabama and 21 Other States filed.","Links":[{"Description":"Main Document","File":"States Brief in Trump v US FINAL 2.16.24.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300793/20240216132806756_States%20Brief%20in%20Trump%20v%20US%20FINAL%202.16.24.pdf"},{"Description":"Proof of Service","File":"Certificate of Service for States Br. FINAL.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300793/20240216132818877_Certificate%20of%20Service%20for%20States%20Br.%20FINAL.pdf"}]},{"Date":"Feb 19 2024","Text":"Brief amicus curiae of Christian Family Coalition filed.","Links":[{"Description":"Main Document","File":"23A745 Amicus CFC.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300875/20240219151217308_23A745%20Amicus%20CFC.pdf"},{"Description":"Certificate of Word Count","File":"CERTIFICATE OF COMPLIANCE.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300875/20240219151223585_CERTIFICATE%20OF%20COMPLIANCE.pdf"},{"Description":"Proof of Service","File":"CERTIFICATE OF SERVICE.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300875/20240219151229160_CERTIFICATE%20OF%20SERVICE.pdf"}]},{"Date":"Feb 19 2024","Text":"Brief amicus curiae of Jeremy Bates filed.","Links":[{"Description":"Main Document","File":"amicus brief oppn to stay 2 19 2024.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300877/20240219152614536_amicus%20brief%20oppn%20to%20stay%202%2019%202024.pdf"},{"Description":"Proof of Service","File":"COS amicus brief oppn to stay 2 19 2024 .pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300877/20240219152631553_COS%20amicus%20brief%20oppn%20to%20stay%202%2019%202024%20.pdf"}]},{"Date":"Feb 20 2024","Text":"Brief amicus curiae of Former Attorney General Edwin Meese III, Law Professors Steven Calabresi and Gary Lawson, and Citizens United filed.","Links":[{"Description":"Main Document","File":"Trump v US Stay Amicus Final.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300972/20240220173530766_Trump%20v%20US%20Stay%20Amicus%20Final.pdf"},{"Description":"Certificate of Word Count","File":"Certificate Word Count.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300972/20240220173536501_Certificate%20Word%20Count.pdf"},{"Description":"Proof of Service","File":"Proof of Service.pdf","DocumentUrl":"http://www.supremecourt.gov/DocketPDF/23/23-939/300972/20240220173542733_Proof%20of%20Service.pdf"}]},{"Date":"Feb 28 2024","Text":"Application (23A745) referred to the Court."},{"Date":"Feb 28 2024","Text":"Petition GRANTED."},{"Date":"Feb 28 2024","Text":"The application for a stay presented to The Chief Justice is referred by him to the Court.  The Special Counsel’s request to treat the stay application as a petition for a writ of certiorari s granted (23-939), and that petition is granted limited to the following question:  Whether and if so to what extent does a former President enjoy presidential immunity from criminal prosecution for conduct alleged to involve official acts during his tenure in office.  Without expressing a view on the merits, this Court directs the Court of Appeals to continue withholding issuance of the mandate until the sending down of the judgment of this Court.  The application for stay is dismissed as moot.  \r\n     The case will be set for oral argument during the week of April 22, 2024.  Petitioner’s brief on the merits, and any amicus curiae briefs in support or in support of neither party, are to be filed on or before Tuesday, March 19, 2024.  Respondent’s brief on the merits, and any amicus curiae briefs in support, are to be filed on or before April 8, 2024.  The reply brief, if any, is to be filed on or before 5 p.m., Monday, April 15, 2024."},{"Date":"Feb 29 2024","Text":"Record requested from the United States Court of Appeals for the District of Columbia Circuit."},{"Date":"Mar 01 2024","Text":"Record received electronically from the United States Court of Appeals for the District of Columbia Circuit and available with the Clerk."},{"Date":"Mar 01 2024","Text":"Record received from the United States District Court for the District of Columbia. The record is electronic and is available on PACER."}],"AttorneyHeaderPetitioner":"Attorneys for Petitioner","Petitioner":[{"Attorney":"D. John Sauer","IsCounselofRecord":true,"Title":"James Otis Law Group, LLC","PrisonerId":null,"Phone":"314-562-0031","Address":"13321 North Outer Forty Road\r\nSuite 300","City":"St. Louis","State":"MO","Zip":"63017","Email":"John.Sauer@james-otis.com","PartyName":"Donald J. Trump"},{"Attorney":"D. John Sauer","IsCounselofRecord":true,"Title":"James Otis Law Group, LLC","PrisonerId":null,"Phone":"314-562-0031","Address":"13321 North Outer Forty Road\r\nSuite 300","City":"St. Louis","State":"MO","Zip":"63017","Email":"John.Sauer@james-otis.com","PartyName":"President Donald J. Trump"}],"AttorneyHeaderRespondent":"Attorneys for Respondent","Respondent":[{"Attorney":"Michael R. Dreeben","IsCounselofRecord":true,"Title":"Counselor to the Special Counsel","PrisonerId":null,"Phone":"202-305-9654","Address":"Department of Justice\r\n950 Pennsylvania Ave, NW","City":"Washington","State":"DC","Zip":"20530","Email":"SCO_JLS_SupremeCtBriefs@usdoj.gov","PartyName":"United States"}]}

Docket numbering

There are four docket types but all use a consistent numbering format: \d\d[-AMO]\d{1,5}. I have found four types of dockets based on their identifier symbol:

Petition (regular) dockets: '-'
Application dockets: 'A'
Motion dockets: 'M'
Original(?) dockets: 'O'

In the JSON presentations of dockets, the 'sJsonCaseNumber' field is zero-padded to the full five digits, e.g. for docket 23-2 that field contains '00002'. I believe this is the only use of zero padding in the docket numbers found on the SCOTUS site, probably to avoid lexical sorting in SCOTUS's docket database.

Petition (regular) dockets

Petitions for writ of certiorari are given these docket numbers, e.g. 23-175.

The SCOTUS Public Information Office offers some guidance here:

All cases receive a docket number upon filing in the Clerk's Office, ranging from 3 to 7 digits (e.g., 21–1, 21–2000).

The term In Forma Pauperis (IFP) describes permission given to an indigent to proceed without liability for Court fees or costs.* "Pauper" cases are always given up to a 7–digit number with the last digits up to the 10,000 or 11,000 series (e.g., 21–5661, 21-10269).

IFP cases seem to number consistently starting at <YY>-5001.
Docket numbering between 2000-5000 is not consistent; I don't have a sense of any pattern here.
Dockets 5001 and up appear most likely to have few docket entries, and seem most likely to be summarily denied writs of certiorari.

Application dockets

Applications to extend the time to file petitions for certiorari; or for stays pending disposition of petitions for certiorari; possibly other types of actions that I haven't found yet.

When successful, these cases become petition dockets with regular docket numbers. However, there does not appear to be any link from the A docket to the - docket; only in the other direction. Thus petition docket 23-232 has a reference back to the application 23A2 that preceded it,

{...,
Links: "Linked with 23A2",
...}

but there is no reference in 23A2 to 23-232.

Motion dockets

Movants often ask for leave to proceed with their applications as veterans. I'm not clear on what this status accords movants, but it seems to offer some distinction from Applications.

"Orig." dockets

I have found one case, 22O141 (Texas v. New Mexico and Colorado) that is listed in the Granted/Noted List as "Orig. 141". Maybe there are more? But I've included this identifier in the regex patterns.

ralexx commented 4 months ago

Progress update

Constrained by lack of a central index

The lack of a single source of truth about SCOTUS dockets looks like it will be the major determinant of how scraping can proceed.

Unfortunately, the simplest solution has been intentionally placed out of reach. SCOTUS's own docket activity email notification system is protected with the ominous-sounding BotDetectCaptcha scripts.

That leaves us needing two dimensions of information: where to look for docket information, and when/how often to look there.

When to look

If handling state will be up to the scrapers' caller, as @mlissner suggested above, I would like to leave this part for later, or for others.

Where to look

So far I have tried three approaches:

Brute-force queries of docket number ranges.
Extracting docket numbers from the Granted/Noted List and from Orders of the Court, both of which come as PDFs.
The Docket Search page, as pointed out by @grossir. It returns URLs to docket HTML pages; I then regex match the docket numbers in the URLs and scrape the corresponding JSON dockets.

My sense is that some combination of Docket Search query results and brute force will be closest to optimal in terms of server resource use, completeness, and timeliness.

Docket Search

Because the date strings for docket entries in the JSON presentation use a consistent format (strftime "%b %d, %Y"), search results on those date strings are generally fairly accurare (true positives / total).

Click for search accuracy table

| | date_string | spurious | all | accuracy | |---:|:-----------|-----------:|------:|-----------:| | 0 | 2024-03-01 | 130 | 165 | 0.212121 | | 1 | 2024-02-29 | 38 | 147 | 0.741497 | | 2 | 2024-02-28 | 28 | 127 | 0.779528 | | 3 | 2024-02-27 | 26 | 96 | 0.729167 | | 4 | 2024-02-26 | 31 | 203 | 0.847291 | | 5 | 2024-02-25 | 2 | 5 | 0.6 | | 6 | 2024-02-24 | 0 | 6 | 1 | | 7 | 2024-02-23 | 30 | 200 | 0.85 | | 8 | 2024-02-22 | 15 | 141 | 0.893617 | | 9 | 2024-02-21 | 16 | 130 | 0.876923 | | 10 | 2024-02-20 | 15 | 450 | 0.966667 | | 11 | 2024-02-19 | 6 | 19 | 0.684211 | | 12 | 2024-02-18 | 2 | 4 | 0.5 | | 13 | 2024-02-17 | 3 | 6 | 0.5 | | 14 | 2024-02-16 | 33 | 486 | 0.932099 | | 15 | 2024-02-15 | 18 | 171 | 0.894737 | | 16 | 2024-02-14 | 13 | 112 | 0.883929 | | 17 | 2024-02-13 | 12 | 84 | 0.857143 | | 18 | 2024-02-12 | 22 | 124 | 0.822581 | | 19 | 2024-02-11 | 8 | 12 | 0.333333 | | 20 | 2024-02-10 | 2 | 10 | 0.8 | | 21 | 2024-02-09 | 16 | 130 | 0.876923 | | 22 | 2024-02-08 | 12 | 173 | 0.930636 | | 23 | 2024-02-07 | 9 | 135 | 0.933333 | | 24 | 2024-02-06 | 11 | 83 | 0.86747 | | 25 | 2024-02-05 | 12 | 100 | 0.88 | | 26 | 2024-02-04 | 2 | 3 | 0.333333 | | 27 | 2024-02-03 | 2 | 7 | 0.714286 | | 28 | 2024-02-02 | 10 | 115 | 0.913043 | | 29 | 2024-02-01 | 9 | 176 | 0.948864 |

Here I classified as spurious any docket in the search results whose whose 'Last-Modified' header or whose last docket entry date was less than the date given by the search string.

Other benefits from using Docket Search:

Dockets are updated intraday, and I believe (more testing needed) that these updates are captured by searches in real time.
The docket numbers used in the URLs returned by the Search don't have text artifacts like PDFs do. I believe that regex matching on those URLs is faithfully capturing all docket numbers returned by each search.

PDF extraction

What @mlissner said. I used pymupdf and it seems to give decent results, but the text artifacts it returns (e.g. atypical Unicode dashes U+2010,...,U+2014) means a fair bit of trial and error on regex patterns just for docket numbers.

Combined with the idiosyncrasies of publication times for the various documents, I think sources like the Orders pages (as @flooie pointed out) can be useful for backscraping and making sure that higher-profile dockets have been updated. But they don't update frequently enough to be useful for real-time scraping.

Brute force

Assume -- in the absence of testing -- that Docket Search results are precise and false negatives (dockets updated but not reflected in the search results) are low. That would leave brute force searches largely for discovery of new dockets, particularly the A/M/O types.

The good support for 'If-Modified-Since' request headers on is encouraging. I've found the status code 304 behavior of not downloading assets speeds up both Docket Searches and downloads of known-good docket numbers. More of a "light touch" than brute force.

Upstream questions

I have the core of @mlissner's first action item,

Juriscraper to scrape content and make it digestible

and I'm working on making the code more robust to network and parsing errors.

One area where I need your input is with the database model(s) that @mlissner mentioned. I haven't tried to write a full docket parser since I don't know what the caller's data requirements will be. But I will need to turn to that.

Another question I have is about object APIs. Should I assume that the SCOTUS docket parser will called from the command line as python -m, such that object interfaces matter less than what's in argparse and main()? Or should I be trying to make objects conform to existing API patterns e.g. AppellateDocketReport?

mlissner commented 4 months ago

Thanks for all this work and detail!

A couple thoughts:

I think we should just assume we can sign up for email updates. I just wrote a note to SCOTUS asking for help doing this. If they help us, great. If not, let's bust captchas and sign up for email alerts.

There are APIs for busting captchas now, and I think we should put them on the table, if we need them. If SCOTUS doesn't reply to my message, I think an analysis of available captcha-busting APIs would be really useful as a first step here (maybe in a separate issue?).
If we can get email updates, does that help us learn about new cases or does that only help with cases we already know exist?
I love that you're optimizing things by using their search tool, but I feel like brute force is a more reliable method, no? Could we imagine an algorithm that explores the docket number space to identify which dockets exist? Then, once we know a docket exists, we could subscribe to it for email updates?
If we do the email update route, the architecture changes a bit. We'll want:
- An email address that can be used for this. (Maybe we use https://recap.email for this, and set up scotus@recap.email?)
- When that email comes into AWS, it'll send an HTTP post to our API, so we'll need an API endpoint to handle that. We might be able to use the API endpoint we currently use for recap.email, but it's probably better to set up a different one.
- Then, once the API is hit, we will need a parser for the email that can at least get the docket number, so we can respond scraping the website for the latest info. We could also try to scrape other details from the email, but it usually gets painful fast.
Does that sound right?

ralexx commented 4 months ago

Email notifications

There are APIs for busting captchas now, and I think we should put them on the table, if we need them.

I would not want to open up that can of worms myself. Leaving aside the poetic irony of doing that to the SCOTUS, you will want/need an obfuscation layer for IP addresses, user agent string, etc; and if their admins decide that 'scotus@recap.email' suddenly appearing on 40K-ish dockets is not consistent with human effort, you could be back to square one. That could devolve into Mutually Assured Whac-A-Mole.

Other approaches, including what I've described, are going to be sub-optimal but they can be complete. I can continue to work on those but I'm not going to mess with captcha defeats.

Docket search

I love that you're optimizing things by using their search tool, but I feel like brute force is a more reliable method, no?

Agreed. But after looking into it, I think @grossir was right to point out the search interface as a viable tool. In a nutshell, search interface for fast but possibly incomplete results, combined with brute force sequential downloads for slow but complete results.

As mentioned I think issues of search timing aren't blockers for scraping at this point. But if we take email notifications as the gold standard, there's already a modest lag between content arriving at supremecourt.gov and the notifications going out. See for example this notification I received:

"Amicus brief of United States..."

``` Return-path: <0100018e0bde5707-5b7c53ec-c4e8-42eb-85db-609620574298-000000@amazonses.com> Received: from [IPA] (helo=mailfront20.[server]) by delivery05.[server] with esmtp (Exim 4.86_2) id 1rhI2u-0006v5-EI for [me]@[server].com; [timestamp] Received: from exim by mailfront20.[server] with sa-scanned (Exim 4.93) id 1rhI2t-008ah8-RM for [me]@[server].com; [timestamp] [...snip...] envelope-from=0100018e0bde5707-5b7c53ec-c4e8-42eb-85db-609620574298-000000@amazonses.com; helo=a65-150.smtp-out.amazonses.com Received: from a65-150.smtp-out.amazonses.com by mailfront20.[server] with esmtps (TLS1.2:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_128_CBC__SHA1:128) (Exim 4.93) id 1rhI2k-008ae1-5C for [me]@[server].com; [timestamp] DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=sglce367a3eekz5cgo2jpvwdiom4ooya; d=sc-us.gov; t=1709596104; h=From:To:Subject:MIME-Version:Content-Type:Content-Transfer-Encoding:Message-ID:Date; bh=28APxLDdZGKYN8cytr+CtYFVI2hMZwtFrwWODzxt9Ug=; b=XKzxz0kZ9d6WlgHjyJLkc7DBW9V5daLi8yTCiItcWh28DrC1ywbbTAmWpLqQ56pV 8EOOk2itFhQujXkdijeMql/eM3GUQRSjnDJtfhKluWmoW2xupKRRbYq7/mw7otNSHg0 IknCxfbKCXI8ENOchtz7DiDnana1QvPbligBLP24= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/simple; s=224i4yxa5dv7c2xz3womw6peuasteono; d=amazonses.com; t=1709596104; h=From:To:Subject:MIME-Version:Content-Type:Content-Transfer-Encoding:Message-ID:Date:Feedback-ID; bh=28APxLDdZGKYN8cytr+CtYFVI2hMZwtFrwWODzxt9Ug=; b=SJWIgiOt/V8jsRjxpv50sS9YGlkCZTLjUlkAZxnb0S+mi9XmvFDjSl72VTekjf45 pcc8t/PCC27u847OPbrF/EmNlkQysSxbMTjOl8q/s5DJF9Jslhm80PrOHGR/uv4AYd6 oV8S2BGXitN/SrkbmdpTopTq1GDicECgd8pyqyfY= From: no-reply@sc-us.gov To: [me]@[server].com Subject: Supreme Court Electronic Filing System MIME-Version: 1.0 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 7bit Message-ID: <0100018e0bde5707-5b7c53ec-c4e8-42eb-85db-609620574298-000000@email.amazonses.com> Date: Mon, 4 Mar 2024 23:48:24 +0000 Feedback-ID: 1.us-east-1.+5GeZMB3eXeyv3WY8brP46tghxJpXFIF9yDDvLuTQrk=:AmazonSES X-SES-Outgoing: 2024.03.04-54.240.65.150 A new docket entry, "Amicus brief of United States submitted." has been added for City of Grants Pass, Oregon, Petitioner v. Gloria Johnson, et al., on Behalf of Themselves and All Others Similarly Situated. You have been signed up to receive email notifications for No. 23-175.

If you no longer wish to receive email notifications on this case, please click here. ```

Look at the notification lag by comparing the email time stamp header (Date: Mon, 4 Mar 2024 23:48:24 +0000) with the JSON docket header (last-modified: Mon, 04 Mar 2024 23:42:46 GMT), and lastly with the time stamp portion of the docket filing URL http://www.supremecourt.gov/DocketPDF/23/23-175/302264/20240304183726571_23-175npUnitedStates.pdf (i.e. 2024-03-04-T18:37:26, which should be EST so 23:37 GMT). Around five minutes between the docket item hitting the web site and the notification email arriving.

Compare that with scraping results from the docket search interface. I just ran a test using the string 'Feb 3, 2024', including sending the 'If-Modified-Since' filter for 2024-03-04T00:00:00, and I was able to download the 174 valid dockets (i.e. dockets actually updated on/after 2024-03-04) out of 198 search results in one minute and 37 seconds. Multi-threaded or asyncio performance would obviously be even better although I believe httpx integration is a separate issue in your project. Running this job every, say, five minutes wouldn't be a huge tradeoff to avoid messing with captchas, I hope.

Could we imagine an algorithm that explores the docket number space to identify which dockets exist? Then, once we know a docket exists, we could subscribe to it for email updates?

This is what the petition (i.e. regular) docket number search space looks like after a few days' worth of brute force attempts (X axis is SCOTUS term, Y axis is the sequential integer portion of the docket number):

There is just a single Y axis in the graph; what appears to be the bottom series is actually the "pauper" case numbers that seem to begin strictly at 5000, as described in my earlier note. For years 2018+ the numbering is not perfectly contiguous but close.

Unfortunately, that 'close' is too sparse for e.g. bisection or ternary searches; I tried. Instead for docket number discovery I have been doing sequential searches on the 1-2500 and 5000+ ranges, excluding known good docket numbers, and limiting the number of "Not Found" page results before the search quits. Slow, but effective. And embarrassingly parallel. It just requires known good docket numbers as state.

mlissner commented 4 months ago

This is great, thank you. Two questions come to mind if we don't want to do the Capcha thing and the court doesn't get back to us (I just emailed again).

How often would it be practical to update the dockets if we have to crawl them all and we do it in parallel?
Am I gathering that doing the brute force approach without doing search is viable? It seems more reliable to me.
Is there a way to know that we can stop searching a particular case? Something to indicate that it's completed?

Sorry for the naive questions. I'm really leaning on your help, but I really appreciate the research you're doing!

flooie commented 4 months ago

I mean - we could just manually subscribe to each one - or all the important cases? its not ideal but considering the source would be appropriate I think.

mlissner commented 4 months ago

You mean with a human instead of a captcha buster?

flooie commented 4 months ago

I am

mlissner commented 4 months ago

I think I'd rather automate it, even if that involves busting captchas or scanning their website every few minutes for updates. Let's see what @ralexx thinks about the scanning idea (they said it was "embarrassingly parallel," which is promising), but my general thought is we shouldn't set ourselves up to have to do things as humans, because that scales poorly and we're bad at it.

ralexx commented 4 months ago

@mlissner you reminded me I didn't have my pronouns set on my profile; thanks. He/him.

How often would it be practical to update the dockets if we have to crawl them all and we do it in parallel?

I can see three independent scraping routines with different speeds:

Objective	Routine	Uses 'If-Modified-Since'	Relative speed
Docket discovery: find cases newly docketed since last run	Query URLs in sequential order of docket numbers	No	Slow
Reliable sources update: Find dockets guaranteed to have been updated	Scrape docket numbers from Docket Search, Orders, etc. then update those dockets	Yes	Faster
Recent update check: find dockets modified since last run	Similar to docket discovery but only operate on known docket numbers and use header filtering	Yes	Fastest

Docket discovery with no header filtering [edit: has been as bad as] 2-3 seconds per request for me, of which only 1s was me rate-limiting so they don't ban my IPA. Also, the 1.7 kBps download speeds I'm seeing on those requests are unlikely to be slow from my end, hence my comment about the possible gains from parallelization.

For reliable sources updates that do use header filtering, my guess is <= 10 minutes. Since yesterday I have been re-crawling dockets to save their 'Last-Modified' headers. When that's done I can estimate the number of dockets being updated on a given day. So far I think the largest number of dockets I have seen updated on the Orders List is ~400, plus there would also be some other dockets that had non-dispositive entries. I'm not fully parsing the dockets yet but that should be fast. So unless you're looking at scraping >1000 dockets per run, I think sub-10 minutes should be feasible.

The use of header filtering on recent update checks makes it subjectively faster, as does avoiding the delay resulting from being served the 'Not Found' page on unused docket numbers. I can also see further optimization from, say, prioritizing update requests by descending order of docket count, so that more active dockets are checked first in a given run.

Am I gathering that doing the brute force approach without doing search is viable? It seems more reliable to me.

I do think brute force is viable by itself. Not only that, I think it's a requirement for discovering when new dockets have been created, because of its reliability as you say.

The impact of brute force docket discovery can also be broken into smaller chunks. Once we have a sense of how many dockets might be populated on a given day, we know roughly how far ahead to run docket discovery given the sequential numbering patterns and knowledge of existing dockets. And once per day, say, you can run docket discovery on gaps in the numbering just as a precaution.

Is there a way to know that we can stop searching a particular case? Something to indicate that it's completed?

By scraping docket entries and finding a case disposition (e.g. from 22-175, {"Date":"Dec 12 2022","Text":"Petition DENIED."}). There does not appear to be any data item in dockets to indicate case status.

ralexx commented 4 months ago

A few data points on download speed using brute force.

With a no-frills ThreadPoolExecutor and five workers I can scrape (but not write to disk) about 6 dockets per second in docket discovery. And that's including a loop of three download retries when a response is missing the 'Last-Modified' header. This network traffic graph shows the same 250-docket search space being scraped; the first two bursts of activity use the thread pool, the third one is single-threaded.

mlissner commented 4 months ago

Nice. That sounds fast enough to be able to be as aggressive as we need to be.

ralexx commented 4 months ago

Questions about docket entries:

Do you want to scrape PDFs?
If so, does that include filing boilerplate such as the 'Certificate of Word Count' and 'Proof of Service' as in the example below?

[
    {
        "Description": "Petition",
        "File": "Petition Trevino v. Palmer.pdf",
        "DocumentUrl": "http://www.supremecourt.gov/DocketPDF/23/23-484/288797/20231103144956195_Petition%20Trevino%20v.%20Palmer.pdf"
    },
    {
        "Description": "Appendix",
        "File": "Appendix Trevino v. Palmer.pdf",
        "DocumentUrl": "http://www.supremecourt.gov/DocketPDF/23/23-484/288797/20231103145014794_Appendix%20Trevino%20v.%20Palmer.pdf"
    },
    {
        "Description": "Certificate of Word Count",
        "File": "Word Count Trevino v. Palmer.pdf",
        "DocumentUrl": "http://www.supremecourt.gov/DocketPDF/23/23-484/288797/20231103145051994_Word%20Count%20Trevino%20v.%20Palmer.pdf"
    },
    {
        "Description": "Proof of Service",
        "File": "Service Trevino v. Palmer.pdf",
        "DocumentUrl": "http://www.supremecourt.gov/DocketPDF/23/23-484/288797/20231103145104461_Service%20Trevino%20v.%20Palmer.pdf"
    }
]

mlissner commented 4 months ago

Do you want to scrape PDFs?

Emphatically yes!

If so, does that include

Yes. We want to have everything that's possible on the docket. You just never know when some scholar is going to get interested in a topic and it's usually easier to just get everything anyway!

ralexx commented 4 months ago

Two areas that I need guidance from others, please:

Data model for parsed docket output

@mlissner you asked for

Some database models, probably new ones, to store what we scrape.

When those data models are available, I need mappings from the SCOTUS docket back to the models' data items so I can finish the docket parser.

For reference, I merged the dict keys from 39,464 dockets that I've scraped so far; see below.

Merged dict keys from dockets

``` {'CaseNumber': None, 'bCapitalCase': None, 'RelatedCaseNumber': [], 'PetitionerTitle': None, 'RespondentTitle': None, 'DocketedDate': None, 'Links': None, 'LowerCourt': None, 'LowerCourtCaseNumbers': None, 'LowerCourtDecision': None, 'LowerCourtRehearingDenied': None, 'QPLink': None, 'ProceedingsandOrder': [{'Date': None, 'Text': None, 'Links': [{'Description': None, 'File': None, 'DocumentUrl': None}]}, {'Date': None, 'Text': None}], 'AttorneyHeaderPetitioner': None, 'Petitioner': [{'Attorney': None, 'IsCounselofRecord': None, 'Title': None, 'PrisonerId': None, 'Phone': None, 'Address': None, 'City': None, 'State': None, 'Zip': None, 'Email': None, 'PartyName': None}], 'AttorneyHeaderRespondent': None, 'Respondent': [{'Attorney': None, 'IsCounselofRecord': None, 'Title': None, 'PrisonerId': None, 'Phone': None, 'Address': None, 'City': None, 'State': None, 'Zip': None, 'Email': None, 'PartyName': None}], 'AttorneyHeaderOther': None, 'Other': [{'Attorney': None, 'IsCounselofRecord': None, 'Title': None, 'PrisonerId': None, 'Phone': None, 'Address': None, 'City': None, 'State': None, 'Zip': None, 'Email': None, 'PartyName': None}, {'Attorney': None, 'IsCounselofRecord': None, 'Title': None, 'PrisonerId': None, 'Phone': None, 'Address': None, 'City': None, 'State': None, 'Zip': None, 'Email': None, 'PartyName': None}], 'DiscretionaryCourtDecision': None, 'sJsonCreationDate': None, 'sJsonTerm': None, 'sJsonCaseNumber': None, 'sJsonCaseType': None, 'DiscretionaryCourtRehearing': None} ```

Parser object inheritance or conformance

SCOTUS dockets are sufficiently different from Juriscraper's other sources that I have used pacer.appellate_docket.AppellateDocketReport as a guide but not as a superclass. Do you prefer that I inherit from your existing objects, and if so, which one(s)? If not inheritance, is there an existing caller for the parser to conform to?

mlissner commented 4 months ago

I have used pacer.appellate_docket.AppellateDocketReport as a guide but not as a superclass.

Yeah, that's how I would do it too.

I need mappings from the SCOTUS docket back to the models' data items so I can finish the docket parser.

I think it might be good to see some code. The models are often pretty tricky, but maybe if we had your PR for scraping, we could use that to build the models. I'm also worried we've had so much conversation without a code check-in, which can sometimes be a mistake.

Do you want to try to get your Juriscraper PR at least drafted, and then we can go from there?

ralexx commented 4 months ago

OK

mlissner commented 3 months ago

I heard back from SCOTUS, finally. They don't want to talk, so it's up to us to figure out how we want to approach things (no surprise here, I guess).

freelawproject / juriscraper

Scrape United States Supreme Court (SCOTUS) #197

What DB tables are we going to fill with this data?

How are users interacting with this data?

About folder structure

About getting the docket list...

... from `Docket Search` page

... from PDFs

Goals

Architecture

Putting this all together...

Published docket information

Docket pages

Email notification

HTML pages

JSON pages

Example: docket 23-939

Docket numbering

Petition (regular) dockets

Application dockets

Motion dockets

"Orig." dockets

Progress update

Constrained by lack of a central index

When to look

Where to look

Docket Search

PDF extraction

Brute force

Upstream questions

Email notifications

Docket search

Data model for parsed docket output

Parser object inheritance or conformance

freelawproject / juriscraper

Scrape United States Supreme Court (SCOTUS) #197

What DB tables are we going to fill with this data?

How are users interacting with this data?

About folder structure

About getting the docket list...

... from Docket Search page

... from PDFs

Goals

Architecture

Putting this all together...

Published docket information

Docket pages

Email notification

HTML pages

JSON pages

Example: docket 23-939

Docket numbering

Petition (regular) dockets

Application dockets

Motion dockets

"Orig." dockets

Progress update

Constrained by lack of a central index

When to look

Where to look

Docket Search

PDF extraction

Brute force

Upstream questions

Email notifications

Docket search

Data model for parsed docket output

Parser object inheritance or conformance

... from `Docket Search` page