freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Map state oral arguments sources to download #1047

Open grossir opened 1 week ago

grossir commented 1 week ago

Ordered by population:

The Commonwealth Court’s Internal Operating Procedures and the Pennsylvania Rules of Judicial Administration prohibit recording of oral arguments conducted and livestreamed by advanced video communication technology. See Section 502 of the Internal Operating Procedures of the Commonwealth Court, 210 Pa. Code § 69.502 (permitting only the recording by the Pennsylvania Cable Network (PCN) of en banc proceedings for future broadcast); Pennsylvania Rule of Judicial Administration 1910, Pa.R.J.A. 1910 (relating to broadcasting, recording and photography in the courtroom). See generally Section 124 of the Internal Operating Procedures of the Commonwealth Court, 210 Pa. Code § 69.124 (relating to video or teleconference proceedings). Violation of this directive may result in the imposition of sanctions.

Following the mention of the "Pennsylvania Cable Network", I did find a courts section on that website with videos of oral arguments, but I can't find the case data needed to link the audio properly.

mlissner commented 1 week ago

Looking good, @grossir! Now I have the next hard question: How many hours or files, approximately, on each — or put another way, where do we start? The other question is what do we do about video? We could probably start storing it, but we'd want to optimize/normalize the file types, and price out the storage costs, since they might start to matter....

Looks like this will be a big project.

grossir commented 1 week ago

I will try to calculate the seconds available where possible, but I think the number of files is a decent proxy. Most sites do not list any oral argument statistics, so I would basically have to implement a scraper just to get the numbers.
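For what it's worth, where sample audio files can be downloaded, the "seconds available" estimate could be something as simple as summing durations with mutagen. This is only a sketch; the directory name is a placeholder:

```python
# Rough sketch: estimate total seconds of audio across sampled files.
# Assumes the files were already fetched locally; the directory name
# below is hypothetical.
from pathlib import Path

from mutagen import File as MutagenFile  # pip install mutagen


def total_audio_seconds(directory: str) -> float:
    """Sum the duration of every audio file mutagen can parse."""
    total = 0.0
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        try:
            audio = MutagenFile(path)
        except Exception:
            continue  # skip files mutagen cannot read
        if audio is not None and audio.info is not None:
            total += audio.info.length
    return total


if __name__ == "__main__":
    seconds = total_audio_seconds("oral_argument_samples/")
    print(f"{seconds / 3600:.1f} hours across sampled files")
```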

I think the best way to start is to implement the sources that match our current model (audio files with case metadata) and need the least effort: for those we only have to implement the scraper / backscraper.

Texas' courts tex and texapp hold a lot of data. Then va, tenn, ind and indtc; and, a little trickier, nj, among the courts I have mapped so far.
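To make the "least effort" point concrete, here is a minimal sketch of what one of these scrapers could look like. I'm assuming the linear oral-argument site pattern used by newer juriscraper scrapers; the base class name, URL, XPath selectors and field keys below are assumptions for illustration, not the real markup of any of these courts:

```python
# Sketch of an oral-argument scraper that fits the current model
# (audio file + case metadata). URL and selectors are hypothetical.
from juriscraper.OralArgumentSiteLinear import OralArgumentSiteLinear


class Site(OralArgumentSiteLinear):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.court_id = self.__module__
        self.url = "https://example-court.example/oral-arguments"  # placeholder

    def _process_html(self):
        # Each row is assumed to expose a case name, docket number, date,
        # and a direct link to the audio file.
        for row in self.html.xpath("//table[@id='arguments']//tr[td]"):
            self.cases.append({
                "name": row.xpath("string(td[1])").strip(),
                "docket": row.xpath("string(td[2])").strip(),
                "date": row.xpath("string(td[3])").strip(),
                "url": row.xpath("string(td[4]/a/@href)").strip(),
            })
```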

Including video would take us more time: we would have to make model and doctor changes, change the frontend to watch the videos, and work out storage costs. Do we want that, anyway? Why not extract the audio from the video? Related: #44
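If we did go the extract-audio route, the processing step could be as simple as shelling out to ffmpeg after downloading the video. A sketch, with hypothetical file names:

```python
# Sketch: strip the audio track from a downloaded oral argument video
# using ffmpeg (must be installed on the host). File names are hypothetical.
import subprocess


def extract_audio(video_path: str, audio_path: str) -> None:
    """Re-encode the audio track of video_path into an mp3 at audio_path."""
    subprocess.run(
        [
            "ffmpeg",
            "-i", video_path,       # input video
            "-vn",                  # drop the video stream
            "-acodec", "libmp3lame",
            "-q:a", "4",            # VBR quality; smaller files than CBR
            audio_path,
        ],
        check=True,
    )


extract_audio("argument_2024_001.mp4", "argument_2024_001.mp3")
```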

After scraping the courts that self-host their audio, I think we should work on the ones that upload to YouTube, since they all share a similar scraping / processing step. Luckily, in this step we would scrape some of the big courts, like ny and fla.
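That shared step could lean on yt-dlp's Python API to enumerate a court's channel and pull audio-only streams. A sketch, where the channel URL and options are illustrative placeholders:

```python
# Sketch: list a court's YouTube uploads with yt-dlp (pip install yt-dlp).
# The channel URL is a placeholder; titles would still need to be matched
# against case metadata from the court's own site.
import yt_dlp

ydl_opts = {
    "format": "bestaudio/best",     # prefer an audio-only stream when downloading
    "extract_flat": "in_playlist",  # just list entries here, don't download yet
    "quiet": True,
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    playlist = ydl.extract_info(
        "https://www.youtube.com/@SomeStateCourt/videos",  # placeholder
        download=False,
    )
    for entry in playlist.get("entries", []):
        print(entry["id"], entry.get("title"))
```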

Finally, the ones that "self" host their videos, or that use a provider like Granicus (cal is one of those).

mlissner commented 1 week ago

That all sounds good. Start with the easy stuff and then move to the trickier stuff.

I'm not sure what we should do about video. Long term, probably the right thing is to extract audio from it, and to also host the video, so API users can choose if they want audio or video.

Hosting video is going to be expensive and complex, so maybe step one is just to scrape and store video with a cheap storage class, and step two will be to actually figure out how to serve it.
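If we use S3 for step one, the cheap storage class can be set per object at upload time. A sketch assuming boto3, with hypothetical bucket and key names:

```python
# Sketch: upload a scraped video to S3 under a cold storage class so it is
# cheap to hold until we decide how to serve it. Names are hypothetical.
import boto3

s3 = boto3.client("s3")


def archive_video(local_path: str, bucket: str, key: str) -> None:
    """Upload local_path to s3://bucket/key using a low-cost storage class."""
    s3.upload_file(
        local_path,
        bucket,
        key,
        ExtraArgs={"StorageClass": "GLACIER_IR"},  # instant retrieval, low cost
    )


archive_video(
    "argument_2024_001.mp4",
    "oral-argument-videos",            # placeholder bucket
    "tex/2024/argument_2024_001.mp4",  # placeholder key
)
```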

But for now, yes, let's finish the survey, and when we're ready, we can start with scraping audio, then do video in a second phase.