freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

For a Bulk Download, can you convert CSV to Apache Parquet format? #2702

Open sungkim11 opened 1 year ago

sungkim11 commented 1 year ago

For a Bulk Download, can you convert CSV to Apache Parquet format? Parquet takes up less space and is much easier to work with. You could also partition the files by court_id for even more convenience.

mlissner commented 1 year ago

We're open to it. More votes here would help us prioritize it, and I'm not opposed to a link to this issue in the bulk data documentation, soliciting feedback.

We've segmented by court_id in the past, and it made the process a lot harder (now you're making how many bulk data files??), but it's something we'd consider again.

If you're interested in implementing the change, that could work too.

sungkim11 commented 1 year ago

I have already converted both the March and April bulk files to Parquet format using Snappy compression. Where should I upload these files?

Also, it would be very helpful if you could provide a SQL script to generate an Opinions view by court_id. I must be doing something wrong, since my count is off.

mlissner commented 1 year ago

We'd need this to be part of our monthly file generation, but if you wanted to post them on your website or whatever, by all means go for it.

Not sure what's up with your count, but we'd probably need to use a combination of Python and SQL to do an export per court. I imagine some sort of loop would do it...
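
The "combination of Python and SQL" loop could look roughly like the sketch below. It uses an in-process SQLite connection and simplified, hypothetical table names (`opinion`, `cluster`, `docket`); the real database and schema will differ. The assumption illustrated here is that an opinion reaches its court only through a cluster and a docket, so a per-court count has to go through both joins.

```python
import csv
import sqlite3
from pathlib import Path


def export_opinions_per_court(conn: sqlite3.Connection, out_dir: Path) -> dict:
    """Write one CSV of opinions per court and return the row count per court.

    Table and column names are hypothetical stand-ins for illustration.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Collect the distinct courts first, then run one query per court.
    court_ids = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT court_id FROM docket ORDER BY court_id"
        )
    ]
    counts = {}
    for court_id in court_ids:
        rows = conn.execute(
            """
            SELECT o.id, o.cluster_id, d.court_id
            FROM opinion AS o
            JOIN cluster AS c ON o.cluster_id = c.id
            JOIN docket AS d ON c.docket_id = d.id
            WHERE d.court_id = ?
            ORDER BY o.id
            """,
            (court_id,),
        ).fetchall()
        # One output file per court, named after the court identifier.
        with open(out_dir / f"opinions_{court_id}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["opinion_id", "cluster_id", "court_id"])
            writer.writerows(rows)
        counts[court_id] = len(rows)
    return counts
```

Summing the returned counts and comparing the total against a plain `SELECT COUNT(*)` on the opinions table is one way to check whether any rows are being dropped or duplicated by the joins.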