Bulk Data Loading - Githubissues

jasonzoladz commented 6 months ago

I am attempting to load bulk data files into postgres using the load-bulk-data-2024-03-12.sh script. (Note: I've ensured that the corresponding schema and csv files for 2024-03-11 are downloaded and available to postgres.)

The load-bulk-data-2024-03-12.sh script runs fine until it gets to dockets-2024-03-11.csv -- the first large csv file. During the copy of dockets-2024-03-11.csv I get errors like:

Loading dockets-2024-03-11.csv to database

ERROR:  index "search_dock_court_i_a043ae_idx" contains corrupted page at block 40449
HINT:  Please REINDEX it.
CONTEXT:  COPY search_docket, line 8109127

or

ERROR:  index "search_docket_c69e55a4" contains corrupted page at block 8303
HINT:  Please REINDEX it.
CONTEXT:  COPY search_docket, line 2110452

Is this a known issue? If so, will this be fixed and when will the next bulk data generation be performed? If not, any ideas on how I might fix this?

In the meantime, is there an (outdated) archive of a previous JSON dump (#1983) sitting in an S3 bucket somewhere? (Frankly, that might be the most useful thing to me because I'm going to need to pull the opinions from the database and process them anyway.)

(Aside: I noticed that the generation schedule contemplates that "bulk data files are regenerated on the last day of every month" yet the bulk data does not reflect that frequency.)

Use case:

I’m a lawyer and I am exploring creating text embeddings for the opinions to improve retrieval of relevant case law -- combining Approximate Nearest Neighbor search with pre-or-post-filtering -- for use as part of a RAG pipeline. (I've seen some good initial results using a subset of the Harvard CAP static files.)

Once those embeddings are generated, I am happy to contribute the embeddings (gte-large-en-v1.5) to Free Law if you might find them useful. Further, if my project ever becomes commercialized, I'd love to explore the possibility of contracting with Court Listener to obtain the daily update stream.

mlissner commented 6 months ago

Hi @jasonzoladz. I haven't seen that issue with Postgresql indexes before, but we did recently create some new and fancy indexes over in https://github.com/freelawproject/courtlistener/issues/3543. That was back in December though, so I sort of hope that wouldn't be affecting you.

Did you try Googling this?

In the meantime, is there an (outdated) archive of a previous JSON dump (https://github.com/freelawproject/courtlistener/issues/1983) sitting in an S3 bucket somewhere?

There might be, I really don't remember, but the reason we stopped generating that was because it was taking too much horsepower to do on a regular basis. Sorry!

(Aside: I noticed that the generation schedule contemplates that "bulk data files are regenerated on the last day of every month" yet the bulk data does not reflect that frequency.)

Hm, the script ran, but it looks like it must have crashed. That's a bummer. I can try to run it again manually.

Once those embeddings are generated, I am happy to contribute the embeddings (gte-large-en-v1.5) to Free Law if you might find them useful.

I think we would, yes! We have a plan to make a vector search engine, and I imagine such embeddings would be an important part of that: freelawproject/foresight#8

Further, if my project ever becomes commercialized, I'd love to explore the possibility of contracting with Court Listener to obtain the daily update stream.

That's great! Sounds like we should talk. Want to grab a spot on my calendar and we can go over how that works? https://calendly.com/flp-mike/

mlissner commented 6 months ago

The manual dump of the data is underway. I'll try to keep an eye on it. Sorry it didn't work automatically. I'm not sure what's up with that, but I suspect when I run it manually I'll find out.

jasonzoladz commented 6 months ago

@mlissner, thanks so much for the prompt reply.

Did you try Googling this?

I did (and spent six hours trying to get it to work). However, I'm no postgres wizard.

There might be, I really don't remember, but the reason we stopped generating that was because it was taking too much horsepower to do on a regular basis. Sorry!

If there's a public S3 bucket with an old set of JSON files, I'd love to make a replica.

We have a plan to make a vector search engine, and I imagine such embeddings would be an important part of that[.]

My plan is to (at least initially) store the embeddings in a new table of a local copy of the courtlistener database; see pgvector. Once I do so, I will let you know. (I believe I have selected the embeddings with the best cost-to-retrieval-quality ratio -- gte-large-en-v1.5 -- but I want to evaluate a few more. This article discusses how costs vary substantially based on the choice of embedding model and compute, especially at the scale of CL.) I see that CL uses elastic search and I know they have an offering.

That's great! Sounds like we should talk. Want to grab a spot on my calendar and we can go over how that works?

I won't take up your face time until I'm much closer to something real. I hope (and plan) that will be sooner than later.

The manual dump of the data is underway. I'll try to keep an eye on it. Sorry it didn't work automatically. I'm not sure what's up with that, but I suspect when I run it manually I'll find out.

Please let me know how this goes.

mlissner commented 5 months ago

I did (and spent six hours trying to get it to work). However, I'm no postgres wizard.

So, one fix is to just drop the index. That should get you moving forward again, and if you need it again, you can create it once the data is ingested. One thing though: The first link I read about this issue suggests that this might be a disk corruption issue. If that's right, you might have bigger problems (but the internet is often full of bad advice!).

Please let me know how this goes.

The manual dump is complete!

mlissner commented 1 month ago

Seems like we're all good here?

jasonzoladz commented 1 month ago

Yep. Thanks Mike! Feel free to close.

freelawproject / courtlistener

Bulk Data Loading #4024