freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Bulk Data shell script not working #4222

Open khanken opened 3 months ago

khanken commented 3 months ago

load-bulk-data-2024-05-07.sh is not working for me:

  1. schema-2024-05-07.sql defines 80 tables, but there are far fewer CSV files than tables. Many tables have no corresponding CSV file to load their data from.
  2. load-bulk-data-2024-05-07.sh tries to load some CSV files into tables that do not exist in schema-2024-05-07.sql. For example, ERROR: relation "public.disclosures_financialdisclosure" does not exist.
  3. The shell script often loads a table with a foreign key before the table it references has been loaded. The load order in the script needs to be fixed.
  4. The file "people-db-races-2024-05-06.csv.bz2" is empty. Since it appears to be a lookup table, I tried to get the data from models.py. But the primary key in the table is an integer, while the definition in models.py is a char. Could we please upload the correct file to S3?

Thank you so much for all you have done! I really appreciate it!

mlissner commented 3 months ago

Thanks for reporting this. We don't export all the tables, but we do figure it's useful to have a fairly complete schema. The ordering is definitely an issue. If that's something you're game to fix, we'd welcome that.

There's a PR from a few minutes ago that may have some of these fixes too: #4223.

I think it fixes the missing race table, and the missing schema files. The author mentioned the issue with the foreign keys being out of order, but I don't think their PR has the fix for that yet.

hopperj commented 3 months ago

@khanken my MR should fix your first 3 issues, although I don't think it will help with the 4th. From what I can tell the load-bulk-data-2024-05-07.sh script does tables in order of how they are defined in the array, so I have ordered them in a way that shouldn't trigger any FK errors when the load-bulk-data-2024-05-07.sh script is run.
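
The ordering described above amounts to a topological sort of the tables by their foreign-key references. Here is a minimal sketch in Python, assuming a hand-built map from each table to the tables it references. The table names and the `load_order` helper are hypothetical and are not taken from the schema or from the script:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical FK map: each table -> the tables its foreign keys reference.
FK_DEPS = {
    "courts": set(),
    "dockets": {"courts"},
    "clusters": {"dockets"},
    "opinions": {"clusters"},
    "people": {"people"},  # a table that references itself
}


def load_order(fk_deps):
    """Return the tables ordered so that referenced tables come first."""
    # A self-reference does not constrain the order between tables, and
    # TopologicalSorter would report it as a cycle, so drop it here.
    graph = {table: set(deps) - {table} for table, deps in fk_deps.items()}
    # static_order() yields every node after all of its predecessors and
    # raises graphlib.CycleError if tables reference each other in a loop.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(load_order(FK_DEPS))
```

The resulting list could then be used as the order of the table array in the load script. A genuine cycle between two tables would need separate handling, for example by loading with the constraint deferred or added afterwards.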

khanken commented 3 months ago

Thank you so much for the quick response! It is understandable you do not provide all the data.

I will be more than happy to fix the shell script once I have all the schemas required to load the data. I am stuck loading the search_sockets table. I have restarted the process, with no success so far.

I was wondering if I needed a better computer for large tables. So I asked a friend to spin up a VM with RHEL 9 on his server yesterday, and I am planning to install postgres on his server and give it a try.

But I think he is tied up by the CrowdStrike outage. He texted me at 7 AM this morning that he "had a busy day". I am guessing he is busy putting out fires? I don't know when I can get a new server to run this.

It is not a difficult fix at all. I could have fixed it by looking at the schemas. But you know how it goes with scripts. I would like to run everything successfully before committing and sharing the fix.

I am waiting on the hardware right now. Meanwhile I have to move on to other parts of my project. I will share the fix as soon as I can get access to a better server for my database.

Thanks!

Kelly

khanken commented 3 months ago

Sorry, I am not able to access my personal computer during the day. I can only check emails. I did not see this email. Thank you so much for the fix! And I want you to know that this project and all of your work are much appreciated!

P.S. What are the minimum hardware requirements for running this database?

mlissner commented 3 months ago

P.S. What are the minimum hardware requirements for running this database?

I think it's around 500GB, but honestly, we have lots of other stuff in our DB, so it's hard to say. It takes a big machine though.

khanken commented 3 months ago

I have been having issues loading large files. I am wondering if we could chunk the data to under 2 GB per file when exporting? It is not easy to chunk the CSV files after the fact, because a row might end up split across two files.

Or any suggestions on loading large files?

mlissner commented 3 months ago

You can chunk on your side, if that's helpful. I think we'd prefer it that way.
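
For chunking on the client side, the main risk raised above is cutting a row in half when a quoted field contains a newline. One possible approach is to track whether the current line ends inside a quoted field and only close a chunk at a record boundary. The sketch below is a rough illustration, not part of the project's tooling. The `split_csv` helper and the chunk file names are hypothetical, and it assumes the header fits on one line and that quotes inside fields are escaped by doubling (`""`). If the export uses a different escape character, the quote counting would need to be adapted.

```python
import bz2
from pathlib import Path


def split_csv(src_path, out_dir, max_bytes=2 * 1024**3):
    """Split a .csv.bz2 file into uncompressed CSV chunks of roughly max_bytes.

    Rows are copied byte for byte, a chunk is only closed at the end of a
    complete record, and the header line is repeated in every chunk.
    Returns the list of chunk paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    chunk = None
    written = 0
    in_quotes = False  # True while a quoted field spans a line break
    with bz2.open(src_path, "rb") as src:
        header = src.readline()  # assumption: the header is a single line
        for line in src:
            if chunk is None:
                path = out_dir / f"chunk-{len(paths):04d}.csv"
                paths.append(path)
                chunk = open(path, "wb")
                chunk.write(header)
                written = len(header)
            chunk.write(line)
            written += len(line)
            # An odd number of quote characters on a line toggles the
            # "inside a quoted field" state; doubled quotes cancel out.
            if line.count(b'"') % 2 == 1:
                in_quotes = not in_quotes
            if not in_quotes and written >= max_bytes:
                chunk.close()
                chunk = None
    if chunk is not None:
        chunk.close()
    return paths
```

Each chunk is closed after the record that crosses the limit, so a chunk can exceed `max_bytes` by up to one record. Passing a limit somewhat below 2 GB leaves a margin. Because every chunk carries the header, each one can be loaded with the same header-skipping options as the original file.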