freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com
Other
529 stars 144 forks source link

Order of bulk data csv fields makes it difficult (impossible?) to import #4241

Closed mlissner closed 2 weeks ago

mlissner commented 1 month ago

Lots of people are using the bulk data, so I'm not sure how this hasn't been fixed, but one of our users is reporting that our CSVs cannot be imported using the schema we provide because the columns in the CSVs are in a different order than the schema.

I think we should fix that.

I imagine a related issue will be that the schema has all of our columns while the CSV has a subset.

I'm not sure the fix for this. Maybe we need to provide better schemas? Maybe we need to re-order the CSV fields properly?

The error they get is:

 Conversion Error: CSV Error on Line: 2

Original Line: "10838944","2019-01-21 09:17:15.707272+00","2022-02-04 21:50:01.416445+00","9","","Susan Illston","","","2021-01-21 17:13:19.363539+00",,,,,,"1998-09-03","1999-09-14","1999-09-14","","Advent Software Inc. v. Stratum Business","","advent-software-inc-v-stratum-business","3:98-cv-03398","9803398","119636","","840 Trademark","","Federal question","","","","","","",,"t","2021-01-21 17:13:19.309558+00","0",,"f",,"1588","cand","19579836",,

Error when converting column "source". Could not convert string "9" to 'DATE'
rbpasker commented 1 month ago

I reported this.

since I couldn't reorder the 6MM columns in the CSV file, I reordered the ordered of the columns in CREATE TABLE to match the order of the columns in the CSV (see below)

the alternative solution would be to re-order the columns in the script that creates the bulk data files to match the order in the DDL: https://github.com/freelawproject/courtlistener/blob/8c7f6743f2eea0d18d35a7cc2390eb963bb8b718/scripts/make_bulk_data.sh#L45

CREATE TABLE dockets (
id integer ,
date_created timestamp with time zone ,
date_modified timestamp with time zone ,
source smallint ,
appeal_from_str text ,
assigned_to_str text ,
referred_to_str text ,
panel_str text ,
date_last_index timestamp with time zone,
date_cert_granted date,
date_cert_denied date,
date_argued date,
date_reargued date,
date_reargument_denied date,
date_filed date,
date_terminated date,
date_last_filing date,
case_name_short text ,
case_name text ,
case_name_full text ,
slug character varying(75) ,
docket_number text,
docket_number_core character varying(20) ,
pacer_case_id character varying(100),
cnature_of_suit character varying(1000) ,
cause character varying(2000) ,
jury_demand character varying(500) ,
jurisdiction_type character varying(100) ,
appellate_fee_status text ,
appellate_case_type_information text ,
mdl_status character varying(100) ,
filepath_local character varying(1000) ,
filepath_ia character varying(1000) ,
filepath_ia_json character varying(1000) ,
ia_upload_failure_count smallint,
ia_needs_upload boolean,
ia_date_first_change timestamp with time zone,
view_count integer ,
date_blocked date,
blocked boolean ,
appeal_from_id character varying(15),
assigned_to_id integer,
court_id character varying(15) ,
idb_data_id integer,
originating_court_information_id integer,
referred_to_id integer
);
rbpasker commented 1 month ago

maybe add a COPY FROM script restore_bulk_data.sh with the proper column order?

https://www.postgresql.org/docs/current/sql-copy.html

If a column list is specified, COPY TO copies only the data in the specified columns to the file. For COPY FROM, each field in the file is inserted, in order, into the specified column. Table columns not specified in the COPY FROM column list will receive their default values.

mlissner commented 1 month ago

Actually, wouldn't load_bulk_data.sh (in the bulk directory) do this?

rbpasker commented 1 month ago

Ok, well that would do it.

I followed these instructions:

https://www.courtlistener.com/help/api/bulk-data/#formats

And I guess I missed the fact that it lists the column names.

This could be cleared up by better documentation.

mlissner commented 1 month ago

That's generally good, so it sounds like the real issue here is a documentation issue.