droher / boxball

Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.
Apache License 2.0
117 stars 16 forks source link

Billy Herman 1934 allstar_full_pkey unique violation in 2020 postgres and mysql images, value in park.csv exceeds varcar(45) in 2020 postgres_cstore image #52

Closed prrapo closed 4 years ago

prrapo commented 4 years ago

Tested a few of the new images and three of the containers of were exiting shortly after launching. Removed the image and cleared the cache with each of these three and retried but got same error codes. Outputted their logs, see below, and saw these issues. I only managed to try postgres, postres_cstore, mysql, and clickhouse. Clickhouse worked with no issues. Docker version 19.03.8, build afacb8b running on CentOS Linux release 7.8.2003 (Core). These could probably be separated into two different issues but I'm grouping them together because they're both a result of the 2020 updates.

Postgres Log 2020-05-02 05:15:21.721 UTC [42] ERROR: duplicate key value violates unique constraint "allstar_full_pkey" 2020-05-02 05:15:21.721 UTC [42] DETAIL: Key (player_id, year_id, game_num)=(hermabi01, 1934, 0) already exists. 2020-05-02 05:15:21.721 UTC [42] CONTEXT: COPY allstar_full, line 69 2020-05-02 05:15:21.721 UTC [42] STATEMENT: COPY baseballdatabank.allstar_full(player_id, year_id, game_num, game_id, team_id, lg_id, gp, starting_pos) FROM PROGRAM 'zstd --rm -cd /data/baseballdatabank/allstar_full.csv.zst' WITH (FORMAT CSV, FORCE_NULL(player_id, year_id, game_num, game_id, team_id, lg_id, gp, starting_pos)); psql:/docker-entrypoint-initdb.d/postgres.sql:535: ERROR: duplicate key value violates unique constraint "allstar_full_pkey" DETAIL: Key (player_id, year_id, game_num)=(hermabi01, 1934, 0) already exists. CONTEXT: COPY allstar_full, line 69

MySQL Log /usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/mysql.sql ERROR 1062 (23000) at line 536: Duplicate entry 'hermabi01-1934-0' for key 'PRIMARY'

Note: An issue exists on baseballdatabank for this Billy Herman record. He played twice in the game. 🤷‍♂️ https://github.com/chadwickbureau/baseballdatabank/issues/104

Postgres_cstore Log /data/baseballdatabank/parks.csv.zst: 11602 bytes 2020-05-02 05:23:59.954 UTC [40] ERROR: value too long for type character varying(45) 2020-05-02 05:23:59.954 UTC [40] STATEMENT: COPY baseballdatabank_parks(park_id, park_name, park_alias, city, state, country) FROM PROGRAM 'zstd --rm -cd /data/baseballdatabank/parks.csv.zst' WITH (FORMAT CSV, FORCE_NULL(park_id, park_name, park_alias, city, state, country)); psql:/docker-entrypoint-initdb.d/postgres_cstore_fdw.sql:527: ERROR: value too long for type character varying(45)

double-dose-larry commented 4 years ago

Experiencing the same issue (I think)

2020-05-04 18:05:02.959 UTC [41] STATEMENT: COPY baseballdatabank_parks(park_id, park_name, park_alias, city, state, country) FROM PROGRAM 'zstd --rm -cd /data/baseballdatabank/parks.csv.zst' WITH (FORMAT CSV, FORCE_NULL(park_id, park_name, park_alias, city, state, country));

psql:/docker-entrypoint-initdb.d/postgres_cstore_fdw.sql:527: ERROR: value too long for type character varying(45)

prrapo commented 4 years ago

Looked through and found the specific offender for the postgres_cstore issue. The aliases section for Guaranteed Rate Field (White Sox) is 50+ characters long. I tried changing the DDL to varchar(55) then committing a new image and running that but it never got past the parks.csv.zst file load even though it did not send an error message this time. I will add the disclaimer though that I am not super familiar with docker's workings.

droher commented 4 years ago

Quick update here: I've fixed the bugs above, but there are some Retrosheet duplication issues to deal with. Hoping to have it all ironed out this week.

droher commented 4 years ago

Should be good now, I ended up having to fork/edit the data manually. Please reopen if you're still having problems.