droher / boxball

Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.
Apache License 2.0

retrosheet_daily table missing game.source #62

Open segiddins opened 3 years ago

segiddins commented 3 years ago

cwdaily outputs daily lines for each player, which include the source for the game information. For games with multiple sources, there will be multiple daily entries for a given (player_id, game_id) tuple, and right now there's no column that can be used to disambiguate.

E.g. select * from retrosheet_daily where game_dt = '1943-06-19' and player_id = 'mackr101';

yields 5 rows for 2 games (the two halves of a doubleheader), with each game having a box score and a deduced game, according to https://raw.githubusercontent.com/chadwickbureau/retrosplits/master/daybyday/playing-1943.csv. I'm not sure why Mack in particular has 2 deduced game entries for CHA194306191, but that's probably an issue in Chadwick.
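
To list every affected (player_id, game_id) pair rather than checking one player at a time, one option is to group on both columns and keep the groups with more than one row. The sketch below runs that query against an in-memory SQLite table. Only the table name and the columns player_id, game_id, and game_dt come from this issue; the sample rows are made up and the helper name find_duplicate_daily_rows is hypothetical.

```python
import sqlite3


def find_duplicate_daily_rows(conn):
    """Return (player_id, game_id, n_rows) for pairs that appear more than once."""
    query = """
        SELECT player_id, game_id, COUNT(*) AS n_rows
        FROM retrosheet_daily
        GROUP BY player_id, game_id
        HAVING COUNT(*) > 1
        ORDER BY player_id, game_id
    """
    return conn.execute(query).fetchall()


# Toy stand-in for the real table: placeholder values, not Retrosheet data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retrosheet_daily (game_id TEXT, game_dt TEXT, player_id TEXT)"
)
conn.executemany(
    "INSERT INTO retrosheet_daily VALUES (?, ?, ?)",
    [
        ("GAME_A", "1900-01-01", "player1"),  # first source for GAME_A
        ("GAME_A", "1900-01-01", "player1"),  # second source, same key
        ("GAME_B", "1900-01-01", "player1"),
        ("GAME_A", "1900-01-01", "player2"),
    ],
)

duplicates = find_duplicate_daily_rows(conn)
print(duplicates)
```

The same SELECT should work unchanged against the Postgres or MySQL images, since it uses only standard GROUP BY and HAVING.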

droher commented 3 years ago

Hmm, I put in some protection against this problem here, but looks like it's not working: https://github.com/droher/boxball/blob/72c7bc05993968b0897c1bcf9f662ed1e82b2776/extract/parsers/retrosheet.py#L61

I'll try to patch. Adding a general source column across all of these tables would be a great idea. For now, I do have an extra retrosheet_deduced_game table that you can join on to find which games have deduced entries -- I know that doesn't help with disambiguation, though.
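
A rough sketch of that join, using an in-memory SQLite database: it assumes retrosheet_deduced_game exposes a game_id column (the table's actual schema is not shown in this thread), and the sample rows are made up. It flags every daily row whose game has a deduced entry, which narrows down where duplicates can occur but cannot say which of two duplicate rows came from the deduced source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Toy tables with placeholder values; the game_id column on
    -- retrosheet_deduced_game is an assumption, not a confirmed schema.
    CREATE TABLE retrosheet_daily (game_id TEXT, player_id TEXT);
    CREATE TABLE retrosheet_deduced_game (game_id TEXT);
    INSERT INTO retrosheet_daily VALUES
        ('GAME_A', 'player1'),
        ('GAME_A', 'player1'),
        ('GAME_B', 'player1');
    INSERT INTO retrosheet_deduced_game VALUES ('GAME_A');
    """
)

# LEFT JOIN keeps every daily row; the DISTINCT subquery prevents the join
# itself from multiplying rows if a game is listed more than once.
query = """
    SELECT d.game_id,
           d.player_id,
           g.game_id IS NOT NULL AS has_deduced_entry
    FROM retrosheet_daily AS d
    LEFT JOIN (SELECT DISTINCT game_id FROM retrosheet_deduced_game) AS g
        ON g.game_id = d.game_id
    ORDER BY d.game_id, d.player_id
"""
flagged = conn.execute(query).fetchall()
print(flagged)
```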

droher commented 2 years ago

This hasn't been resolved in the code, but I've manually removed the duplicated games from my Retrosheet fork, so the newly published version should be free of this bug.