droher / boxball

Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.
Apache License 2.0
117 stars 16 forks source link

Inconsistencies in Parquet download files #73

Open JRice15 opened 8 months ago

JRice15 commented 8 months ago

Describe the bug I downloaded the parquet files directly from the OneDrive link included in the repo's readme, and have been reading them with pyarrow and pandas. In digging though the data (2022 only, so far), I discovered two problems, one of which led me to the other.

  1. First (and maybe more of a feature request than a bug) is the fact that, best I can tell, there is no easy way to tell whether games are regular season or postseason/allstar/other from any of the parquet files (I would particularly expect game.parquet, gamelog.parquet, or schedule.parquet to have an indicator column for this, but I do not see one).

  2. Second appears to be more of a bug. Contrary to no. 1 above, schedule.parquet only seems to include regular season games. Great, we can simply filter game.parquet and other files by whether or not the game exists in schedule.parquet, right? Nope, schedule.parquet seems to list games that never actually occurred. As an example: schedule.parquet includes a game MIL @ CHN 2022-04-08, not part of a double header. However, games.parquet (as well as baseball-reference and other sources) tell us that no such game exists! MIL @ CHN games occurred on 4/7/22 and 4/9/22, but not 4/8/22. I find 88 of these 'phantom games' in schedule.parquet for 2022. And it's not a byproduct of the pandemic or lockout, I found these inconsistencies in every year I've looked at as far back as 2000.

To Reproduce Steps to reproduce the behavior:

  1. Compare schedule.parquet and game.parquet as described above.

Expected behavior Games will be consistent across files, and a column easily delineates whether games are regular season or not.

I am entirely open to the fact that the files are exactly as intended and I am just missing something that explains the discrepancies, let me know if that is the case. Thanks!

droher commented 8 months ago

1) See this issue here - https://github.com/droher/boxball/issues/68. Game type was not included in the raw files until this summer, and there's no released version of the game parsing software that creates a column out of it. Hopefully I can manage to get it into the game file on the next update. I made a custom CSV that I linked in that issue to hold you over in the meantime. (https://github.com/droher/boxball/files/11984000/game_types.csv)

2) It sounds like this was an issue for you primarily as a function of 1), but in case you're still interested in using them, the explanation is here: https://www.retrosheet.org/schedule/index.html Those are the original schedules before postponements and rescheduling. A valuable historical dataset to have on its own, but not for this particular purpose.

JRice15 commented 8 months ago
  1. Thanks, that csv will work for my purposes. I really appreciate the project!

  2. Maybe a simple README file could be included with the data files that makes clear what each file is representing, so others don't run into the same confusion. The fact that 'postponement_indicator' and 'makeup_dates' columns exist in schedule.parquet and are all-null as far as I can tell made me assume that the problem was not caused of rescheduling, but maybe those columns are incorrect or incomplete.

droher commented 8 months ago

The readme has a link to the schema doc, perhaps not prominently enough, where you can find the schedule doc. I just pulled the schedule.parquet file down from OneDrive and see plenty of non-null values in those columns, but there is a 2020 bug.

The error is coming from the extract step - 2020 is the only year with two different schedule files, one for the original and one for the revised, and the glob is picking up both. I'll fix that in the code and the data should be correct in the next release. Thanks for the catch! If this is the only 2020-related bug in the data I'll be pleasantly surprised.

JRice15 commented 8 months ago

You're right, I amend my statement to: 'postponement_indicator' is all-null in 2021 and 2022 only. When grouping the entire schedules table by year, I see zero postponements for 2021 and 2022, while every other year has at least 20 or so.

droher commented 8 months ago

Yep, that's definitely a bug - this time from gaps in the data that have since been corrected by Retrosheet. Those years will be filled on the next release (which I'm guessing I'll have up here in December).

JRice15 commented 8 months ago

Great, thanks for your diligence and quick responses!