Open JRice15 opened 8 months ago
1) See this issue here - https://github.com/droher/boxball/issues/68. Game type was not included in the raw files until this summer, and there's no released version of the game parsing software that creates a column out of it. Hopefully I can manage to get it into the game file on the next update. I made a custom CSV that I linked in that issue to hold you over in the meantime. (https://github.com/droher/boxball/files/11984000/game_types.csv)
2) It sounds like this was an issue for you primarily as a function of 1), but in case you're still interested in using them, the explanation is here: https://www.retrosheet.org/schedule/index.html Those are the original schedules before postponements and rescheduling. A valuable historical dataset to have on its own, but not for this particular purpose.
Thanks, that csv will work for my purposes. I really appreciate the project!
Maybe a simple README file could be included with the data files that makes clear what each file is representing, so others don't run into the same confusion. The fact that 'postponement_indicator' and 'makeup_dates' columns exist in schedule.parquet
and are all-null as far as I can tell made me assume that the problem was not caused of rescheduling, but maybe those columns are incorrect or incomplete.
The readme has a link to the schema doc, perhaps not prominently enough, where you can find the schedule doc. I just pulled the schedule.parquet
file down from OneDrive and see plenty of non-null values in those columns, but there is a 2020 bug.
The error is coming from the extract step - 2020 is the only year with two different schedule files, one for the original and one for the revised, and the glob is picking up both. I'll fix that in the code and the data should be correct in the next release. Thanks for the catch! If this is the only 2020-related bug in the data I'll be pleasantly surprised.
You're right, I amend my statement to: 'postponement_indicator' is all-null in 2021 and 2022 only. When grouping the entire schedules table by year, I see zero postponements for 2021 and 2022, while every other year has at least 20 or so.
Yep, that's definitely a bug - this time from gaps in the data that have since been corrected by Retrosheet. Those years will be filled on the next release (which I'm guessing I'll have up here in December).
Great, thanks for your diligence and quick responses!
Describe the bug I downloaded the parquet files directly from the OneDrive link included in the repo's readme, and have been reading them with pyarrow and pandas. In digging though the data (2022 only, so far), I discovered two problems, one of which led me to the other.
First (and maybe more of a feature request than a bug) is the fact that, best I can tell, there is no easy way to tell whether games are regular season or postseason/allstar/other from any of the parquet files (I would particularly expect
game.parquet
,gamelog.parquet
, orschedule.parquet
to have an indicator column for this, but I do not see one).Second appears to be more of a bug. Contrary to no. 1 above,
schedule.parquet
only seems to include regular season games. Great, we can simply filtergame.parquet
and other files by whether or not the game exists inschedule.parquet
, right? Nope,schedule.parquet
seems to list games that never actually occurred. As an example:schedule.parquet
includes a game MIL @ CHN 2022-04-08, not part of a double header. However,games.parquet
(as well as baseball-reference and other sources) tell us that no such game exists! MIL @ CHN games occurred on 4/7/22 and 4/9/22, but not 4/8/22. I find 88 of these 'phantom games' inschedule.parquet
for 2022. And it's not a byproduct of the pandemic or lockout, I found these inconsistencies in every year I've looked at as far back as 2000.To Reproduce Steps to reproduce the behavior:
schedule.parquet
andgame.parquet
as described above.Expected behavior Games will be consistent across files, and a column easily delineates whether games are regular season or not.
I am entirely open to the fact that the files are exactly as intended and I am just missing something that explains the discrepancies, let me know if that is the case. Thanks!