BillPetti / baseballr

A package written for R focused on baseball analysis. Currently in development.
billpetti.github.io/baseballr
Other
363 stars 99 forks source link

Bug matchup.postOnFirst.id variable #263

Open Clem68 opened 2 years ago

Clem68 commented 2 years ago

Description of the bug When I get data from the mlb_pbp() function, there seems to be a problem in the definition of the variables indicating which player is present on a given base (matchup.postOnFirst.id, matchup.postOnSecond.id and matchup.postOnThird.id ) at given pitches for certain games. Normally these variables indicate who is present on a given base at the end of a batter's turn (ie, when he is out or goes to some base) and is NA otherwise. But for some entries, it is displaying the same player ID repeatedly even when it should not. For example, in some games, see the example below, the player on first base remains on first base even when there is a half-inning change.

To Reproduce Steps to reproduce the behavior: data = mlb_pbp(263811) data= arrange(data, about.atBatIndex, index ) #Get the pitches in the right order. The bug appears at (among others) atBatIndex 47 to 52.

Here is a screenshot of a excel table where you can visually see the problem. Capture d’écran 2022-07-26 152607

Thanks a lot for your work!

saberpowers commented 1 year ago

I think this might be the source of the problem because some columns are being tidyr::fill'd up when they shouldn't be. For the one game I looked at (661400), "matchup.postOnThird.id" is included in this span of columns.

saiemgilani commented 1 year ago

Oops

saberpowers commented 1 year ago

Putting on my maintainer hat for a second, it looks like that dplyr::fill has been there as far back as the commit history goes. It's not clear to me why it's there, and I think it's undesirable because it happens after a left join between pitches (plays in the code) and plate appearances (at_bats in the code), meaning that there are a lot of NAs that you'd want to leave as such. That having been said, if I were the maintainer of the code, I would be cautious about removing it because I don't know what users have come to expect the existing behavior (again, I don't know why it's there in the first place).