BillPetti / baseballr

A package written for R focused on baseball analysis. Currently in development.
billpetti.github.io/baseballr

Speeding up slow Statcast Database query #202

Open · jestarr opened this issue 3 years ago

jestarr commented 3 years ago

I followed the post here (https://billpetti.github.io/2020-05-26-build-statcast-database-rstats-version-2.0/) to create a PostgreSQL Statcast database, and I also followed the instructions for indexing. Even so, a simple summary query takes 93 seconds. Is there a way to fix/improve the Statcast database performance?

# `statcast` is a dplyr tbl backed by the database table, e.g.
# statcast <- dplyr::tbl(statcast_db, 'statcast')
tictoc::tic()
statcast %>%
  filter(game_year >= 2019 & !is.na(pitch_type) & !is.na(release_speed)) %>%
  group_by(game_year, pitcher) %>%
  summarise(velo = mean(release_speed, na.rm = TRUE)) %>%
  arrange(desc(velo)) %>%
  collect()
tictoc::toc()
93.63 sec elapsed
BillPetti commented 3 years ago

This isn't really an issue with baseballr, but when I run this I get a result in 24 seconds:

> library(baseballr)
> library(tidyverse)
> statcast_db <- myDBconnections::connect_Statcast_postgreSQL()
> tictoc::tic()
> tbl(statcast_db, 'statcast') %>% 
+   filter(game_year >= 2019 & !is.na(pitch_type) & !is.na(release_speed)) %>%  
+   group_by(game_year, pitcher) %>% 
+   summarise(velo = mean(release_speed, na.rm = TRUE)) %>% 
+   arrange(desc(velo)) %>% 
+   collect()
# A tibble: 2,347 x 3
# Groups:   game_year [3]
   game_year pitcher  velo
       <int>   <int> <dbl>
 1      2020  676979  98.0
 2      2021  661403  97.5
 3      2019  661403  97.4
 4      2021  622251  97.2
 5      2021  621237  97.1
 6      2021  594798  96.9
 7      2021  676206  96.8
 8      2019  621237  96.7
 9      2021  669358  96.7
10      2019  594027  96.6
# … with 2,337 more rows
> tictoc::toc()
24.632 sec elapsed

I have my data in a PostgreSQL 12 database, so that might account for some of the difference you are experiencing.
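
If the gap persists, one way to check whether Postgres is actually using the indexes is to inspect the query plan. A minimal diagnostic sketch, assuming the statcast_db connection from above (show_query() prints the SQL that dbplyr generates; explain() runs EXPLAIN against the database):

library(dplyr)

velo_query <- tbl(statcast_db, 'statcast') %>%
  filter(game_year >= 2019 & !is.na(pitch_type) & !is.na(release_speed)) %>%
  group_by(game_year, pitcher) %>%
  summarise(velo = mean(release_speed, na.rm = TRUE))

# Print the SQL that dbplyr will send to the database
show_query(velo_query)

# Ask PostgreSQL for the query plan; sequential scans on indexed
# columns suggest stale or missing indexes
explain(velo_query)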

jestarr commented 3 years ago

I have mine in a PostgreSQL 13 database. Could it be an issue with the indexing? I followed your code in the blog post. Otherwise, could it be computer speed/RAM? I didn't think that could account for such a significant difference.

BillPetti commented 3 years ago

Have you changed or appended the statcast table in any way after you created the indexes? If so, you need to re-index whenever you make a change. You can check whether the indexes are still live with this query (change tablename as appropriate for your database):

SELECT
    tablename,
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    schemaname = 'public'
    and 
    tablename = 'statcast'
ORDER BY
    tablename,
    indexname;

Here's what I currently have:

tablename|indexname             |indexdef                                                                    |
---------|----------------------|----------------------------------------------------------------------------|
statcast |statcast_batter_index |CREATE INDEX statcast_batter_index ON public.statcast USING btree (batter)  |
statcast |statcast_game_year    |CREATE INDEX statcast_game_year ON public.statcast USING btree (game_year)  |
statcast |statcast_gamepk       |CREATE INDEX statcast_gamepk ON public.statcast USING btree (game_pk)       |
statcast |statcast_index        |CREATE INDEX statcast_index ON public.statcast USING btree (game_date)      |
statcast |statcast_pitcher_index|CREATE INDEX statcast_pitcher_index ON public.statcast USING btree (pitcher)|
statcast |statcast_type         |CREATE INDEX statcast_type ON public.statcast USING btree (type)            |
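
If any of those indexes are missing from your listing, or if the table has been appended since they were built, here is a minimal re-indexing sketch from R via DBI, assuming an open connection named statcast_db (REINDEX TABLE rebuilds every index on the table; ANALYZE refreshes the planner's statistics):

library(DBI)

# Rebuild all indexes on the statcast table
dbExecute(statcast_db, "REINDEX TABLE statcast;")

# Update planner statistics so newly appended rows are costed correctly
dbExecute(statcast_db, "ANALYZE statcast;")
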
jestarr commented 3 years ago

Everything was identical. I re-indexed (I could have sworn I made no changes after indexing) and now I'm at 7 seconds! Thanks! Quick question on current-year scraping: do you have code that outlines how you bind current-year data to the historical data?

BillPetti commented 3 years ago

Not sure what you mean. Once you have new data to upload to your database, you can use something like this (assuming all columns are properly cast to align with the database):

dbWriteTable(database_connection,
             "<name of your table>",
             payload_to_append,
             append = TRUE,
             overwrite = FALSE,
             row.names = FALSE)
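
As a usage sketch of a daily update, assuming baseballr's scrape_statcast_savant() for pulling one day of pitch-level data and a table named statcast (the date handling and column casting here are illustrative assumptions, not a confirmed pipeline):

library(baseballr)
library(DBI)

# Pull yesterday's pitch-level data from Baseball Savant
yesterday <- Sys.Date() - 1
payload_to_append <- scrape_statcast_savant(start_date = yesterday,
                                            end_date = yesterday,
                                            player_type = "pitcher")

# Append to the existing table; column types must already be cast
# to align with the database schema, as noted above
dbWriteTable(statcast_db, "statcast", payload_to_append,
             append = TRUE, overwrite = FALSE, row.names = FALSE)
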
jestarr commented 3 years ago

Similar to how you mapped out building the historical Statcast database, do you have example code that shows the steps/functions for appending new 2021 data to the historical Statcast database on a daily basis?