alanrkessler / savantscraper

Using Python to add Baseball Savant data to a SQLite database
GNU General Public License v3.0

Timeout Issue #2

Closed alanrkessler closed 7 years ago

alanrkessler commented 7 years ago

Looks like we are hitting some query timeouts @jpetrich

alanrkessler commented 7 years ago

Specifically, even the link I am pasting returns a timeout CSV:

https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfGT=R%7CPO%7CS%7C&hfPR=&season=2008&player_type=batter&hfOuts=0%7C&team=TEX&position=&hfRO=&home_road=Road&hfInn=8%7C&min_pitches=0&pitcher_throws=R&min_results=0&group_by=name&sort_col=pitches&player_event_sort=start_speed&sort_order=desc&min_abs=0&xba_gt=&xba_lt=&px1=&px2=&pz1=&pz2=&ss_gt=&ss_lt=&is_barrel=&type=details&

jpetrich commented 7 years ago

This was the issue that caused me to add pitcher_throws to the query. Either they've added more columns since then, or they've further limited the size of the data they'll return. I think the ideal solution is to start with broad queries that work at least sometimes (year/team/inning/home_away) and then refine further only if necessary (home_road/pitcher_throws, etc.). The logic will be a bit tricky, since we have to make sure we don't miss any data and it's not well documented what every URL parameter does, but I think it's necessary: the script already requires 3600 requests/year with pitcher_throws. I'm picturing a flow like this:

[Year/Team/home_away/Inning?] -> [Outs] -> [Handedness] -> [Count] -> [Month] -> [Day]

The month/day part will be a little annoying, since we'd have to figure out their date format and the valid date ranges, but I think it will be necessary eventually as they either add more data or limit the response size. What do you think?
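To make the proposed flow concrete, here is a minimal sketch of "try the broad query, split only on failure". It is not the script's actual code: `fetch_refined`, the `fetch` callable, and the convention that `fetch` returns `None` on a timeout are all assumptions for illustration. It also assumes the values listed for each dimension partition the data, otherwise rows would be missed or duplicated.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Params = Dict[str, str]
# (URL parameter name, values to split on); both are illustrative assumptions.
Dimension = Tuple[str, Sequence[str]]


def fetch_refined(
    params: Params,
    dimensions: Sequence[Dimension],
    fetch: Callable[[Params], Optional[List[dict]]],
) -> List[dict]:
    """Try the query as given; split on the next dimension only if it fails.

    `fetch` is a hypothetical callable that returns the parsed rows,
    or None when the request timed out.
    """
    rows = fetch(params)
    if rows is not None:
        return rows
    if not dimensions:
        # Nothing left to split on, so surface the failure to the caller.
        raise RuntimeError(f"query still fails at the finest split: {params}")
    (name, values), rest = dimensions[0], dimensions[1:]
    collected: List[dict] = []
    for value in values:
        # Each sub-query adds one more filter to the failing query.
        collected.extend(fetch_refined({**params, name: value}, rest, fetch))
    return collected
```

Calling it with something like `[("hfOuts", ["0|", "1|", "2|"]), ("pitcher_throws", ["R", "L"])]` would follow the [Outs] -> [Handedness] part of the flow, and a query that succeeds at a broad level never triggers the finer splits.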

alanrkessler commented 7 years ago

Yeah that makes sense. I plan to take a look this weekend more in-depth.

alanrkessler commented 7 years ago

I added opposing team and batter handedness as a first shot. I noticed that we are using different Python versions. Minimal differences in this small script though.

alanrkessler commented 7 years ago

I am testing the 2008 season to see if I get the correct record count. If I do, I will make updates to the raw script and close.

jpetrich commented 7 years ago

Sounds good - hope it does work!

alanrkessler commented 7 years ago

For 2008, I ended up about 100,000 records short with just batter handedness. Checking against the website, I noticed that setting player type to batter (how the script is currently set up) returns about 10,000 fewer pitches than setting it to pitcher. That doesn't make much sense, but it is something that should be changed.

I also tried opposing team but it takes too long to be practical. I think there needs to be a way to try larger queries and only break them down if they don't work. First step is finding a way to check if one doesn't work and then modifying the script to break it down without adding duplicates.
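One possible way to handle the "without adding duplicates" part is to let SQLite enforce it, so that re-running an overlapping query cannot insert the same pitch twice. The sketch below is only an illustration: the table name `pitches` and the key columns `game_pk`, `at_bat_number`, and `pitch_number` are assumptions about what uniquely identifies a pitch, not something confirmed in this thread.

```python
import sqlite3
from typing import Iterable, Tuple


def insert_unique(conn: sqlite3.Connection,
                  rows: Iterable[Tuple[int, int, int]]) -> int:
    """Insert rows, skipping any whose key already exists; return rows added."""
    # The composite primary key is an assumed pitch identifier.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pitches (
               game_pk INTEGER,
               at_bat_number INTEGER,
               pitch_number INTEGER,
               PRIMARY KEY (game_pk, at_bat_number, pitch_number)
           )"""
    )
    before = conn.execute("SELECT COUNT(*) FROM pitches").fetchone()[0]
    # INSERT OR IGNORE silently drops rows that would violate the key.
    conn.executemany("INSERT OR IGNORE INTO pitches VALUES (?, ?, ?)", rows)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM pitches").fetchone()[0]
    return after - before
```

With this in place, a broken-down query that overlaps an earlier successful one only adds the rows that were not already stored.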

jpetrich commented 7 years ago

Darn, I was hoping for the easy solution! I can take a look at figuring out how to detect failure and break down the query this week.

alanrkessler commented 7 years ago

Figured out that we are no longer limited to 1,000 records. Now it is just the cost of the query. That means we only need to split by:

The query takes a while to run, so we'll need to have a check in there that makes sure it is populated. It might be worthwhile for errors to populate some sort of dictionary or data frame.
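A rough sketch of such a check, with failures collected in a dictionary, might look like the following. The thread does not say what a timeout CSV actually contains, so this only tests that the expected columns are present and that at least one row came back; the column names in the usage note and the helper names are assumptions.

```python
import csv
import io
from typing import Dict, List, Sequence, Tuple


def parse_if_populated(csv_text: str, required: Sequence[str]) -> List[dict]:
    """Parse CSV text, raising ValueError if it does not look populated."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in required if col not in header]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    rows = list(reader)
    if not rows:
        raise ValueError("response has a header but no rows")
    return rows


def collect_results(
    responses: Dict[str, str], required: Sequence[str]
) -> Tuple[Dict[str, List[dict]], Dict[str, str]]:
    """Split responses into parsed rows and an error dictionary, by query label."""
    parsed: Dict[str, List[dict]] = {}
    errors: Dict[str, str] = {}
    for label, text in responses.items():
        try:
            parsed[label] = parse_if_populated(text, required)
        except ValueError as exc:
            # Keep the reason so the failed query can be retried or split later.
            errors[label] = str(exc)
    return parsed, errors
```

For example, `collect_results(responses, ["pitch_type", "game_date"])` would return the usable rows plus a dictionary of the queries that need another attempt. A split that legitimately has no pitches would also land in the error dictionary here, so those entries need a second look rather than a blind retry.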

Here is a sample link:

https://baseballsavant.mlb.com/statcast_search?hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=&hfC=&hfSea=2016%7C&hfSit=&player_type=pitcher&hfOuts=0%7C&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&team=LAA&position=&hfRO=&home_road=&hfFlag=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name-event&sort_col=pitches&player_event_sort=api_p_release_speed&sort_order=desc&min_abs=0#results

alanrkessler commented 7 years ago

I made a fix based on what I found. The process is now much faster. I've given up trying to match the number of pitches from a league total query. I have tested with a couple of years.

jpetrich commented 7 years ago

Awesome work! This is indeed much better.