Closed alanrkessler closed 7 years ago
Specifically, even the link I am pasting below returns a timeout CSV:
https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfGT=R%7CPO%7CS%7C&hfPR=&season=2008&player_type=batter&hfOuts=0%7C&team=TEX&position=&hfRO=&home_road=Road&hfInn=8%7C&min_pitches=0&pitcher_throws=R&min_results=0&group_by=name&sort_col=pitches&player_event_sort=start_speed&sort_order=desc&min_abs=0&xba_gt=&xba_lt=&px1=&px2=&pz1=&pz2=&ss_gt=&ss_lt=&is_barrel=&type=details&
This was the issue that caused me to add pitcher_throws to the query. Either they've added more columns or limited the size of data they'll return even more since then. I think the ideal solution is to start with broad queries that work at least sometimes (year/team/inning/home_away) and then refine more if necessary (home_road/pitcher_throws) etc. The logic for this will be a little bit tricky to make sure we don't miss any data, especially since it's not perfectly documented what every url parameter does, but I think it's necessary since it already requires 3600 requests/year with pitcher_throws. I'm picturing a flow like this:
[Year/Team/home_away/Inning?] -> [Outs] -> [Handedness] -> [Count] -> [Month] -> [Day]
The month/day part is going to be a little annoying, since we will have to figure out their date format and the valid date ranges, but I think it's going to be necessary eventually as they either add more data or limit the request response size. What do you think?
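A rough sketch of that flow, to make the idea concrete: try the broad query first and split on the next dimension only when it fails. Everything here is hypothetical scaffolding: `fetch_with_refinement`, the `fetch` callback (assumed to return `None` on failure), and the value lists per dimension are not from the existing script. It also assumes the values for each dimension partition the data with no gaps or overlap, which is exactly the part that is not well documented.

```python
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple

Params = Dict[str, str]
Rows = List[dict]


def fetch_with_refinement(
    params: Params,
    dims: Sequence[Tuple[str, Sequence[str]]],
    fetch: Callable[[Params], Optional[Rows]],
) -> Iterator[Rows]:
    """Yield row batches, splitting on the next dimension only if a query fails.

    `fetch` is a caller-supplied function (hypothetical) that returns the
    parsed rows, or None when the request timed out or came back unusable.
    """
    rows = fetch(params)
    if rows is not None:
        # The broad query worked, so no further refinement is needed.
        yield rows
        return
    if not dims:
        raise RuntimeError(f"query failed with nothing left to split on: {params}")
    (name, values), rest = dims[0], dims[1:]
    for value in values:
        # Narrow the failed query by one more URL parameter and retry.
        yield from fetch_with_refinement({**params, name: value}, rest, fetch)
```

The `dims` argument would list the refinements in the order above, for example `[("hfOuts", ["0|", "1|", "2|"]), ("pitcher_throws", ["R", "L"])]`, where the value lists are guesses based on the URL in this issue.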
Yeah that makes sense. I plan to take a look this weekend more in-depth.
I added opposing team and batter handedness as a first shot. I noticed that we are using different Python versions. Minimal differences in this small script though.
I am testing the 2008 season to see if I get the correct record count. If I do, I will make updates to the raw script and close.
Sounds good - hope it does work!
For 2008, I ended up about 100,000 records short with just batter handedness. Checking against the website, I noticed that when player type is set to batter (how it is currently set up), you get about 10,000 fewer pitches than when it is set to pitcher. That doesn't really make much sense, but it is something that should be changed.
I also tried opposing team but it takes too long to be practical. I think there needs to be a way to try larger queries and only break them down if they don't work. First step is finding a way to check if one doesn't work and then modifying the script to break it down without adding duplicates.
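One possible way to check whether a response worked, sketched under assumptions: since the failure comes back as a CSV rather than an HTTP error, the check looks at the header row instead of the status code. `response_is_usable` is a hypothetical helper, and the caller has to supply the column names a good response should contain; the exact shape of the timeout CSV is not known here.

```python
import csv
import io
from typing import Iterable


def response_is_usable(text: str, required_columns: Iterable[str]) -> bool:
    """Return True if `text` parses as CSV with the expected header and data.

    `required_columns` is an assumption supplied by the caller: a few column
    names that a real result is expected to contain.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        # Completely empty body.
        return False
    if not set(required_columns) <= {h.strip() for h in header}:
        # Header does not look like a real result (e.g. an error payload).
        return False
    # Require at least one data row after the header.
    return next(reader, None) is not None
```

Note that a slice that is legitimately empty (header only) is treated as unusable here, so a caller may want to handle that case separately. Duplicates should not appear as long as the sub-queries split on disjoint values of one parameter.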
Darn, I was hoping for the easy solution! I can take a look at figuring out how to detect failure and break down the query this week.
Figured out that we are no longer limited to 1,000 records. Now it is just the cost of the query. That means we only need to split by:
The query takes a while to run, so we'll need to have a check in there that makes sure it is populated. It might be worthwhile for errors to populate some sort of dictionary or data frame.
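A minimal sketch of the error bookkeeping, assuming a plain dictionary is enough to start with. `run_queries` and the `fetch` callback are hypothetical names, not part of the current script; `fetch` is assumed to return a list of rows and to raise on network or timeout errors.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Params = Dict[str, str]
Rows = List[dict]


def run_queries(
    queries: Iterable[Tuple[Hashable, Params]],
    fetch: Callable[[Params], Rows],
) -> Tuple[Dict[Hashable, Rows], Dict[Hashable, str]]:
    """Run each query and keep successes and failures in separate dicts."""
    results: Dict[Hashable, Rows] = {}
    errors: Dict[Hashable, str] = {}
    for key, params in queries:
        try:
            rows = fetch(params)
        except Exception as exc:  # record the failure instead of stopping the run
            errors[key] = repr(exc)
            continue
        if not rows:
            # Nothing came back, so flag it for a retry or a closer look.
            errors[key] = "empty response"
            continue
        results[key] = rows
    return results, errors
```

The `errors` dictionary can be turned into a data frame afterwards, or its keys fed back in as the list of queries to retry.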
Here is a sample link:
I made a fix given what I found. Process is much faster. I've given up trying to match the number of pitches from a league total query. I have tested with a couple of years.
Awesome work! This is indeed much better.
Looks like we are hitting some query timeouts @jpetrich