Closed · SRP457 closed this issue 3 years ago
If I could get a little more info, I'd like to work on it using Scrapy as the scraping framework, if you don't mind.
What do you mean by reflecting the changes in the data folder? Should I keep a log of updates made to the data, or just make a script which updates the data daily?
Any web scraping tool is welcome. We currently have ODI records for each player under zip (match IDs and dates), zip2 (batting records), bowl, and wk. As the current players play further matches, those folders should be updated accordingly. For now you only need to update the zip folder, which contains the match IDs and dates for each player: add any new non-retired player whose records have been put up, and update the records whenever current players have played a new ODI series. For example, the Pakistan and England players played a series recently; we would like you to update those records.
Everything must be scraped from howstat.com. You can also update zip2, bowl, and wk using the scoring table in Dataset.md, but for a PR, updating the zip folder is sufficient for now.
I would suggest making a script which updates it.
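The core of such an update script is merging freshly scraped rows into a player's existing records without duplicating matches already stored. A minimal sketch with pandas, assuming a simple (match_id, date) schema; the column names and the helper `merge_player_matches` are illustrative, not the repo's actual layout:

```python
import pandas as pd

def merge_player_matches(existing: pd.DataFrame, scraped: pd.DataFrame) -> pd.DataFrame:
    """Append newly scraped ODI matches, dropping match IDs already stored."""
    combined = pd.concat([existing, scraped], ignore_index=True)
    combined = combined.drop_duplicates(subset="match_id", keep="first")
    return combined.sort_values("date").reset_index(drop=True)

# Example data (made up): one overlapping match, one new one.
existing = pd.DataFrame({"match_id": [4501, 4507], "date": ["2019-05-01", "2019-05-08"]})
scraped = pd.DataFrame({"match_id": [4507, 4512], "date": ["2019-05-08", "2019-06-02"]})
updated = merge_player_matches(existing, scraped)
print(updated["match_id"].tolist())  # → [4501, 4507, 4512]
```

Run daily, this keeps each record idempotent: re-scraping an unchanged series adds nothing.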
Two things:
Also, it's kind of finished; I'm just fixing bugs at the moment: https://github.com/scientes/Best11-Fantasycricket/tree/webcrawler
Currently it recrawls everything, but that is a problem I need to fix later. (At the moment I'm using an HTTP cache for development so pages aren't crawled twice, but that doesn't solve the recrawling itself.)
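For reference, the development-only HTTP cache mentioned above is a few lines in a Scrapy project's `settings.py`. These are standard Scrapy settings; the expiry value is just an example:

```python
# Scrapy built-in HTTP cache: every response is stored on disk and
# replayed on subsequent runs, so pages are not re-fetched while developing.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached pages never expire
HTTPCACHE_DIR = "httpcache"        # stored under the project's .scrapy directory
```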
So: do I need to filter out retired players, or do you want all of them?
As for retired players, I don't mind keeping them. It's your call if you want to remove them.
For non-retired players, check http://www.howstat.com/cricket/Statistics/Players/PlayerListCurrent.asp. Once players are removed from that list, they are considered retired, as per http://www.howstat.com/cricket/Statistics/Players/PlayerMenu.asp.
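Given that rule, retirement can be detected by diffing the stored player IDs against the IDs scraped from the current-players list. A small sketch; the ID values are made up for illustration:

```python
def find_retired(stored_ids: set, current_ids: set) -> set:
    """Players we hold records for who no longer appear on PlayerListCurrent.asp."""
    return stored_ids - current_ids

stored = {"2250", "3673", "4105"}    # IDs we already track in the data folder
current = {"3673", "4105", "4988"}   # IDs scraped from the current-players page
print(sorted(find_retired(stored, current)))  # → ['2250']
```

Player "4988" appearing only in `current` would be the opposite case: a new non-retired player to add.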
Ah, thanks. Another issue: Git is creating problems for me because the total number of files is very large, since there are 5038 players in total. Wouldn't it be better to make one file per folder and just filter on usage, or should I just push all 5k files (roughly 3x that in the future, due to zip2, bowl, and wk)? With zip2 and the rest I'm a bit lost on how to calculate the values because I'm not familiar with cricket at all, but those categories are easy to implement with the current crawler.
I didn't understand what you mean by one file per folder; if you could elaborate on that, it would be helpful. Once the zip folder is implemented, zip2, bowl, and wk are just a simple function away, so it's fine if you don't implement them.
Well, I mean that currently you generate one file per player per folder (a bit fewer, because not everyone is in bowl and wk, to my knowledge). With 5000 or so total players that's approximately 15k-20k files containing maybe 20-30 MB in total, which is a lot of files for this little data. It would probably be wise to store all the data for zip in one file, all the data for zip2 in another, and so on. You are already using pandas, so why bother splitting up the data instead of just filtering in pandas?
Ohh, you mean like one file called zip.csv, and likewise zip2.csv, bowl.csv, and wk.csv?
If that's the case, then how will you adjust for each player? Are you suggesting something like:
player | matches
---|---
player1 | matchid1
 | matchid2
 | matchid3
player2 | matchid1
and so on
This actually sounds doable.
Yes that was my idea.
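The agreed layout amounts to one long-format CSV per category, with a row per (player, match), filtered per player in pandas instead of opening a per-player file. A minimal sketch, assuming illustrative column names:

```python
import pandas as pd
from io import StringIO

# Stand-in for a single zip.csv holding every player's match IDs and dates.
zip_csv = StringIO(
    "player,match_id,date\n"
    "player1,matchid1,2019-05-01\n"
    "player1,matchid2,2019-05-08\n"
    "player2,matchid1,2019-05-01\n"
)
records = pd.read_csv(zip_csv)

# Filtering replaces the old one-file-per-player lookup:
p1 = records[records["player"] == "player1"]
print(p1["match_id"].tolist())  # → ['matchid1', 'matchid2']
```

One file per category also keeps the Git history manageable: each daily update touches four files instead of thousands.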
Closed! Thanks to @scientes
Describe the Issue: The player records in the data folder are outdated and static. Thus, they may not be enough to accurately predict player performances in current matches. Previous records were created from web-scraped data from howstat.com.
Solution: Keep the records up to date using web-scraping for daily updates and reflect those changes in the data folder.
Comment if you would like to work on this