HackerSpace-PESU / Best11-Fantasycricket

Predicting the Best 11 for a fantasy cricket game
GNU Affero General Public License v3.0

[BUG] Web crawler searches through matches from the 1900s #40

Closed roysti10 closed 3 years ago

roysti10 commented 3 years ago

Describe the bug: The web crawler in feature-crawler takes in match records going back to the 1900s. This wastes a lot of time and reduces the efficiency of the crawler.

To Reproduce: Steps to reproduce the behavior:

  1. Follow the instructions in the README file to run the crawler
  2. Wait for the IDs crawl to finish and notice that match records from the 1900s are included

Expected behavior: The crawler should apply a filter that accepts match records only from 2017 onward.

Possible solution, in crawler/cricketcrawler/spiders/howstat.py, in the function parse_scorecard:

# Only yield matches whose date string starts with a year of 2017 or later
if int(date[0:4]) >= 2017:
    item = MatchidItem(name=url[startint+10:], folder=folder, matchid=matchid, date=date)
    yield item
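A minimal, standalone sketch of the same year check, in case it helps to test the filter outside the spider. The helper name `is_recent` and the `min_year` parameter are hypothetical (they are not in the repository), and the sketch assumes, as the snippet above does, that the date string starts with a four-digit year.

```python
def is_recent(date, min_year=2017):
    """Return True if the date string starts with a year >= min_year.

    Assumes the first four characters of `date` are the year, as the
    proposed `int(date[0:4])` check does. Strings that do not start
    with a parseable integer are treated as not recent.
    """
    try:
        return int(date[0:4]) >= min_year
    except ValueError:
        # Empty or non-numeric prefix: skip the record instead of crashing
        return False
```

Inside parse_scorecard, the condition would then read `if is_recent(date):`, which also avoids an exception if a scorecard has a missing or malformed date.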


Additional context: A starting point for this might be crawler/cricketcrawler/spiders/howstat.py

issue-label-bot[bot] commented 3 years ago

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.97. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!


scientes commented 3 years ago

Ehm, I might have to shed some light here as well: yielding items does not decrease performance; requesting pages does. I just found a neat hack for our problem:

http://www.howstat.com/cricket/Statistics/Matches/MatchList_T20.asp?Group=2017010130001231

The URL encodes the range of matches we need as two YYYYMMDD dates back to back: from 1 Jan 2017 to 31 Dec 3000 gives 2017010130001231.

So we only need to crawl these links:

http://www.howstat.com/cricket/Statistics/Matches/MatchList_T20.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/Matches/MatchList_ODI.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/Matches/MatchList.asp?Group=2017010130001231
http://www.howstat.com/cricket/Statistics/IPL/MatchList.asp?Group=2017010130001231
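A rough sketch of how those four links could be generated instead of hard-coded, so the start date is easy to change. The names `group_param`, `start_urls`, `BASE` and `MATCH_LIST_PATHS` are made up for illustration; the only thing taken from the site is the `Group=<start><end>` pattern visible in the URLs above, and the date-range interpretation of that value is inferred from those URLs rather than from any documentation.

```python
from datetime import date

BASE = "http://www.howstat.com/cricket/Statistics"

# The four match-list pages mentioned in this thread
MATCH_LIST_PATHS = [
    "Matches/MatchList_T20.asp",
    "Matches/MatchList_ODI.asp",
    "Matches/MatchList.asp",
    "IPL/MatchList.asp",
]


def group_param(start, end):
    """Concatenate two dates as YYYYMMDD + YYYYMMDD, e.g. 2017010130001231."""
    return start.strftime("%Y%m%d") + end.strftime("%Y%m%d")


def start_urls(start, end):
    """Build one match-list URL per format, limited to the given date range."""
    group = group_param(start, end)
    return [f"{BASE}/{path}?Group={group}" for path in MATCH_LIST_PATHS]


if __name__ == "__main__":
    for url in start_urls(date(2017, 1, 1), date(3000, 12, 31)):
        print(url)
```

The resulting list could be used as the spider's starting URLs, so that pages for older matches are never requested in the first place.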

roysti10 commented 3 years ago


This is an amazing hack. It would reduce the amount of searching by a lot. Thanks! I'll implement this as soon as I'm free.

roysti10 commented 3 years ago

I'm not too sure if this will be needed. I am adding a wontfix label for now, until it's figured out.