Webcrawler - Githubissues

scientes commented 3 years ago

resolves #3

It's not completely finished. There is still one bug I haven't been able to fix; it lies within the crawling framework, and I can't figure out why it's not working. I've disabled that part for now, and it is only needed if we want the long name of the player and the country he is playing for. (For people who know Scrapy: every request.meta dictionary entry I make vanishes by the time the request reaches its parse function, which for now stops me from passing the player's short name to the parse function that determines the long name and the country he's playing for.)

But for the rest here's an example output:

U G Dowe.csv

matchid,date
Matches/MatchScorecard.asp?MatchCode=0714,1973-02-16
Matches/MatchScorecard.asp?MatchCode=0693,1972-02-16
Matches/MatchScorecard.asp?MatchCode=0684,1971-04-13
Matches/MatchScorecard.asp?MatchCode=0683,1971-04-01

roysti10 commented 3 years ago

Hey, I would like to test this to get a general idea of how it works. Are there any major bugs in this, or just minor ones?

scientes commented 3 years ago

Only minor bugs, and the affected sections are disabled, so everything that is enabled runs. Word of warning: it takes about 5k-6k seconds to complete due to rate limiting. Press Ctrl+C to end it, and a second time to force it. How to install and run it is in the readme file. The code that scrapes the pages is in the spider folder; it generates items, which are processed by the classes in Pipelines.py.

roysti10 commented 3 years ago

Hey, just one thing: the crawler wastes a lot of time going through matches from the 1850s to the 1990s. When you think about it, nobody from that era plays anymore. Is it possible to add some kind of filter to it?
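
A minimal sketch of such a filter, assuming the output keeps the `matchid,date` CSV layout from the example above with ISO dates. The cutoff year, the function name `recent_matches`, and the second sample row are made up for illustration; inside the crawler the same date check would have to run before a match link is followed to actually save crawl time.

```python
import csv
import io
from datetime import date

# Assumed cutoff; pick whatever era is still relevant for current players.
CUTOFF = date(1990, 1, 1)


def recent_matches(rows, cutoff=CUTOFF):
    """Yield only the rows whose ISO-formatted date is on or after the cutoff."""
    for row in rows:
        if date.fromisoformat(row["date"]) >= cutoff:
            yield row


# First data row is from the example output; the second one is invented.
SAMPLE = """matchid,date
Matches/MatchScorecard.asp?MatchCode=0714,1973-02-16
Matches/MatchScorecard.asp?MatchCode=2001,2005-07-21
"""

kept = list(recent_matches(csv.DictReader(io.StringIO(SAMPLE))))
```

Here `kept` holds only the invented 2005 row; the 1973 match is dropped.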

roysti10 commented 3 years ago

It looks good. The only issue would be the full names + country, as it is not able to find them. When I scraped them initially from http://www.howstat.com/cricket/Statistics/Players/PlayerListCurrent.asp, each name in it had a player ID attached to it. That player ID, when added to http://www.howstat.com/cricket/Statistics/Players/, would take me to the page of that player, from where I took the name. Hope this helps!
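
A rough sketch of the URL step described above, assuming each entry in the player list carries a relative link that includes the player ID. The link text `PlayerPage.asp?PlayerID=1234` is a hypothetical placeholder, not the site's real page name, and `player_page_url` is an illustrative helper.

```python
from urllib.parse import urljoin

# Base URL quoted in the comment above.
PLAYERS_BASE = "http://www.howstat.com/cricket/Statistics/Players/"


def player_page_url(relative_link):
    """Resolve a relative player link against the players base URL."""
    return urljoin(PLAYERS_BASE, relative_link)


# Hypothetical relative link; the real page name and ID format may differ.
example = player_page_url("PlayerPage.asp?PlayerID=1234")
```

Because the base URL ends in a slash, `urljoin` simply appends the relative link to it.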

scientes commented 3 years ago

Sorry, I'm very short on time at the moment.

roysti10 commented 3 years ago

@scientes No problem! Do it at your own pace. Since this is an enhancement, it's no problem if it takes some time.

roysti10 commented 3 years ago

@scientes As requested by you, I have shifted everything to a single CSV in the master branch. Coming to long names, I would actually prefer them, because there are a lot of players with the same initials and that might cause errors. The same goes for countries.

roysti10 commented 3 years ago

@scientes Hey, it would be great if you could give me an update on when you will resume. Thanks!

scientes commented 3 years ago

In a week or so at the latest.

scientes commented 3 years ago

My current solution is not the best: I'm not filtering the old games out at the moment, but I found a way to get the player names working.

roysti10 commented 3 years ago

That's great! You can push those changes to your branch and I'll test it out. We can make the web crawler a WIP project if needed, since you seem to be short on time. I created a new branch for this PR, so please redirect it there, and I'll merge it there. You can continue to help once you are free. Once the conflicts are resolved, we can merge it. Thank you for this! I really appreciate your help. This really is an important part of our project 😄

scientes commented 3 years ago

My current status: I get the player data in one file (id, name, gametype, retired), and I get the match IDs in another file (grouped in folders by ODI/T20/TEST).

Missing:

[ ] bowling
[ ] batting
[ ] wicketkeeping
[ ] keeping track across different runs so that we can crawl incrementally
[ ] a small pandas script which maybe cleans and reformats the data
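
One possible shape for the incremental-crawl item, as a minimal sketch: remember already-crawled match IDs in a small JSON state file between runs. The file name `seen_matches.json` and the helpers `load_seen`, `save_seen`, and `crawl_once` are assumptions for illustration, not part of the existing crawler.

```python
import json
import tempfile
from pathlib import Path


def load_seen(state_file):
    """Return the set of IDs recorded by earlier runs (empty if no state yet)."""
    path = Path(state_file)
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()


def save_seen(state_file, seen):
    """Persist the set of seen IDs as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(seen)))


def crawl_once(state_file, discovered_ids):
    """Return the IDs not seen in earlier runs and record them as seen."""
    seen = load_seen(state_file)
    fresh = [match_id for match_id in discovered_ids if match_id not in seen]
    save_seen(state_file, seen | set(fresh))
    return fresh


# Two simulated runs against a temporary state file.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "seen_matches.json"
    first_run = crawl_once(state, ["0683", "0684"])
    second_run = crawl_once(state, ["0684", "0693"])
```

The second run only reports `0693`, since `0684` was already recorded by the first.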

roysti10 commented 3 years ago

I believe this is good enough. Once this is merged, I'll make new issues for some of them and start working on them. The only conflicting file seems to be the .gitignore file, which should be easy to resolve. I recommend you sync your webcrawler to the repo's webcrawler branch, and then we can merge 🎉

EDIT: Since it looks like you are successful in segregating records, I'm guessing issues #7, #17, and partly #5 also get fixed. I'll be closing those issues as well, then.

roysti10 commented 3 years ago

Also @scientes, even though I merged #20, I didn't see your name in the contributors list for some reason. Just wanted to let you know; I don't want it to happen in this case too.

scientes commented 3 years ago

I've put some sample data in the output folders. It's not the complete data from a crawl.

roysti10 commented 3 years ago

Looks good. We will probably need to set up a DB server in the future. I'll be merging this soon, then.

roysti10 commented 3 years ago

@scientes Wonderful job! Thanks for contributing!! 😄 🎉

scientes commented 3 years ago

I think the problem with #20 is that you need to contribute a certain amount of code to the default branch to be counted. I think 4 SLOC or so is the minimum.

HackerSpace-PESU / Best11-Fantasycricket

Webcrawler #22