Big-Data-FC / scraper

sofifa.com scraper, built to scrape data needed for our project of Big Data Computing 2021-22 at Sapienza University of Rome
MIT License

Only downloads the first 60 players #4

Closed trivius86 closed 1 year ago

trivius86 commented 1 year ago

The script works great, and the missing Overall value is fixed too. But now, for some reason, when I run the scraper it suddenly downloads only the first 60 players, in practice only the first page. I'm attaching the terminal output, which will surely be clearer to you.

[Screenshot, 2023-02-06 16:41:25]

trivius86 commented 1 year ago

I worked around the problem by enabling this setting: USER_AGENT = 'sofifa (+http://www.yourdomain.com)'

Basically, the script was telling the sofifa site that it was a scraper, so it was getting blocked at the first page. Now, however, I get a different error after a certain number of players have been downloaded (roughly between 800 and 1200). I'm attaching the error output.
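
For reference, the setting mentioned above lives in the Scrapy project's `settings.py`. A minimal sketch, assuming the standard Scrapy setting names; every value other than `USER_AGENT` is an illustrative assumption and does not come from this repository:

```python
# settings.py (sketch; values other than USER_AGENT are illustrative assumptions)

# Custom user agent, as enabled in the comment above.
USER_AGENT = "sofifa (+http://www.yourdomain.com)"

# Standard Scrapy settings that slow the crawl down; the numbers are guesses.
DOWNLOAD_DELAY = 1.0             # minimum seconds between consecutive requests
AUTOTHROTTLE_ENABLED = True      # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 1.0   # initial delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = 30.0    # upper bound for the adaptive delay
```

These settings only affect requests issued through Scrapy's own downloader; requests made with a separate HTTP client inside a callback are not throttled by them.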

trivius86 commented 1 year ago

2023-02-07 09:19:55 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): sofifa.com:443
2023-02-07 09:19:55 [urllib3.connectionpool] DEBUG: https://sofifa.com:443 "GET /team/111144/seattle-sounders/?r=230012 HTTP/1.1" 429 None
2023-02-07 09:19:55 [scrapy.core.scraper] ERROR: Spider error processing <GET https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B0%5D=pi&showCol%5B1%5D=ae&showCol%5B2%5D=hi&showCol%5B3%5D=wi&showCol%5B4%5D=pf&showCol%5B5%5D=oa&showCol%5B6%5D=pt&showCol%5B7%5D=bo&showCol%5B8%5D=bp&showCol%5B9%5D=gu&showCol%5B10%5D=jt&showCol%5B11%5D=le&showCol%5B12%5D=vl&showCol%5B13%5D=wg&showCol%5B14%5D=rc&showCol%5B15%5D=ta&showCol%5B16%5D=cr&showCol%5B17%5D=fi&showCol%5B18%5D=he&showCol%5B19%5D=sh&showCol%5B20%5D=vo&showCol%5B21%5D=ts&showCol%5B22%5D=dr&showCol%5B23%5D=cu&showCol%5B24%5D=fr&showCol%5B25%5D=lo&showCol%5B26%5D=bl&showCol%5B27%5D=to&showCol%5B28%5D=ac&showCol%5B29%5D=sp&showCol%5B30%5D=ag&showCol%5B31%5D=re&showCol%5B32%5D=ba&showCol%5B33%5D=tp&showCol%5B34%5D=so&showCol%5B35%5D=ju&showCol%5B36%5D=st&showCol%5B37%5D=sr&showCol%5B38%5D=ln&showCol%5B39%5D=te&showCol%5B40%5D=ar&showCol%5B41%5D=in&showCol%5B42%5D=po&showCol%5B43%5D=vi&showCol%5B44%5D=pe&showCol%5B45%5D=cm&showCol%5B46%5D=td&showCol%5B47%5D=ma&showCol%5B48%5D=sa&showCol%5B49%5D=sl&showCol%5B50%5D=tg&showCol%5B51%5D=gd&showCol%5B52%5D=gh&showCol%5B53%5D=gk&showCol%5B54%5D=gp&showCol%5B55%5D=gr&showCol%5B56%5D=tt&showCol%5B57%5D=bs&showCol%5B58%5D=wk&showCol%5B59%5D=sk&showCol%5B60%5D=aw&showCol%5B61%5D=dw&showCol%5B62%5D=ir&showCol%5B63%5D=pac&showCol%5B64%5D=sho&showCol%5B65%5D=pas&showCol%5B66%5D=dri&showCol%5B67%5D=def&showCol%5B68%5D=phy&r=230012&offset=780> (referer: https://sofifa.com/players?set=true&col=oa&sort=desc&showCol%5B0%5D=pi&showCol%5B1%5D=ae&showCol%5B2%5D=hi&showCol%5B3%5D=wi&showCol%5B4%5D=pf&showCol%5B5%5D=oa&showCol%5B6%5D=pt&showCol%5B7%5D=bo&showCol%5B8%5D=bp&showCol%5B9%5D=gu&showCol%5B10%5D=jt&showCol%5B11%5D=le&showCol%5B12%5D=vl&showCol%5B13%5D=wg&showCol%5B14%5D=rc&showCol%5B15%5D=ta&showCol%5B16%5D=cr&showCol%5B17%5D=fi&showCol%5B18%5D=he&showCol%5B19%5D=sh&showCol%5B20%5D=vo&showCol%5B21%5D=ts&showCol%5B22%5D=dr&showCol%5B23%5D=cu&showCol%5B24%5D=fr&showCol%5B25%5D=lo&showCol%5B26%5D=bl&showCol%5B27%5D=to&showCol%5B28%5D=ac&showCol%5B29%5D=sp&showCol%5B30%5D=ag&showCol%5B31%5D=re&showCol%5B32%5D=ba&showCol%5B33%5D=tp&showCol%5B34%5D=so&showCol%5B35%5D=ju&showCol%5B36%5D=st&showCol%5B37%5D=sr&showCol%5B38%5D=ln&showCol%5B39%5D=te&showCol%5B40%5D=ar&showCol%5B41%5D=in&showCol%5B42%5D=po&showCol%5B43%5D=vi&showCol%5B44%5D=pe&showCol%5B45%5D=cm&showCol%5B46%5D=td&showCol%5B47%5D=ma&showCol%5B48%5D=sa&showCol%5B49%5D=sl&showCol%5B50%5D=tg&showCol%5B51%5D=gd&showCol%5B52%5D=gh&showCol%5B53%5D=gk&showCol%5B54%5D=gp&showCol%5B55%5D=gr&showCol%5B56%5D=tt&showCol%5B57%5D=bs&showCol%5B58%5D=wk&showCol%5B59%5D=sk&showCol%5B60%5D=aw&showCol%5B61%5D=dw&showCol%5B62%5D=ir&showCol%5B63%5D=pac&showCol%5B64%5D=sho&showCol%5B65%5D=pas&showCol%5B66%5D=dri&showCol%5B67%5D=def&showCol%5B68%5D=phy&r=230012&offset=720)
Traceback (most recent call last):
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/utils/defer.py", line 132, in iter_errback
    yield next(it)
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/utils/python.py", line 354, in __next__
    return next(self.data)
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/spidermiddlewares/referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/spidermiddlewares/urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/alessandroagostinelli/opt/anaconda3/lib/python3.9/site-packages/scrapy/core/spidermw.py", line 66, in _evaluate_iterable
    for r in iterable:
  File "/Users/alessandroagostinelli/Documents/scraper-main/src/sofifa/spiders/sofifa.py", line 53, in parse
    "league_name": self.parse_team(team_url),
  File "/Users/alessandroagostinelli/Documents/scraper-main/src/sofifa/spiders/sofifa.py", line 90, in parse_team
    league_name = resp.css(".info a::text").get()[:-4]
TypeError: 'NoneType' object is not subscriptable
2023-02-07 09:19:55 [scrapy.core.engine] INFO: Closing spider (finished)
2023-02-07 09:19:55 [scrapy.extensions.feedexport] INFO: Stored csv feed (857 items) in: out.csv
2023-02-07 09:19:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 48581,
 'downloader/request_count': 15,
 'downloader/request_method_count/GET': 15,
 'downloader/response_bytes': 610316,
 'downloader/response_count': 15,
 'downloader/response_status_count/200': 15,
 'dupefilter/filtered': 12,
 'elapsed_time_seconds': 45.280125,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 2, 7, 8, 19, 55, 296281),
 'httpcompression/response_bytes': 10623734,
 'httpcompression/response_count': 15,
 'item_scraped_count': 857,
 'log_count/DEBUG': 1154,
 'log_count/ERROR': 1,
 'log_count/INFO': 12,
 'memusage/max': 66174976,
 'memusage/startup': 66174976,
 'request_depth_max': 13,
 'response_received_count': 15,
 'scheduler/dequeued': 15,
 'scheduler/dequeued/memory': 15,
 'scheduler/enqueued': 15,
 'scheduler/enqueued/memory': 15,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2023, 2, 7, 8, 19, 10, 16156)}
2023-02-07 09:19:55 [scrapy.core.engine] INFO: Spider closed (finished)
(base) alessandroagostinelli@iMacdiAssandro2 src %

trivius86 commented 1 year ago

For now I've solved it by removing this line: "league_name": self.parse_team(team_url),

But I don't understand what causes the error.
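
The log above shows an HTTP 429 (Too Many Requests) response for the team page immediately before the `TypeError`, logged by urllib3 rather than by Scrapy's downloader, so one plausible cause is rate limiting on the team-page request. A minimal sketch of retrying such a request with exponential backoff; `fetch_with_backoff`, its parameters and the `fetch` callable are hypothetical names and not part of the repository:

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_backoff(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the body of the first response that is not a 429.

    `fetch` is any callable returning (status_code, body) for a URL.
    Returns None if every attempt was answered with 429.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return body
        if attempt < max_retries:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return None
```

Here `fetch` could wrap whatever HTTP client `parse_team` uses; the caller still has to handle a `None` result.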

davquar commented 1 year ago

Hello @trivius86

File "/Users/alessandroagostinelli/Documents/scraper-main/src/sofifa/spiders/sofifa.py", line 90, in parse_team
league_name = resp.css(".info a::text").get()[:-4]
TypeError: 'NoneType' object is not subscriptable

This suggests that the league name cannot be extracted that way. At the moment I cannot help you debug the issue, but from what I can see there, MLS uses a different naming scheme: [United States] Major League Soccer, instead of what the script assumes.

Perhaps you can extend the script so that MLS is scrapable too 😉
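
One way to keep the spider from crashing when the selector matches nothing is to guard the `None` case before slicing. A minimal sketch; `extract_league_name` is a hypothetical helper, and the `[:-4]` slice is simply carried over from the line in the traceback:

```python
from typing import Optional


def extract_league_name(raw: Optional[str]) -> Optional[str]:
    """Drop the last four characters, as the original line does, but tolerate a missing match."""
    if raw is None:
        # The selector's .get() returns None when nothing matches.
        return None
    return raw[:-4]
```

In `parse_team` this could be called as `extract_league_name(resp.css(".info a::text").get())`, which leaves `league_name` empty for pages that do not match instead of raising.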

trivius86 commented 1 year ago

OK, thanks a lot for your help. Indeed, by removing the league name from the scrape I was able to proceed (I didn't need the league names anyway), but you've helped me understand where the problem might be, so I can run some tests. Thanks a lot!

trivius86 commented 1 year ago

Anyway, congratulations: your code is absolutely the best for getting a clean, tidy and precise CSV.

davquar commented 1 year ago

Thank you 😄

altarrok commented 1 year ago

@trivius86 could you please share how you fixed the "Overall" value? Thank you.

trivius86 commented 1 year ago

Here is the link to the issue with the solution:

https://github.com/Big-Data-FC/scraper/issues/3