alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.09k stars 635 forks

Pulling tables would be awesome #25

Closed craine closed 3 years ago

craine commented 3 years ago

Perhaps I missed it somewhere, but it would be great to go here: https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6829/Stages/15151/PlayerStatistics/England-Premier-League-2017-2018

And grab the entire table(s): Premier League Player Statistics Premier League Assist to Goal Scorer

PickNickChock commented 3 years ago

I'm 99% sure that this package is not capable of pulling tables (and I'm not sure that it should be), at least in a more or less "pretty" way. But in any case, the site you provided has several problems:

  1. It tends to block automated requests by IP. A correct User-Agent header might help, but after autoscraper returned nothing, I made a request with requests and the response showed that I was blocked.
  2. The table contents are loaded via JavaScript, i.e. when autoscraper or any other library downloads the page, there is no content to scrape yet.

At the current stage, if you don't want to build a custom scraper with something like BeautifulSoup and the sites you want to scrape have static content, you can try Pandas, specifically the read_html function, which will try to extract the tables on the page.

Also, if you want to scrape content that is loaded by JavaScript, like on the site you provided, you can try something like Selenium to simulate a browser and get the HTML of the page after everything has loaded.
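A minimal sketch of that combination (assuming Chrome and chromedriver are installed; the fetch_rendered_html helper and the example URL are illustrative, not part of either library):

```python
# Sketch: fetch the fully rendered page with Selenium, then scrape the HTML.
# This is one way to combine the two libraries, not an official recipe.

def fetch_rendered_html(url: str) -> str:
    """Return the page source after JavaScript has run."""
    from selenium import webdriver  # lazy import so the sketch loads without Selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# html = fetch_rendered_html("https://example.com/some-js-table")
# If your autoscraper version supports it, build() can take pre-fetched HTML:
# AutoScraper().build(html=html, wanted_list=[...])
```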

ghost commented 3 years ago

  1. It tends to block automated requests by IP. A correct User-Agent header might help, but after autoscraper returned nothing, I made a request with requests and the response showed that I was blocked.

It's not blocking the request because of the User-Agent. They use Incapsula scrape protection, which you can see in the request's cookies visid_incap and incap_ses.

craine commented 3 years ago

I'm not focusing on that website specifically; I tried a few sites with tables. I know I can go grab stuff with BS or Scrapy, but I thought your tool was cool as hell and would save me a ton of time.

alirezamika commented 3 years ago

Can you share the other websites you had trouble with so we can check? Thanks.

PickNickChock commented 3 years ago

I personally tried several sites:

  1. Wikipedia, this page for example. I wanted to get the Theatre table, so as wanted_list I entered the items of its first row (all the other examples below use the first row as well). Looking at the results, I saw that the program had scraped data from all the tables on the page.
  2. Then I tried this and this site. Each has only one table, so I thought everything should go fine. However, in both cases the program returned None. The problem is probably that the wanted items include escape characters.
  3. I also tried to scrape one tricky table from this page. As the wanted list I wrote ['Yes', 'No', 'No', 'No', 'No', 'No'] and got ['Entire program', 'Yes', 'No', 'Containing class', 'Current assembly', 'Derived types', 'Derived types within current assembly']. As far as I remember, the program removes duplicates from the results, and in some cases (like this one) that may be undesirable.
  4. Finally, I found a simple table here. Enter the first row, scrape, and voilà: we get ['abstract', 'MustInherit', 'internal', 'Friend', 'new', 'Shadows'] and so on. In this case we kind of get what we want, but not in a very good format, so one would have to reformat it somehow to work with it further.

alirezamika commented 3 years ago

Thanks for the examples @PickNickChock. They really help with diagnosing and improving it.

  1. Yes. There's a known issue with multiple tables that share a similar structure and path.
  2. Maybe you didn't escape the characters in the wanted list? You should put '\\a' instead of '\a' so that Python escapes it. I checked, and there was no problem in either case.
  3. For scraping tables, I recommend using the grouped=True parameter. It outputs each column separately without removing duplicates, so you can fine-tune the results. Again, I didn't have a problem with it.
  4. Same as 3.
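The escaping point in item 2 is plain Python string behavior and easy to verify: '\a' is a single BEL control character, while '\\a' (or the raw string r'\a') is a backslash followed by the letter a:

```python
# '\a' is one control character (BEL), not a backslash plus 'a':
print(len('\a'), ord('\a'))    # 1 7
# To get a literal backslash-a, escape the backslash or use a raw string:
print(len('\\a'), len(r'\a'))  # 2 2
print('\\a' == r'\a')          # True
```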

Also make sure you are using the latest version.
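A sketch of the grouped workflow described in item 3 (scrape_grouped is a hypothetical wrapper, not part of autoscraper's API; as far as I can tell, the grouped=True flag lives on get_result_similar in recent versions):

```python
# Hypothetical helper illustrating grouped scraping with autoscraper.
# The URL and wanted_list are placeholders you would supply yourself.

def scrape_grouped(url, wanted_list):
    """Build rules from one row of the table, then return results grouped per rule."""
    from autoscraper import AutoScraper  # lazy import: sketch loads without the package

    scraper = AutoScraper()
    scraper.build(url=url, wanted_list=wanted_list)
    # grouped=True is expected to return a dict keyed by rule id, one result
    # list per matched rule, with duplicates preserved
    return scraper.get_result_similar(url, grouped=True)

# columns = scrape_grouped("https://example.com/table-page", ["Yes", "No"])
```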

craine commented 3 years ago

Another page was this: https://www.pro-football-reference.com/teams/buf/2019_advanced.htm I'd want to grab each table individually.

felipewhitaker commented 3 years ago

If you want to get tables from a website, why not use pandas?

import pandas as pd

io = "https://en.wikipedia.org/wiki/Daisy_Ridley"
dfs = pd.read_html(io)

# now dfs is a list of the tables at {io} - mostly well formatted and ready to be manipulated

print(dfs[1]) # the second table of {io}
# out
Year Title Role Notes
0 2013 Lifesaver Jo Screen debut; interactive short film[75]
1 2013 Blue Season Sarah Short film[75]
2 2013 100% BEEF Girl Short film[76]
3 2013 Crossed Wires Her Short film[77]
4 2014 Under Waitress Short film[75]
5 2015 Scrawl Hannah nan
6 2015 Star Wars: The Force Awakens Rey nan
7 2016 Only Yesterday Taeko Okajima Voice; English dub
8 2016 The Eagle Huntress Narrator Voice; also executive producer
9 2017 Murder on the Orient Express Mary Debenham nan
10 2017 Star Wars: The Last Jedi Rey nan
11 2018 Ophelia Ophelia nan
12 2018 Peter Rabbit Cottontail Rabbit Voice; also featured in a short companion piece named Flopsy Turvy
13 2019 Star Wars: The Rise of Skywalker Rey nan
14 2020 Asteroid Hunters[78] Narrator Voice; post-production
15 2021 Chaos Walking Viola Eade Post-production

Furthermore, every table you can see in the HTML was fetched somehow. You might be able to request that data directly from the URL.

import requests

url = ""  # the endpoint the table data comes from, if the site exposes one
req = requests.get(url)
data = req.json()  # only works if the endpoint returns JSON, not an HTML page

PickNickChock commented 3 years ago

@felipewhitaker

If you want to get tables from a website, why not use pandas?

Just as I mentioned above. And I guess the reason is that craine would like to do that with autoscraper. Also, introducing Pandas as another dependency just to grab tables from a site can sometimes be overkill.

Furthermore, every table you can see in the HTML was fetched somehow. You might be able to request that data directly from the URL.

That would mean you'd need to create a custom parser with BS4 or whatever, unless the site provides an endpoint that returns the table data as JSON (I guess that's what req.json() in your message implies). However, again, the point is that craine would like to do this with autoscraper.
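For static pages, a "custom parser" doesn't have to mean a new dependency either. A minimal table extractor can be written with the stdlib's html.parser (a sketch on made-up example HTML; it ignores colspan/rowspan and nested tables):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])      # start a new table
        elif tag == "tr":
            self._row = []              # start a new row
        elif tag in ("td", "th"):
            self._cell = []             # start collecting cell text

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:      # only keep text inside a cell
            self._cell.append(data)

html = ("<table><tr><th>Modifier</th><th>VB</th></tr>"
        "<tr><td>abstract</td><td>MustInherit</td></tr></table>")
p = TableExtractor()
p.feed(html)
print(p.tables)  # [[['Modifier', 'VB'], ['abstract', 'MustInherit']]]
```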

craine commented 3 years ago

@PickNickChock 100%. I know how to use Scrapy and BS4. The beauty of this tool is simplicity. Just thought it'd be a great feature.

ubalklen commented 1 year ago

@craine I agree, tables should be auto-scrapable.

In the meantime, I created Untable, a tiny module that does exactly that.