Closed: craine closed this issue 3 years ago
I'm 99% sure that this package is not capable of pulling tables (and I'm not sure that it should), at least in a more or less «pretty» way, but anyway, the provided site has several problems:

- It tends to block requests from code by IP. This can probably be solved with a correct User-Agent, but after autoscraper returned nothing I made a request with requests, and the response showed that I was blocked.

At the current stage, if you don't want to build a custom scraper with something like BeautifulSoup and the sites you want to scrape have static content, you can try Pandas, specifically the read_html function, which will try to extract the tables on a page. Also, if you want to scrape content that is loaded by JavaScript, like on the site you provided, you can try something like Selenium to simulate a browser and get the HTML of the page once everything is loaded; see the sketch below.
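A minimal sketch of that Selenium approach, assuming chromedriver is installed and using a placeholder URL (not the site from this thread):

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://example.com/js-rendered-page")  # placeholder URL
time.sleep(5)  # crude wait for scripts to finish; explicit waits are better
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()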
It's not blocking the request because of the User-Agent. They have Incapsula scrape protection, which you can see in the visid_incap and incap_ses cookies in the response.
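For reference, one quick way to spot those cookies (a sketch with a placeholder URL):

import requests

resp = requests.get("https://example.com")  # placeholder URL
print(resp.cookies.get_dict())  # Incapsula sets cookies like visid_incap_* and incap_ses_*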
I'm not focusing specifically on that website alone. I tried a few sites with tables. I know I can go grab stuff with BS or Scrapy but thought your tool was cool as hell and would save me a ton of time.
Can you share the other websites you had trouble with so we can check? Thanks.
Personally I tried out several sites. In wanted_list I enter the items of the first row (in all other examples I will use the first row as well). When I look at the results I see that the program scraped data from all tables on the page: for ['Yes', 'No', 'No', 'No', 'No', 'No'] I got ['Entire program', 'Yes', 'No', 'Containing class', 'Current assembly', 'Derived types', 'Derived types within current assembly']. As far as I remember, the program removes duplicates from the results, and in some cases (like this one) that may be undesirable. Another wanted_list was ['abstract', 'MustInherit', 'internal', 'Friend', 'new', 'Shadows'] and so on. In this case we kinda get what we want, but not in a very good format, so, I guess, one would have to reformat it somehow to work with it further.

Thanks for the examples @PickNickChock. They can really help for diagnosing this and making it better.
Note that you should use '\\a' instead of '\a' so Python escapes the backslash correctly. I checked and there was no problem in either case. For the duplicates, you can use the grouped=True parameter: it will output each column separately without removing duplicates, and you can fine-tune the results from there. Again, I didn't have a problem with it. Also make sure you are using the latest version. See the sketch below.
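A minimal sketch of that grouped=True workflow, with a placeholder URL and a sample wanted_list (not the actual pages from this thread):

from autoscraper import AutoScraper

url = "https://example.com/page-with-tables"  # placeholder URL
wanted_list = ["Yes", "No"]  # sample cell values from the table's first row

scraper = AutoScraper()
scraper.build(url, wanted_list)

# grouped=True returns a dict keyed by rule id, one entry per matched
# group of elements, without de-duplicating the values
result = scraper.get_result_similar(url, grouped=True)
print(result)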
Another page was this: https://www.pro-football-reference.com/teams/buf/2019_advanced.htm
I'd want to grab each table individually.
If you want to get tables from a website, why not use pandas?
import pandas as pd
io = "https://en.wikipedia.org/wiki/Daisy_Ridley"
dfs = pd.read_html(io)
# now dfs is a list of the tables on that page - mostly well formatted and ready to be manipulated
print(dfs[1])  # the second table of io
# out
  | Year | Title | Role | Notes
---|---|---|---|---
0 | 2013 | Lifesaver | Jo | Screen debut; interactive short film[75]
1 | 2013 | Blue Season | Sarah | Short film[75]
2 | 2013 | 100% BEEF | Girl | Short film[76]
3 | 2013 | Crossed Wires | Her | Short film[77]
4 | 2014 | Under | Waitress | Short film[75]
5 | 2015 | Scrawl | Hannah | nan
6 | 2015 | Star Wars: The Force Awakens | Rey | nan
7 | 2016 | Only Yesterday | Taeko Okajima | Voice; English dub
8 | 2016 | The Eagle Huntress | Narrator | Voice; also executive producer
9 | 2017 | Murder on the Orient Express | Mary Debenham | nan
10 | 2017 | Star Wars: The Last Jedi | Rey | nan
11 | 2018 | Ophelia | Ophelia | nan
12 | 2018 | Peter Rabbit | Cottontail Rabbit | Voice; also featured in a short companion piece named Flopsy Turvy
13 | 2019 | Star Wars: The Rise of Skywalker | Rey | nan
14 | 2020 | Asteroid Hunters[78] | Narrator | Voice; post-production
15 | 2021 | Chaos Walking | Viola Eade | Post-production
Furthermore, every table you can see in the HTML was received somehow. You might just be able to request the underlying data from its URL directly.
import requests
url = ""  # the endpoint the page loads its table data from (found e.g. in the browser's Network tab)
req = requests.get(url)
data = req.json()  # works when the endpoint returns JSON
@felipewhitaker

> If you want to get tables from a website, why not use pandas?
Just as I mentioned above. And I guess the reason is that craine would like to do that with autoscraper. Also, sometimes introducing Pandas as another dependency just to grab tables from a site would be overkill.
> Furthermore, every table you can see in the HTML was received somehow. You might just be able to request the underlying data from its URL directly.
That would mean you'd need to create a custom parser with BS4 or whatever, unless the site provides some endpoint that returns the table data as JSON (I guess that this is what req.json() in your message implies). However, again, the point is that craine would like to do this with autoscraper. A rough idea of such a parser is sketched below.
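For illustration, a minimal BS4 table parser of the kind mentioned above (a sketch, assuming a static page and a placeholder URL):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/page-with-tables").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

# collect the cell text of every row of the first table on the page
table = soup.find("table")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)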
@PickNickChock 100%. I know how to use Scrapy and BS4. The beauty of this tool is simplicity. Just thought it'd be a great feature.
Perhaps I missed it somewhere, but it would be great to go here: https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6829/Stages/15151/PlayerStatistics/England-Premier-League-2017-2018
and grab the entire table(s): "Premier League Player Statistics" and "Premier League Assist to Goal Scorer".