alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.09k stars 635 forks

Pulling tables would be awesome #25

Closed craine closed 3 years ago

craine commented 3 years ago

Perhaps I missed it somewhere, but it would be great to go here: https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6829/Stages/15151/PlayerStatistics/England-Premier-League-2017-2018

And grab the entire table(s): Premier League Player Statistics Premier League Assist to Goal Scorer

PickNickChock commented 3 years ago

I'm 99% sure that this package is not capable of pulling tables (and I'm not sure that it should be), at least in a more or less "pretty" way. But in any case, the site you provided has several problems:

  1. It tends to block automated requests by IP. A correct User-Agent header might help, but after autoscraper returned nothing, I made a request with requests and the response showed that I was blocked.
  2. The table contents are loaded via JavaScript, i.e. when autoscraper or any other library downloads the page, there is no content to scrape yet.

At the current stage, if you don't want to build a custom scraper with something like BeautifulSoup and the sites you want to scrape have static content, you can try Pandas, specifically the read_html function, which will try to extract the tables on the page.

Also, if you want to scrape content that is loaded by JavaScript, like on the site you provided, you can try something like Selenium to simulate a browser and get the HTML of the page after everything has loaded.
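A minimal sketch of that combination (assuming Chrome and chromedriver are installed; the fetch_rendered_html helper and the example URL are illustrative, not part of either library):

```python
# Sketch: fetch the fully rendered page with Selenium, then scrape the HTML.
# This is one way to combine the two libraries, not an official recipe.

def fetch_rendered_html(url: str) -> str:
    """Return the page source after JavaScript has run."""
    from selenium import webdriver  # lazy import so the sketch loads without Selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# html = fetch_rendered_html("https://example.com/some-js-table")
# If your autoscraper version supports it, build() can take pre-fetched HTML:
# AutoScraper().build(html=html, wanted_list=[...])
```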

ghost commented 3 years ago

  1. It tends to block automated requests by IP. A correct User-Agent header might help, but after autoscraper returned nothing, I made a request with requests and the response showed that I was blocked.

It's not blocking the request because of the User-Agent. They use Incapsula scrape protection, which you can see in the request's cookies visid_incap and incap_ses.

craine commented 3 years ago

I'm not focusing on that website specifically; I tried a few sites with tables. I know I can go grab stuff with BS or Scrapy, but I thought your tool was cool as hell and would save me a ton of time.

alirezamika commented 3 years ago

Can you share the other websites you had trouble with so we can check? Thanks.

PickNickChock commented 3 years ago

I personally tried several sites:

  1. Wikipedia, this page for example. I wanted to get the Theatre table, so as wanted_list I entered the items of its first row (all the other examples below use the first row as well). Looking at the results, I saw that the program had scraped data from all the tables on the page.
  2. Then I tried this and this site. Each has only one table, so I thought everything should go fine. However, in both cases the program returned None. The problem is probably that the wanted items include escape characters.
  3. I also tried to scrape one tricky table from this page. As the wanted list I wrote ['Yes', 'No', 'No', 'No', 'No', 'No'] and got ['Entire program', 'Yes', 'No', 'Containing class', 'Current assembly', 'Derived types', 'Derived types within current assembly']. As far as I remember, the program removes duplicates from the results, and in some cases (like this one) that may be undesirable.
  4. Finally, I found a simple table here. Enter the first row, scrape, and voilà: we get ['abstract', 'MustInherit', 'internal', 'Friend', 'new', 'Shadows'] and so on. In this case we kind of get what we want, but not in a very good format, so one would have to reformat it somehow to work with it further.

alirezamika commented 3 years ago

Thanks for the examples @PickNickChock. They really help with diagnosing and improving it.

  1. Yes. There's a known issue with multiple tables that share a similar structure and path.
  2. Maybe you didn't escape the characters in the wanted list? You should put '\\a' instead of '\a' so that Python escapes it. I checked, and there was no problem in either case.
  3. For scraping tables, I recommend using the grouped=True parameter. It outputs each column separately without removing duplicates, so you can fine-tune the results. Again, I didn't have a problem with it.
  4. Same as 3.
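The escaping point in item 2 is plain Python string behavior and easy to verify: '\a' is a single BEL control character, while '\\a' (or the raw string r'\a') is a backslash followed by the letter a:

```python
# '\a' is one control character (BEL), not a backslash plus 'a':
print(len('\a'), ord('\a'))    # 1 7
# To get a literal backslash-a, escape the backslash or use a raw string:
print(len('\\a'), len(r'\a'))  # 2 2
print('\\a' == r'\a')          # True
```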

Also make sure you are using the latest version.
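A sketch of the grouped workflow described in item 3 (scrape_grouped is a hypothetical wrapper, not part of autoscraper's API; as far as I can tell, the grouped=True flag lives on get_result_similar in recent versions):

```python
# Hypothetical helper illustrating grouped scraping with autoscraper.
# The URL and wanted_list are placeholders you would supply yourself.

def scrape_grouped(url, wanted_list):
    """Build rules from one row of the table, then return results grouped per rule."""
    from autoscraper import AutoScraper  # lazy import: sketch loads without the package

    scraper = AutoScraper()
    scraper.build(url=url, wanted_list=wanted_list)
    # grouped=True is expected to return a dict keyed by rule id, one result
    # list per matched rule, with duplicates preserved
    return scraper.get_result_similar(url, grouped=True)

# columns = scrape_grouped("https://example.com/table-page", ["Yes", "No"])
```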

craine commented 3 years ago

Another page was this: https://www.pro-football-reference.com/teams/buf/2019_advanced.htm I'd want to grab each table individually.

felipewhitaker commented 3 years ago

If you want to get tables from a website, why not use pandas?

import pandas as pd

io = "https://en.wikipedia.org/wiki/Daisy_Ridley"
dfs = pd.read_html(io)

# now dfs is a list of the tables at {io} - mostly well formatted and ready to be manipulated

print(dfs[1]) # the second table of {io}
# out
Year Title Role Notes
0 2013 Lifesaver Jo Screen debut; interactive short film[75]
1 2013 Blue Season Sarah Short film[75]
2 2013 100% BEEF Girl Short film[76]
3 2013 Crossed Wires Her Short film[77]
4 2014 Under Waitress Short film[75]
5 2015 Scrawl Hannah nan
6 2015 Star Wars: The Force Awakens Rey nan
7 2016 Only Yesterday Taeko Okajima Voice; English dub
8 2016 The Eagle Huntress Narrator Voice; also executive producer
9 2017 Murder on the Orient Express Mary Debenham nan
10 2017 Star Wars: The Last Jedi Rey nan
11 2018 Ophelia Ophelia nan
12 2018 Peter Rabbit Cottontail Rabbit Voice; also featured in a short companion piece named Flopsy Turvy
13 2019 Star Wars: The Rise of Skywalker Rey nan
14 2020 Asteroid Hunters[78] Narrator Voice; post-production
15 2021 Chaos Walking Viola Eade Post-production

Furthermore, every table you can see in the HTML was fetched somehow. You might be able to request that data directly from the URL.

import requests

url = ""  # the endpoint the table data comes from, if the site exposes one
req = requests.get(url)
data = req.json()  # only works if the endpoint returns JSON, not an HTML page

PickNickChock commented 3 years ago

@felipewhitaker

If you want to get tables from a website, why not use pandas?

Just as I mentioned above. And I guess the reason is that craine would like to do that with autoscraper. Also, introducing Pandas as another dependency just to grab tables from a site can sometimes be overkill.

Furthermore, every table you can see in the HTML was fetched somehow. You might be able to request that data directly from the URL.

That would mean you'd need to create a custom parser with BS4 or whatever, unless the site provides an endpoint that returns the table data as JSON (I guess that's what req.json() in your message implies). However, again, the point is that craine would like to do this with autoscraper.
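For static pages, a "custom parser" doesn't have to mean a new dependency either. A minimal table extractor can be written with the stdlib's html.parser (a sketch on made-up example HTML; it ignores colspan/rowspan and nested tables):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])      # start a new table
        elif tag == "tr":
            self._row = []              # start a new row
        elif tag in ("td", "th"):
            self._cell = []             # start collecting cell text

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:      # only keep text inside a cell
            self._cell.append(data)

html = ("<table><tr><th>Modifier</th><th>VB</th></tr>"
        "<tr><td>abstract</td><td>MustInherit</td></tr></table>")
p = TableExtractor()
p.feed(html)
print(p.tables)  # [[['Modifier', 'VB'], ['abstract', 'MustInherit']]]
```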

craine commented 3 years ago

@PickNickChock 100%. I know how to use Scrapy and BS4. The beauty of this tool is simplicity. Just thought it'd be a great feature.

ubalklen commented 1 year ago

@craine I agree, tables should be auto-scrapable.

In the meantime, I created Untable, a tiny module that does exactly that.