Closed DonFloriano27 closed 2 years ago
Hey @DonFloriano27.
Thanks for your interest in the project. As I mentioned in our discussion, scraping the market_value so that it's added to the dataset should be possible. For that, you will need to update the players crawler.
Provided you've set up your local conda environment as explained here, the players crawler can be run from your local machine as follows
> scrapy crawl players -a parents=samples/clubs.json -s USER_AGENT='some user agent of yours'
Could you let me know if you are able to run the crawler successfully? Once you get there, I'll show you how you can start a local scrapy shell so you can interactively test your xpath/css expressions to scrape the market_value.
Hi @dcaribou! Yes, I set everything up and the crawler works. I familiarized myself with the parents structure and scraped the players of e.g. FC Bayern Munich. Is it correct that the data is limited to {"type": "player", "href": "/dimitri-oberlin/profil/spieler/212718", "parent": {"type": "club", "href": "/fc-bayern-munchen/startseite/verein/27", "seasoned_href": "https://www.transfermarkt.co.uk/fc-bayern-munchen/startseite/verein/27/saison_id/2020"}}, unlike in the sample players.json?
Please introduce me to the scrapy shell. I've never seen or used anything like this before.
That's great @DonFloriano27!

> Is it correct that the data is limited to {"type": "player", "href": "/dimitri-oberlin/profil/spieler/212718", "parent": {"type": "club", "href": "/fc-bayern-munchen/startseite/verein/27", "seasoned_href": "https://www.transfermarkt.co.uk/fc-bayern-munchen/startseite/verein/27/saison_id/2020"}}, unlike in the sample players.json?

No it's not. This is for sure a change on transfermarkt's side. They've changed the HTML for the "player data" section. This happens from time to time, it's the curse of scrapers 😄 I will have a look at it and try to fix it in a separate issue (#29).

> Please introduce me to the scrapy shell. I've never seen or used anything like this before.
The scrapy shell is nothing more than a normal python shell with a few scrapy objects set in the context. I can't really describe it any better than the official scrapy documentation does.
The easiest way for you to launch a scrapy shell for the players crawler with the context you care about is by uncommenting these lines
https://github.com/dcaribou/transfermarkt-scraper/blob/8478d52d8a7da7b3ef3be3deb46e6c802bb0b7d8/tfmkt/spiders/players.py#L45
This will open a shell with a player URL loaded in the context for you to try out some parsing expressions.
Let me know how it goes!
Hey! So I could run the shell, but I have trouble extracting (parsing?) any data. Could you send me an example console session of a shell extraction with the transfermarkt-scraper? THX :)
Hey @DonFloriano27 Once you are in the scrapy shell, you'll have a python shell with some useful scrapy objects loaded in it (you should be able to see a list of available objects at the top of the shell).
response contains the actual HTML from a player's page. You can query the HTML in the response using xpath or css expressions. Like here
https://github.com/dcaribou/transfermarkt-scraper/blob/8478d52d8a7da7b3ef3be3deb46e6c802bb0b7d8/tfmkt/spiders/players.py#L52
In this way you can try your query expressions to get the market value from that page.
Another interesting tip: if you run view(response), it will open the page in a browser. In Chrome you can use the developer tools to inspect the HTML of a page. This is quite helpful for figuring out the css/xpath expressions that you need to extract the portions of the HTML that you are interested in. You should be able to open them with alt + cmd + I.
Finally, I recommend that you have a look at css/xpath querying from within scrapy. Here's the link to the scrapy documentation page on using selectors.
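If you want to iterate on query expressions without a live page, you can also experiment offline. The sketch below is not scrapy: it uses Python's stdlib ElementTree (which supports only a limited XPath subset) on a made-up snippet whose structure only loosely mimics the real page, just to illustrate the difference between a broad query and a refined one. In the real shell you'd use response.xpath(...) or response.css(...) instead.

```python
# Offline sketch: trying out selector-style queries against a saved snippet.
# NOTE: this is stdlib ElementTree, not scrapy; the snippet below is an
# assumption that only roughly mimics transfermarkt's markup.
import xml.etree.ElementTree as ET

snippet = """
<div class="marktwertentwicklung">
  <div class="zeile-oben">
    <div class="left-td">Current market value:</div>
    <div class="right-td"><a>£5.40m</a></div>
  </div>
</div>
"""

root = ET.fromstring(snippet)

# Broad query: grabs every div, analogous to a noisy //text() + getall()
all_divs = root.findall(".//div")
print(len(all_divs))  # → 3

# Refined query: target the one cell you actually care about
value = root.find(".//div[@class='right-td']/a").text
print(value)  # → £5.40m
```

The takeaway is the same in the scrapy shell: narrow the expression until a single .get() returns exactly the value you want.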
Hi @dcaribou!
So in the shell I can extract my desired data with: response.xpath("//div[@class='marktwertentwicklung']//text()").getall()
I will get this output: ['\n ', '\n ', '\n ', '\n ', '\n ', '\n Current market value:\n ', '\n ', '\n ', '£5.40m', ' ', '\n ', '\n ', '\n ', '\n ', 'Last update:', '\n ', '\n ', '\n ', 'Jun 8, 2021', '\n ', '\n ', '\n ', '\n ', '\n Highest market value:\n ', 'Last update:', '\n ', '\n ', '\n £9.00m ', '\n ', 'Dec 17, 2019', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', ' ', '\n ', '\n ', '\n ', 'Market value details', '\n '] Since there is no table formatting, how is it possible to get this data in a clean order?
It looks good! I'd suggest though that you refine your query to get the specific values more easily, since that list you got will be a bit hard to manage. So try to get to something like
current_market_value = response.xpath("your refined xpath query").get()
highest_market_value = response.xpath("your refined xpath query").get()
You then just need to add them to the attributes dict like
attributes['current_market_value'] = current_market_value
attributes['highest_market_value'] = highest_market_value
and then the crawler will produce those new fields in the output JSON.
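To make the pattern concrete, here is a minimal offline sketch of the suggestion above. The HTML shape is an assumption that only loosely mimics transfermarkt's "marktwertentwicklung" box, and stdlib ElementTree stands in for scrapy's response.xpath(...).get(); the point is the one-refined-query-per-value shape and the attributes dict.

```python
# Sketch of the suggested pattern: one refined query per value, results added
# to the attributes dict. ASSUMPTION: the markup below only loosely mimics the
# real page; in the spider these would be response.xpath(...).get() calls.
import xml.etree.ElementTree as ET

html = """
<div class="marktwertentwicklung">
  <div class="zeile-oben"><div class="right-td"><a>£5.40m</a></div></div>
  <div class="zeile-unten"><div class="right-td">£9.00m</div></div>
</div>
"""
root = ET.fromstring(html)

# refined queries, each returning a single node instead of a noisy list
current_market_value = root.find(
    ".//div[@class='zeile-oben']/div[@class='right-td']/a").text.strip()
highest_market_value = root.find(
    ".//div[@class='zeile-unten']/div[@class='right-td']").text.strip()

attributes = {}
attributes['current_market_value'] = current_market_value
attributes['highest_market_value'] = highest_market_value
print(attributes)  # → {'current_market_value': '£5.40m', 'highest_market_value': '£9.00m'}
```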
```python
from tfmkt.spiders.common import BaseSpider
from scrapy.shell import inspect_response # required for debugging
import re
from inflection import parameterize, underscore

class PlayersSpider(BaseSpider):
  name = 'players'

  def parse(self, response, parent):
    """Parse a club's page to collect all player URLs.

    @url https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019
    @returns requests 34 34
    @cb_kwargs {"parent": "dummy"}
    """
    player_hrefs = response.css(
        'a.spielprofil_tooltip::attr(href)'
      ).getall()
    without_duplicates = list(set(player_hrefs))
    for href in without_duplicates:
      cb_kwargs = {
        'base': {
          'type': 'player',
          'href': href,
          'parent': parent
        }
      }
      yield response.follow(href, self.parse_details, cb_kwargs=cb_kwargs)

  def parse_details(self, response, base):
    """Extract player details from the main page. It currently only parses the PLAYER DATA section.

    @url https://www.transfermarkt.co.uk/joel-mumbongo/profil/spieler/381156
    @returns items 1 1
    @cb_kwargs {"base": {"href": "some_href", "type": "player", "parent": {}}}
    @scrapes href type parent
    """
    # uncommenting the two lines below will open a scrapy shell with the context of this request
    # when you run the crawler. this is useful for developing new extractors
    # inspect_response(response, self)
    # exit(1)

    # parse 'PLAYER DATA' section
    attributes = {}
    # count the "left" (label) cells so we can walk the matching "right" (value)
    # cells by position
    labels = response.xpath("//div/span[@class='player-data-personal-info__content player-data-personal-info__content--left']").getall()
    for number in range(1, len(labels) + 1):
      data_path = "//span[@class='player-data-personal-info__content player-data-personal-info__content--left'][{0}]".format(number)
      key = response.xpath(data_path + '//text()').get().strip()
      data_path = "//span[@class='player-data-personal-info__content player-data-personal-info__content--right'][{0}]".format(number)
      # try extracting the value as text
      value = response.xpath(data_path + '//text()').get()
      if not value or len(value.strip()) == 0:
        # if text extraction fails, attempt 'href' extraction
        href = response.xpath(data_path + '//@href').get()
        if href and len(href.strip()) > 0:
          value = {
            'href': href
          }
        # if both text and href extraction fail, it must be a text + image kind of cell:
        # "approximate" the parsing by extracting the 'title' property
        else:
          value = response.xpath(data_path + '//img/@title').get()
      else:
        value = value.strip()
      attributes[key] = value

    # get the market value of the player
    key = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-oben']//div[@class='left-td']//text()").get().strip()
    value = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-oben']//div[@class='right-td']//a/text()").get().strip()
    attributes[key] = value
    key = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-unten']//div[@class='left-td']//text()").get().strip()
    value = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-unten']//div[@class='right-td']//text()").get().strip()
    attributes[key] = value

    yield {
      **base,
      **attributes
    }
```
It's not very beautiful, but it does the job :D Maybe you'll find some improvements. Testing it with different clubs, it didn't throw an error.
It looks good! Can you submit your changes in a pull request as suggested in the contribute section of the README? This way I can review the changes, some automatic tests will run, and if it all looks good we can merge your contribution and the new field will be available for use in transfermarkt-datasets.
I haven't seen your changes... but I added the market value now.
As I said in the beginning, I want the data of all German clubs from the first, second, third and maybe even fourth tier. But this doesn't work because the competitions crawler doesn't produce these competitions. I tried editing the clubs.json manually like this:
{"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "parent": {"type": "league", "href": "/2-bundesliga/startseite/wettbewerb/L2"}}
But it doesn't work.. any suggestions without editing all the crawlers?
> I haven't seen your changes... but I added the market value now.

Are you going to raise a pull request for these changes?
About scraping the German lower tiers, I have to say I have never tried it. If the structure of the competition page for the lower tiers is the same as for the first-tier ones, it should work though.
Yup, just tried with your sample clubs.json as parent and it did scrape the players as expected.
> scrapy crawl players -a parents=second_tier_club.json -s USER_AGENT=<user agent>
and I got something like
{"type": "player", "href": "/scott-kennedy/profil/spieler/418300", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": "Scott Fitzgerald Kennedy", "date_of_birth": "Mar 31, 1997", "place_of_birth": {"country": "Canada", "city": "Calgary"}, "age": "24", "height": "1,90 m", "citizenship": "Canada", "position": "Centre-Back", "player_agent": {"href": "/haspel-sportconsulting/beraterfirma/berater/1243", "name": "Haspel Sportconsulting"}, "current_club": {"href": "/ssv-jahn-regensburg/startseite/verein/109"}, "foot": "left", "joined": "Aug 18, 2020", "contract_expires": "Jun 30, 2023", "day_of_last_contract_extension": null, "outfitter": null}
{"type": "player", "href": "/florian-heister/profil/spieler/333661", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": null, "date_of_birth": "Mar 2, 1997", "place_of_birth": {"country": "Germany", "city": "Neuss"}, "age": "24", "height": "1,76 m", "citizenship": "Germany", "position": "Right Midfield", "player_agent": {"href": "/omegasports-dr-g-zarotis-consulting/beraterfirma/berater/2766", "name": "OmegaSports Dr. G. Zarotis"}, "current_club": {"href": "/fc-viktoria-koln/startseite/verein/1622"}, "foot": "both", "joined": "Jul 1, 2021", "contract_expires": "Jun 30, 2024", "day_of_last_contract_extension": null, "outfitter": null, "social_media": ["http://www.instagram.com/florian_heister/"]}
{"type": "player", "href": "/sebastian-stolze/profil/spieler/157927", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": null, "date_of_birth": "Jan 29, 1995", "place_of_birth": {"country": "Germany", "city": "Leinefelde"}, "age": "26", "height": "1,82 m", "citizenship": "Germany", "position": "Right Winger", "player_agent": {"href": "/karl-m-herzog-sportmanagement/beraterfirma/berater/76", "name": "Karl M. Herzog Sportmanagement"}, "current_club": {"href": "/hannover-96/startseite/verein/42"}, "foot": "right", "joined": "Jul 1, 2021", "contract_expires": "Jun 30, 2024", "day_of_last_contract_extension": null, "outfitter": null}
{"type": "player", "href": "/max-besuschkow/profil/spieler/207578", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": "\u041c\u0430\u043a\u0441 \u0411\u0435\u0437\u0443\u0448\u043a\u043e\u0432", "date_of_birth": "May 31, 1997", "place_of_birth": {"country": "Germany", "city": "T\u00fcbingen"}, "age": "24", "height": "1,87 m", "citizenship": "Germany", "position": "Central Midfield", "player_agent": {"href": "/haspel-sportconsulting/beraterfirma/berater/1243", "name": "Haspel Sportconsulting"}, "current_club": {"href": "/ssv-jahn-regensburg/startseite/verein/109"}, "foot": "right", "joined": "Jul 1, 2019", "contract_expires": "Jun 30, 2022", "day_of_last_contract_extension": null, "outfitter": null, "social_media": ["http://www.instagram.com/maxbesuschkow/"]}
> Are you going to raise a pull request for these changes?

Just done!
> Yup, just tried with your sample clubs.json as parent and it did scrape the players as expected.

Oh great.. it just worked for me too. Probably some typo when I tried. Awesome! Thanks so much :)
Hey @DonFloriano27. I've just merged your changes to the main branch. If there are no other comments, I believe we can close this issue.
Figure out scraping of the market_value and update this function, so that the market_value of each player is added to the player details. Then include the data in the dataset.