dcaribou / transfermarkt-scraper

🕸️ Collects data from Transfermarkt website

Add market_value of player to scraper and dataset #28

Closed · DonFloriano27 closed this issue 2 years ago

DonFloriano27 commented 3 years ago

Figure out how to scrape market_value and update this function so that the market_value of each player is added to the player details. Then include the data in the dataset.

dcaribou commented 3 years ago

Hey @DonFloriano27.

Thanks for your interest in the project. As I mentioned in our discussion, scraping the market_value so it's added to the dataset should be possible. For that, you will need to update the players crawler.

Provided you've set up your local conda environment as explained here, the players crawler can be run from your local machine as follows:

> scrapy crawl players -a parents=samples/clubs.json -s USER_AGENT='some user agent of yours'

Could you let me know if you are able to run the crawler successfully? Once you get there, I'll show you how you can start a local scrapy shell so you can interactively test your XPath/CSS expressions to scrape the market_value.

DonFloriano27 commented 3 years ago

Hi @dcaribou! Yes, I set up everything and the crawler works. I familiarized myself with the parents structure and scraped the players of, e.g., FC Bayern Munich. Is it correct that the data is limited to {"type": "player", "href": "/dimitri-oberlin/profil/spieler/212718", "parent": {"type": "club", "href": "/fc-bayern-munchen/startseite/verein/27", "seasoned_href": "https://www.transfermarkt.co.uk/fc-bayern-munchen/startseite/verein/27/saison_id/2020"}} unlike in the sample players.json?

DonFloriano27 commented 3 years ago

Please introduce me to the scrapy shell. I've never seen or used anything like this.

dcaribou commented 3 years ago

That's great, @DonFloriano27!

> Is it correct that the data is limited to {"type": "player", "href": "/dimitri-oberlin/profil/spieler/212718", "parent": {"type": "club", "href": "/fc-bayern-munchen/startseite/verein/27", "seasoned_href": "https://www.transfermarkt.co.uk/fc-bayern-munchen/startseite/verein/27/saison_id/2020"}} unlike in the sample players.json?

No, it's not. This is for sure a change on Transfermarkt's side: they've changed the HTML for the "player data" section. This happens from time to time, it's the curse of scrapers 😄 I will have a look at it and try to fix it in a separate issue (#29).

> Please introduce me to the scrapy shell. I've never seen or used anything like this.

The scrapy shell is nothing more than a normal Python shell with a few scrapy objects preloaded in the context. I can't really describe it any better than the official scrapy documentation does.

The easiest way for you to launch a scrapy shell for the players crawler with the context you care about is by uncommenting these lines: https://github.com/dcaribou/transfermarkt-scraper/blob/8478d52d8a7da7b3ef3be3deb46e6c802bb0b7d8/tfmkt/spiders/players.py#L45. This will open a shell with a player URL loaded in the context, so you can try out some parsing expressions.
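
For reference, those two lines look roughly like this (paraphrased here; the linked source file is the authority):

# inside PlayersSpider.parse_details: uncommenting these two lines drops you into an
# interactive shell where `response` already points at the player page being parsed
inspect_response(response, self)
exit(1)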

Let me know how it goes!

DonFloriano27 commented 3 years ago

Hey! So I could run the shell, but I have trouble extracting (parsing?) any data. Could you send me an example of console code for a shell extraction with the transfermarkt-scraper? Thanks :)

dcaribou commented 3 years ago

Hey @DonFloriano27. Once you are in the scrapy shell, you'll have a Python shell with some useful scrapy objects loaded into it (you should see a list of available objects at the top of the shell).

[Screenshot from 2021-09-21: scrapy shell startup banner listing the available objects]

response contains the actual HTML from a player's page. You can query the HTML in the response using XPath or CSS expressions, like here: https://github.com/dcaribou/transfermarkt-scraper/blob/8478d52d8a7da7b3ef3be3deb46e6c802bb0b7d8/tfmkt/spiders/players.py#L52

In this way you can try your query expressions to get the market value from that page.

Another interesting tip: if you run view(response), it will open the page in a browser. In Chrome you can then use the developer tools to inspect the HTML of the page, which is quite helpful for figuring out the CSS/XPath expressions you need to extract the portions of the HTML you are interested in. You should be able to open them with alt + cmd + I.

Finally, I recommend that you have a look at CSS/XPath querying from within scrapy. Here's the link to the scrapy documentation page on using selectors.
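
For example, once you're inside the shell you can try things like the following (the h1 selectors are just placeholders to show the mechanics, not the actual Transfermarkt markup):

# run inside the scrapy shell; `response` and `view` are preloaded by scrapy
response.css('h1::text').get()             # first text node matching a CSS selector
response.xpath('//h1//text()').getall()    # all text nodes matching an XPath expression
view(response)                             # open the downloaded page in your browser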

DonFloriano27 commented 3 years ago

Hi @dcaribou!

So in the shell I can extract my desired data with: response.xpath("//div[@class='marktwertentwicklung']//text()").getall()

I get this output:

['\n ', '\n ', '\n ', '\n ', '\n ', '\n Current market value:\n ', '\n ', '\n ', '£5.40m', ' ', '\n ', '\n ', '\n ', '\n ', 'Last update:', '\n ', '\n ', '\n ', 'Jun 8, 2021', '\n ', '\n ', '\n ', '\n ', '\n Highest market value:\n ', 'Last update:', '\n ', '\n ', '\n £9.00m ', '\n ', 'Dec 17, 2019', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', ' ', '\n ', '\n ', '\n ', 'Market value details', '\n ']

Since there is no table formatting, how is it possible to get this data in a clean order?

dcaribou commented 3 years ago

It looks good! I'd suggest, though, that you refine your query to get at the specific values more easily, since the array you got will be a bit hard to manage. So try to get to something like:

current_market_value = response.xpath("your refined xpath query").get()
highest_market_value = response.xpath("your refined xpath query").get()

You then just need to add them to the attributes dict like:

attributes['current_market_value'] = current_market_value
attributes['highest_market_value'] = highest_market_value

and then the crawler will produce those new fields in the output JSON.
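
For illustration only, a refined query could look something like the sketch below. The marktwertentwicklung container comes from the expression you already used; the inner zeile-oben/zeile-unten class names are guesses you would want to confirm with the browser developer tools:

# sketch: verify the class names in the scrapy shell before relying on them
current_market_value = response.xpath(
  "//div[@class='marktwertentwicklung']//div[@class='zeile-oben']//div[@class='right-td']//text()"
).get()
highest_market_value = response.xpath(
  "//div[@class='marktwertentwicklung']//div[@class='zeile-unten']//div[@class='right-td']//text()"
).get()
attributes['current_market_value'] = current_market_value.strip() if current_market_value else None
attributes['highest_market_value'] = highest_market_value.strip() if highest_market_value else None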

DonFloriano27 commented 3 years ago

from tfmkt.spiders.common import BaseSpider
from scrapy.shell import inspect_response # required for debugging
import re
from inflection import parameterize, underscore

class PlayersSpider(BaseSpider):
  name = 'players'

  def parse(self, response, parent):
    """Parse club's page to collect all player's urls.

      @url https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019
      @returns requests 34 34
      @cb_kwargs {"parent": "dummy"}
    """

    player_hrefs = response.css(
      'a.spielprofil_tooltip::attr(href)'
    ).getall()

    without_duplicates = list(set(player_hrefs))

    for href in without_duplicates:

      cb_kwargs = {
        'base': {
          'type': 'player',
          'href': href,
          'parent': parent
        }
      }

      yield response.follow(href, self.parse_details, cb_kwargs=cb_kwargs)

  def parse_details(self, response, base):
    """Extract player details from the main page. It currently only parses the PLAYER DATA section.

      @url https://www.transfermarkt.co.uk/joel-mumbongo/profil/spieler/381156
      @returns items 1 1
      @cb_kwargs {"base": {"href": "some_href", "type": "player", "parent": {}}}
      @scrapes href type parent
    """

    # uncommenting the two lines below will open a scrapy shell with the context of this request
    # when you run the crawler. this is useful for developing new extractors

    # inspect_response(response, self)
    # exit(1)

    # response.xpath("//div[@class='player-data-personal-info ']/span//text()").getall()
    # current_market_value = response.xpath("//div[@class='right-td']/text()").getall()
    # attributes['current_market_value'] = current_market_value

    # parse 'PLAYER DATA' section
    attributes = {}

    counter = response.xpath("//div/span[@class='player-data-personal-info__content player-data-personal-info__content--left']//text()").getall()
    counter = len(counter)

    for number in range(1, counter):
      data_path = "//span[@class='player-data-personal-info__content player-data-personal-info__content--left'][{0}]".format(number)
      key = response.xpath(data_path + '//text()').get().strip()

      data_path = "//span[@class='player-data-personal-info__content player-data-personal-info__content--right'][{0}]".format(number)
      # try extracting the value as text
      value = response.xpath(data_path + '//text()').get()
      if not value or len(value.strip()) == 0:
        # if text extraction fails, attempt 'href' extraction
        href = response.xpath(data_path + '//@href').get()
        if href and len(href.strip()) > 0:
          value = {
            'href': href
          }
        # if both text and href extraction fail, it must be a text + image kind of cell:
        # "approximate" the parsing by extracting the 'title' property
        else:
          text = response.xpath(data_path + '//img/@title').get()
          value = text
      else:
        value = value.strip()
      attributes[key] = value

    # get the market value of the player
    key = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-oben']//div[@class='left-td']//text()").get().strip()
    value = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-oben']//div[@class='right-td']//a/text()").get().strip()
    attributes[key] = value
    key = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-unten']//div[@class='left-td']//text()").get().strip()
    value = response.xpath("//div[@class='marktwertentwicklung']//div[@class='zeile-unten']//div[@class='right-td']//text()").get().strip()
    attributes[key] = value

    yield {
      **base,
      **attributes
    }

DonFloriano27 commented 3 years ago

It's not very beautiful, but it does the job :D Maybe you'll find some improvements. Testing it with different clubs, it didn't throw an error.

dcaribou commented 3 years ago

It looks good! Can you submit your changes in a pull request, as suggested in the contribute section of the README? This way I can review the changes, some automatic tests will run, and if it all looks good we can merge your contribution and the new field will be available for use in transfermarkt-datasets.
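
One small thing you could harden, either in this PR or later (this is just a sketch, not something that is already in the code): response.xpath(...).get() returns None when a player has no market value section, so the chained .strip() calls would raise an AttributeError. A tiny helper avoids that:

# hypothetical helper, shown for illustration only
def first_text(response, query):
  """Return the first stripped text match for an XPath query, or None if nothing matches."""
  value = response.xpath(query).get()
  return value.strip() if value else None

key = first_text(response, "//div[@class='marktwertentwicklung']//div[@class='zeile-oben']//div[@class='left-td']//text()")
value = first_text(response, "//div[@class='marktwertentwicklung']//div[@class='zeile-oben']//div[@class='right-td']//a/text()")
if key and value:
  attributes[key] = value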

DonFloriano27 commented 3 years ago

I haven't seen your changes... but I added the market value now.

As I said in the beginning, I want the data of all German clubs from the first, second, third and maybe even fourth tier. But this doesn't work because the competitions crawler doesn't include these competitions. I tried editing the clubs.json manually like this:

{"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "parent": {"type": "league", "href": "/2-bundesliga/startseite/wettbewerb/L2"}}

But it doesn't work... any suggestions that don't involve editing all the crawlers?

dcaribou commented 3 years ago

> I haven't seen your changes... but I added the market value now.

Are you going to raise a pull request for these changes?

About scraping the German lower tiers, I have to say I have never tried it. If the structure of the competition page for lower tiers is the same as for first-tier ones, though, it should work.

dcaribou commented 3 years ago

Yup, just tried with your sample clubs.json as parent and it did scrape the players as expected.

> scrapy crawl players -a parents=second_tier_club.json -s USER_AGENT=<user agent>

and I got something like

{"type": "player", "href": "/scott-kennedy/profil/spieler/418300", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": "Scott Fitzgerald Kennedy", "date_of_birth": "Mar 31, 1997", "place_of_birth": {"country": "Canada", "city": "Calgary"}, "age": "24", "height": "1,90 m", "citizenship": "Canada", "position": "Centre-Back", "player_agent": {"href": "/haspel-sportconsulting/beraterfirma/berater/1243", "name": "Haspel Sportconsulting"}, "current_club": {"href": "/ssv-jahn-regensburg/startseite/verein/109"}, "foot": "left", "joined": "Aug 18, 2020", "contract_expires": "Jun 30, 2023", "day_of_last_contract_extension": null, "outfitter": null}
{"type": "player", "href": "/florian-heister/profil/spieler/333661", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": null, "date_of_birth": "Mar 2, 1997", "place_of_birth": {"country": "Germany", "city": "Neuss"}, "age": "24", "height": "1,76 m", "citizenship": "Germany", "position": "Right Midfield", "player_agent": {"href": "/omegasports-dr-g-zarotis-consulting/beraterfirma/berater/2766", "name": "OmegaSports Dr. G. Zarotis"}, "current_club": {"href": "/fc-viktoria-koln/startseite/verein/1622"}, "foot": "both", "joined": "Jul 1, 2021", "contract_expires": "Jun 30, 2024", "day_of_last_contract_extension": null, "outfitter": null, "social_media": ["http://www.instagram.com/florian_heister/"]}
{"type": "player", "href": "/sebastian-stolze/profil/spieler/157927", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": null, "date_of_birth": "Jan 29, 1995", "place_of_birth": {"country": "Germany", "city": "Leinefelde"}, "age": "26", "height": "1,82 m", "citizenship": "Germany", "position": "Right Winger", "player_agent": {"href": "/karl-m-herzog-sportmanagement/beraterfirma/berater/76", "name": "Karl M. Herzog Sportmanagement"}, "current_club": {"href": "/hannover-96/startseite/verein/42"}, "foot": "right", "joined": "Jul 1, 2021", "contract_expires": "Jun 30, 2024", "day_of_last_contract_extension": null, "outfitter": null}
{"type": "player", "href": "/max-besuschkow/profil/spieler/207578", "parent": {"type": "club", "href": "/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020", "seasoned_href": "https://www.transfermarkt.co.uk/ssv-jahn-regensburg/startseite/verein/109/saison_id/2020/saison_id/2020"}, "name_in_home_country": "\u041c\u0430\u043a\u0441 \u0411\u0435\u0437\u0443\u0448\u043a\u043e\u0432", "date_of_birth": "May 31, 1997", "place_of_birth": {"country": "Germany", "city": "T\u00fcbingen"}, "age": "24", "height": "1,87 m", "citizenship": "Germany", "position": "Central Midfield", "player_agent": {"href": "/haspel-sportconsulting/beraterfirma/berater/1243", "name": "Haspel Sportconsulting"}, "current_club": {"href": "/ssv-jahn-regensburg/startseite/verein/109"}, "foot": "right", "joined": "Jul 1, 2019", "contract_expires": "Jun 30, 2022", "day_of_last_contract_extension": null, "outfitter": null, "social_media": ["http://www.instagram.com/maxbesuschkow/"]}
DonFloriano27 commented 3 years ago

> Are you going to raise a pull request for these changes?

Just done!

> Yup, just tried with your sample clubs.json as parent and it did scrape the players as expected.

Oh great... it just worked for me too. Probably some typo when I tried. Awesome! Thanks so much :)

dcaribou commented 3 years ago

Hey @DonFloriano27. I've just merged your changes to the main branch. If there are no other comments, I believe we can close this issue.