abjer / isds2020

Introduction to Social Data Science 2020 - a summer school course abjer.github.io/isds2020

Session 6: .json() or .text? #37

Open jacobwiberg opened 4 years ago

jacobwiberg commented 4 years ago

In the first part of session 6, when scraping Jobindex, we called the .json() method on the 'response' object returned by our connector function. In the second half, when scraping Trustpilot, we mostly used the .text attribute of the 'response' object instead.

    if response.ok:
        d = response.json()

and

    if response.ok:
        html = response.text

While I understand that we're looking for links in the second task and a more complete dataset in the first, both of them are scraping tasks. Is there a general rule of thumb for which of the two to use when scraping websites?

jsr-p commented 4 years ago

Hi @jwibmetrics, if we are interested in the HTML of the given page, we use response.text. This is what we want when the data we are after is visible on the page itself.
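As a rough illustration of the HTML case: response.text is just a string, so it still has to be parsed before links can be pulled out of it. The sketch below uses only the standard library's html.parser. The LinkCollector class and the HTML snippet are made up for the example and are not real Trustpilot markup. A dedicated HTML parsing library would do the same job with less code.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML string (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Stand-in for response.text; invented HTML, used so the example runs offline
html = (
    '<html><body>'
    '<a href="/review/example.com">Example</a>'
    '<a href="/review/other.org">Other</a>'
    '<p>No link here</p>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(html)
links = collector.links
print(links)
```

In a real scrape, html would come from response.text after checking response.ok.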

If we know that the URL we are requesting returns JSON, e.g. when we have found a link that the page calls by inspecting the browser's network monitor and looking in the XHR tab, we use response.json().
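One way to turn this rule of thumb into code, as a sketch only: servers usually announce JSON through the Content-Type response header, so that header can decide between response.json() and response.text. The helper names parse_body and fake_response are hypothetical. The fake responses only mimic the three attributes of a requests response used here (headers, text, json()) so the example runs without a network call.

```python
import json
from types import SimpleNamespace


def parse_body(response):
    """Return parsed JSON if the server labels the body as JSON, else the raw text."""
    content_type = response.headers.get("Content-Type", "")
    if "json" in content_type.lower():
        return response.json()
    return response.text


def fake_response(content_type, body):
    # Offline stand-in for a requests response object (assumption: only
    # .headers, .text and .json() are needed for this illustration)
    return SimpleNamespace(
        headers={"Content-Type": content_type},
        text=body,
        json=lambda: json.loads(body),
    )


api_like = fake_response("application/json; charset=utf-8", '{"total": 2}')
page_like = fake_response("text/html; charset=utf-8", "<html><body>Hello</body></html>")

api_data = parse_body(api_like)    # parsed into a Python dict
page_html = parse_body(page_like)  # left as an HTML string
```

With the requests library, the same function could be called on the object returned by requests.get(url). The header check is a heuristic: a server can mislabel its content, in which case response.json() raises an error and falling back to response.text is the safer choice.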

jacobwiberg commented 4 years ago

Makes sense, thanks for the quick response!