GSS-Cogs / family-trade


Scraper not bringing in distributions as expected. #363

Closed Shannon95 closed 2 years ago

Shannon95 commented 2 years ago

The scraper for gov.uk is not bringing in distributions from the following landing page.

Only the title and description are brought in.

image.png

This is blocking #229.

Shannon95 commented 2 years ago

It's this line: https://github.com/GSS-Cogs/gss-utils/blob/b95a5dbc47398e343c66e3ec04646e04c73fdc9f/gssutils/scrapers/govuk.py#L139. gov.uk have changed the class attribute from "metadata" to "govuk-body metadata".

ajtucker commented 2 years ago

@santhosh-thangavel noted that the HTML on gov.uk has an added govuk-body class in the list of classes for the metadata paragraph: image.png

Ideally, we'd use the Gov UK Content API to get this metadata. However, when the metadata scraper was written, only some of the metadata was available through the API; the rest had to be parsed out of the HTML response (although that HTML was itself embedded in the API's output). Things may have changed since then, so it would be worth checking.

Parsing HTML like this is brittle: as far as a machine is concerned, any change to the styling of the output can break our assumptions about what its structure means.

As such, we try to be accommodating when parsing HTML and just focus in on the things we think won't change too much. That way, the metadata scraper will retain some robustness when trivial things change.

We use XPath extensively for navigating through HTML documents. It has decent support in the Python library we use, and it is also usable directly in e.g. Chrome Developer Tools (F12): when looking at the HTML elements in a page, Ctrl+F opens the "Find by string, selector, or XPath" search.

One XPath pattern (I think from Jeni Tennison's XPath books) is to look for a partial match of an element's attribute -- usually the class attribute -- by interpreting the attribute as a list of space separated words/identifiers. In this case, instead of:

div_attach.xpath("p[@class='metadata']")

we could use:

div_attach.xpath("p[contains(concat(' ', @class, ' '), ' metadata ')]")

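This is a bit more long-winded. Note the spaces around the metadata identifier: padding both the class attribute and the identifier with spaces means metadata matches wherever it appears in the space-separated list (first, last, in the middle, or on its own), without also matching a longer class name that merely contains it.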
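
To see why the padding matters, here is a minimal plain-Python sketch of the same string test. The helper names has_class_token and exact_match are made up for illustration and are not part of gss-utils or any XPath library; they only mirror the logic of the two XPath predicates above. The class value "metadata-extra" is likewise a hypothetical look-alike, not something seen on gov.uk.

```python
def has_class_token(class_attr: str, token: str) -> bool:
    """Mirror contains(concat(' ', @class, ' '), ' token ') in plain Python."""
    # Pad both strings with a space so the token is always space-delimited,
    # whether it is first, last, in the middle, or the only class.
    return f" {token} " in f" {class_attr} "


def exact_match(class_attr: str, token: str) -> bool:
    """Mirror the original, stricter predicate: @class='token'."""
    return class_attr == token


examples = [
    "metadata",             # class value the scraper originally expected
    "govuk-body metadata",  # class value after the gov.uk change
    "metadata govuk-body",  # same classes, token first
    "metadata-extra",       # hypothetical look-alike that should NOT match
]

for value in examples:
    print(
        f"{value!r}: "
        f"exact={exact_match(value, 'metadata')}, "
        f"token={has_class_token(value, 'metadata')}"
    )
```

Only the padded test accepts both the old and the new class value while still rejecting the look-alike. Like the XPath expression, this sketch treats only plain spaces as separators; wrapping @class in normalize-space() would be one way to also cope with tabs or newlines, if that ever turns out to be needed.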

Shannon95 commented 2 years ago

Closing as the fix is now in place: https://github.com/GSS-Cogs/gss-utils/releases/tag/v0.14.3