alphagov / govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app.
https://docs.data-community.publishing.service.gov.uk/tools/govgraph/
MIT License

Replace w3m with selenium and chromedriver #555

Closed nacnudus closed 8 months ago

nacnudus commented 1 year ago

Using w3m to render HTML to plain text seems hacky. We have to set a very high line-wrap width to stop paragraphs being split across lines, and making a system call from Python for each document might be slow.
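For context, the kind of call being described looks roughly like the sketch below. It is not the repository's actual code: the function names and the column width are made up for illustration, and it assumes w3m is on the PATH.

```python
import shutil
import subprocess

# Hypothetical width, chosen only to be larger than any paragraph.
WRAP_COLUMNS = 100000


def w3m_command(columns: int = WRAP_COLUMNS) -> list[str]:
    """Build a w3m command that reads HTML on stdin and dumps plain text."""
    return ["w3m", "-dump", "-T", "text/html", "-cols", str(columns)]


def html_to_text_w3m(html: str) -> str:
    """Render HTML to plain text by starting one w3m process per call."""
    result = subprocess.run(
        w3m_command(), input=html, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("w3m"):
        print(html_to_text_w3m("<h2>Heading</h2><p>One\nparagraph.</p>"))
    else:
        print("w3m is not installed")
```

Every call starts a new process, which is the overhead the issue is worried about.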

Selenium is widely used, usually with chromedriver, which drives Chrome, one of the fastest browsers. The .text property of an element renders plain text in the way that we need: one line per paragraph, ignoring \n in the source and putting headings such as <h2>Heading</h2> on their own line. It is available in Alpine.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)

driver = webdriver.Chrome(options=chrome_options)

# Parse an HTML string with Selenium https://stackoverflow.com/a/52498445/937932
html_content = """
<div class=\"govspeak\"><h2 id=\"companies-house-official-statistics-quality-information\">Companies House official statistics: Quality information</h2>\n\n<p>This document provides information on the quality of Companies House’s official statistics, to\nenable users to judge whether or not the data are of sufficient quality for their intended use.\nThe information is structured in terms of the quality dimensions of the <a rel=\"external\" href=\"http://ec.europa.eu/eurostat/documents/64157/4392716/ESS-QAF-V1-2final.pdf/bbf5970c-1adf-46c8-afc3-58ce177a0646\" class=\"govuk-link\">European Statistical\nSystem</a>.</p>\n\n<h3 id=\"relevance\">Relevance</h3>\n\n<p>(Relevance is the degree to which the statistical product meets user needs for both coverage\nand content)</p>\n\n<p>Companies House incorporates and dissolves limited companies, registers the information\ncompanies are legally required to supply, and makes that information available to the public.\nIt is responsible for the official government register of UK companies. To this end\nCompanies House holds the most comprehensive record of company activity in the UK. The\nstatistics releases provide figures both for the UK as a whole, and for England and Wales,\nScotland, and Northern Ireland individually. Annual figures cover the most recent five years\nof data for the constituent countries and the UK as a whole. Data back to 1939 are published\nfor Great Britain up to 2008 to 2009 and for the United Kingdom from 2009 to 2010 onwards. Monthly\nfigures covering the most recent five years were published both for the UK as a whole and\nfor the constituent countries until July 2016; these monthly publications have now been\nreplaced with quarterly figures following a user consultation earlier in 2016.</p>\n\n<p>Companies House carries out regular user engagement activity to ensure its products meet\nusers’ needs. 
The most recent user consultation ran from March to May 2016 and covered\nboth the content and frequency of Companies House’s official statistics publications. In\naddition to the change from monthly to quarterly publications described above, changes to\nthe format of the publications were also made as a result of the feedback received.</p>\n\n<p>In addition to this formal user engagement activity, Companies House welcomes feedback\nfrom users of its statistics. This feedback can be provided by emailing\n<a href=\"mailto:statistics@companieshouse.gov.uk\" class=\"govuk-link\">statistics@companieshouse.gov.uk</a>.</p>\n\n<h3 id=\"accuracy-and-reliability\">Accuracy and reliability</h3>\n\n<p>(This refers to the closeness between an estimated or stated result and the [unknown] true\nvalue)</p>\n\n<p>Companies House is the organisation responsible for incorporating and dissolving\ncompanies, and all companies are required to register with Companies House to legally\nexist. The statistics should, therefore, be a complete record of company activity in the UK.</p>\n\n<p>Companies House is the official source of information on company activity in the UK and the\nstatistics are based on information submitted to Companies House by or on behalf of\ncompanies. Companies House has limited power to verify this information. In addition,\ncompanies are registered at Companies House regardless of whether they go on to trade\nactively or not. The statistics reported do not distinguish between active and inactive\ncompanies. These factors should be borne in mind when considering the figures reported by\nCompanies House.</p>\n\n<h3 id=\"timeliness-and-punctuality\">Timeliness and punctuality</h3>\n\n<p>(Timeliness refers to the elapsed time between publication and the period to which the data\nrefer. Punctuality refers to the time lag between the actual and planned dates of publication)</p>\n\n<p>Statistical releases are published on a quarterly and annual basis. 
Quarterly statistics are\npublished on the last Thursday in the month following the end of the period being reported.\nAnnual statistics cover a financial year and are published in the summer following the end of\nthe year. These are the earliest publication dates that allow the compilation of the statistical\nrelease in its final form ready for publication.</p>\n\n<p>Publications are pre-announced on the <a href=\"https://www.gov.uk/government/statistics/announcements?utf8=%E2%9C%93&amp;organisations%5B%5D=companies-house\" class=\"govuk-link\">gov.uk release calendar</a> and the statistics have\nalways been published on schedule.</p>\n\n<h3 id=\"coherence\">Coherence</h3>\n\n<p>(Coherence is the degree to which data are derived from different sources or methods, but\nwhich refer to the same phenomenon, are similar)</p>\n\n<p>This section provides brief information on how these statistics relate to selected business\nstatistics. More detailed information can be found in the <a rel=\"external\" href=\"https://www.ons.gov.uk/businessindustryandtrade/business/activitysizeandlocation/methodologies/businesspopulation\" class=\"govuk-link\">Guide to the Business Population</a> between business statistics. It focusses on the differences between estimates of the\nbusiness population and includes a range of related statistics.</p>\n\n<p>Further information on the difference between companies and business is available in the\nguidance document <a href=\"https://www.gov.uk/government/publications/definitions-to-accompany-our-statistical-releases/companies-house-official-statistics-definitions-to-accompany-statistical-releases#companies-and-businesses\" class=\"govuk-link\">Definitions to accompany statistical releases</a>.</p>\n\n<h3 id=\"company-incorporations-and-business-creation\">Company incorporations and business creation</h3>\n\n<p>There is a range of official data sources available to monitor business creation, but each has\na slightly different coverage. 
Taken together, they provide a good overall picture of the trend\nin business creation activity. Individually, each source will be suitable for different, specific\npurposes. Sources include:</p>\n\n<ul>\n  <li>Companies House Incorporations – new company registrations, including those not\nactively trading. Incorporations are one source of statistics on business creation. They\nprovide information on newly formed companies that are added to the Companies House\nregister. Incorporated Companies can go on to trade actively, but some will be dormant\ncompanies that do not trade actively. Companies House Incorporations do not capture\nbusiness start-ups of other business types such as those starting up as an\nunincorporated sole proprietorship or partnership.</li>\n  <li>\n<a rel=\"external\" href=\"https://www.ons.gov.uk/businessindustryandtrade/business/activitysizeandlocation/bulletins/businessdemography/previousReleases\" class=\"govuk-link\">Business Demography</a>, which provides information for businesses registering for VAT or\nPAYE. This is an annual release that provides information on business ‘births’, defined\nas new registrations for VAT or PAYE. Business Demography does not capture the\nsmallest, non-employing business start-ups which do not register for VAT or PAYE.</li>\n</ul>\n\n<h3 id=\"business-population\">Business population</h3>\n\n<p>There are a number of official statistics that provide information on the size of the business\npopulation. Each source will be suitable for different specific purposes. Sources include:</p>\n\n<ul>\n  <li>Companies House provides information on the total number of incorporated companies\nthat are filing documents to Companies House. 
Two figures are provided: the ‘total’\nregister, which includes companies that are trading, dormant and in the process of \nliquidation or dissolution, and the ‘effective’ register, which includes those trading,\ndormant and in receivership, but excludes those companies in the process of liquidation\nor dissolution.</li>\n  <li>\n<a href=\"https://www.gov.uk/government/collections/business-population-estimates\" class=\"govuk-link\">Business Population Estimates</a>, which provide the only estimate of the total UK business\npopulation. It includes information on incorporated companies and unincorporated sole\nproprietorships or partnerships.</li>\n  <li>\n<a rel=\"external\" href=\"https://www.ons.gov.uk/businessindustryandtrade/business/activitysizeandlocation/bulletins/ukbusinessactivitysizeandlocation/previousReleases\" class=\"govuk-link\">UK Business</a>, which provides more detail on the business population that has registered\nfor VAT or PAYE.</li>\n</ul>\n\n<h3 id=\"company-insolvency\">Company insolvency</h3>\n\n<p><a href=\"https://www.gov.uk/government/statistics?departments%5B%5D=insolvency-service\" class=\"govuk-link\">The Insolvency Service</a> reports the most complete picture on insolvency statistics, including\ncompany liquidations and individual insolvencies, as it has policy responsibility for all forms\nof corporate insolvency in England and Wales. Compulsory liquidations published in the\ntables that accompany this release differ from those published by the Insolvency Service.\nThe Insolvency Service’s compulsory liquidations statistics are sourced from their\nadministrative systems. All other forms of company insolvency published by the Insolvency\nService are on the same basis as those published by Companies House.</p>\n\n<h3 id=\"comparability\">Comparability</h3>\n\n<p>(Comparability refers to the degree to which data can be compared over time and domain)</p>\n\n<p>Figures for 1979 to 2008 are for Great Britain. 
In October 2009, the Northern Ireland register\nmerged with the register for Great Britain to create a UK register. Figures from 2009 to 2010\nonwards are for the UK and therefore are not directly comparable with earlier figures.</p>\n\n<p>Figures for 1979 to 1986 are for the calendar year 1 January to 31 December. Those for\n1986 to 1987 onwards are for the financial year 1 April to 31 March and are not directly\ncomparable with those for earlier years.</p>\n\n<p>The period 2009 to 2010 was a time of significant change for the register:</p>\n\n<ul>\n  <li>The Northern Ireland register was included to create a UK register, as described above;</li>\n  <li>There was a change in the administrative system that forms the register;</li>\n  <li>There was a purge on the register to remove defunct companies that had spent an\nextended period in the process of dissolution or liquidation;</li>\n  <li>Legislative changes were introduced under the Companies Act 2006, which reduced the\ntime taken to dissolve companies and remove them from the register.</li>\n</ul>\n\n<p>These changes in combination are likely to have had an impact on both the numbers of\nincorporations and dissolutions, and on the sizes of the total and effective registers. Care\nshould be taken when comparing figures from 2009 to 2010 onwards with those from earlier\nyears.</p>\n\n<h3 id=\"accessibility-and-clarity\">Accessibility and clarity</h3>\n\n<p>(Accessibility is the ease with which users are able to access the data. It also relates to the\nformat in which the data are available and the availability of supporting information. Clarity\nrefers to the quality and sufficiency of metadata, illustrations and accompanying advice)</p>\n\n<p>Companies House’s statistics are available free of charge to the end user on the Companies\nHouse website. 
They are released via the gov.uk <a href=\"https://www.gov.uk/government/statistics/announcements?utf8=%E2%9C%93&amp;organisations%5B%5D=companies-house\" class=\"govuk-link\">Publication Hub</a>. Historic data are also published on the <a rel=\"external\" href=\"http://webarchive.nationalarchives.gov.uk/20141104103730/http:/www.companieshouse.gov.uk/about/statisticsAndSurveys.shtml\" class=\"govuk-link\">National Archives</a> website.</p>\n\n<p>The statistical releases are published in html format and contain additional commentary\nexplaining the main findings. Data tables are published as Excel files. The quarterly data is\nalso published in csv format, with annual data being published in this format for the 2015 to 2016\npublication onwards.</p>\n\n<p>Views on the clarity of the publication are welcomed by emailing\n<a href=\"mailto:statistics@companieshouse.gov.uk\" class=\"govuk-link\">statistics@companieshouse.gov.uk</a>.</p>\n</div>
"""

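# Percent-encode the HTML: an unescaped '#' would start a URL fragment and
# truncate the page, and sequences such as '%E2' would be decoded by the browser.
from urllib.parse import quote

driver.get("data:text/html;charset=utf-8," + quote(html_content))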

el = driver.find_element(By.TAG_NAME, 'div')
print(el.text)

for el in driver.find_elements(By.TAG_NAME, 'a'):
    print(el.text)
    print(el.get_attribute('href'))

# quit() ends the session and the chromedriver process; close() only closes the window
driver.quit()

nacnudus commented 11 months ago

Could be done as part of https://github.com/alphagov/govuk-knowledge-graph-gcp/issues/559

nacnudus commented 8 months ago

Done with pandoc, after finding that selenium and chromedriver were intolerably slow.
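The thread doesn't say how pandoc was invoked, so the following is only a minimal sketch of one way it could be called from Python. The function names are hypothetical and the choice of flags is an assumption; --wrap=none is the pandoc option that stops paragraphs being hard-wrapped, which matches the one-line-per-paragraph requirement described above.

```python
import shutil
import subprocess


def pandoc_command() -> list[str]:
    """Build a pandoc command that converts HTML on stdin to unwrapped plain text."""
    return ["pandoc", "--from=html", "--to=plain", "--wrap=none"]


def html_to_text_pandoc(html: str) -> str:
    """Convert an HTML string to plain text by piping it through pandoc."""
    result = subprocess.run(
        pandoc_command(), input=html, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    if shutil.which("pandoc"):
        print(html_to_text_pandoc("<h2>Heading</h2>\n<p>One\nparagraph.</p>"))
    else:
        print("pandoc is not installed")
```

This still makes one system call per document, so it trades the browser start-up cost of Selenium for a lighter-weight process.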