alphagov / govuk-knowledge-graph-gcp

GOV.UK content data and cloud infrastructure for the GovSearch app.
https://docs.data-community.publishing.service.gov.uk/tools/govgraph/
MIT License

Extract phone numbers #530

Closed nacnudus closed 8 months ago

nacnudus commented 1 year ago

https://trello.com/c/mSjX7Dq0/2353-think-about-phone-numbers

Users often seem to search for phone numbers, but they might not be getting a complete set of results. This could lead to outdated phone numbers remaining on GOV.UK.

We ought to consider mitigating this problem.

Evidence of user need: https://docs.google.com/document/d/1auqzEXTiwAgNPG6PfxDDG7rSt7ImShKxEOo_zBaUsAE/edit#heading=h.aowahbaamoq

One way would be to use Google's libphonenumber to detect phone numbers in GOV.UK content, and format them in a standard form. Similarly, format searched-for phone numbers in a standard form, and then do a lookup.
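The lookup idea can be sketched in a few lines. This is only an illustration, not the project's implementation: `to_e164_gb`, `build_index` and `search` are hypothetical names, and `to_e164_gb` is a deliberately simplified stand-in for libphonenumber's parsing that assumes every number is British and does almost no validation. The two indexed numbers are taken from the sample output further down this issue.

```python
import re
from collections import defaultdict
from typing import Optional


def to_e164_gb(raw: str) -> Optional[str]:
    """Simplified stand-in for libphonenumber: normalise a UK number to E.164.

    Hypothetical helper for illustration only. It assumes the number is
    British and applies only a rough length check, not real validation.
    """
    # Drop the optional "(0)" trunk marker, as in "+44 (0)20 7397 3008".
    cleaned = raw.replace("(0)", "").strip()
    digits = re.sub(r"\D", "", cleaned)
    if cleaned.startswith("+44"):
        national = digits[2:]
    elif digits.startswith("0044"):
        national = digits[4:]
    elif digits.startswith("0"):
        national = digits[1:]
    else:
        return None
    # Rough sanity check (an assumption of this sketch, not a numbering rule).
    if len(national) not in (9, 10):
        return None
    return "+44" + national


def build_index(rows):
    """Map each normalised number to the set of URLs it was found on."""
    index = defaultdict(set)
    for url, raw in rows:
        key = to_e164_gb(raw)
        if key is not None:
            index[key].add(url)
    return index


index = build_index(
    [
        ("https://www.gov.uk/adrodd-pryder-am-atwrnai-dirprwy-warcheidwad", "0115 934 2777"),
        ("https://www.gov.uk/anti-money-laundering-registration", "020 7397 3008"),
    ]
)


def search(query: str):
    """Normalise the searched-for number the same way, then look it up."""
    key = to_e164_gb(query)
    return sorted(index.get(key, set())) if key else []


print(search("01159342777"))
print(search("+44 (0)20 7397 3008"))
```

Because both the content and the query pass through the same normaliser, differences in spacing or international prefix no longer prevent a match.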

The script below uses a Python implementation of libphonenumber. I also wrote my own Python bindings to Google's C++ implementation, and they had exactly the same performance :smiling_face_with_tear:

# Extract phone numbers from a text column of a CSV file.

import argparse
import sys
import csv
import phonenumbers
import phonenumbers.geocoder

# Allow the largest field size possible.
# https://stackoverflow.com/a/15063941
maxInt = sys.maxsize
while True:
    # decrease the maxInt value by factor 10
    # as long as the OverflowError occurs.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)
csv.field_size_limit(maxInt)

writer = csv.DictWriter(
    sys.stdout,
    fieldnames=["url", "phonenumber", "start", "raw_string", "type", "country", "location"],
)
# Emit the column names as the first row of the output.
writer.writeheader()

region = "GB"
locale = "en"
number_format = phonenumbers.PhoneNumberFormat.E164

with open("../body_content.csv", newline="") as csvfile:
    reader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
    for line in reader:
        row_dict = {"url": line["url"]}
        text = line["text_without_blank_lines"]
        for match in phonenumbers.PhoneNumberMatcher(text, region):
            if phonenumbers.is_valid_number(match.number):
                row_dict["phonenumber"] = phonenumbers.format_number(
                    match.number, number_format
                )
                row_dict["start"] = match.start
                row_dict["raw_string"] = match.raw_string
                row_dict["type"] = phonenumbers.PhoneNumberType.to_string(
                    phonenumbers.number_type(match.number)
                )
                row_dict["country"] = phonenumbers.geocoder.country_name_for_number(
                    match.number, locale
                )
                # The second argument is a language code, so pass the locale
                # rather than the region.
                row_dict["location"] = phonenumbers.geocoder.description_for_number(
                    match.number, locale
                )
                writer.writerow(row_dict)

| url | phonenumber | start | raw_string | type | country | location |
| --- | --- | --- | --- | --- | --- | --- |
| https://www.gov.uk/ad-dalu-gordaliadau-budd-dal-plant | +443002001900 | 3018 | 0300 200 1900 | UAN | | |
| https://www.gov.uk/adrodd-pryder-am-atwrnai-dirprwy-warcheidwad | +441159342777 | 1537 | 0115 934 2777 | FIXED_LINE | United Kingdom | Nottingham |
| https://www.gov.uk/adrodd-pryder-am-atwrnai-dirprwy-warcheidwad | +441159342778 | 1564 | 0115 934 2778 | FIXED_LINE | United Kingdom | Nottingham |
| https://www.gov.uk/anti-money-laundering-registration | +441415822000 | 1557 | 0141 582 2000 | FIXED_LINE | United Kingdom | Glasgow |
| https://www.gov.uk/anti-money-laundering-registration | +442073973008 | 1688 | 020 7397 3008 | FIXED_LINE | United Kingdom | London |
| https://www.gov.uk/anti-money-laundering-registration | +441914930272 | 1897 | 0191 4930272 | FIXED_LINE | United Kingdom | Tyneside |
| https://www.gov.uk/anti-money-laundering-registration | +442073400551 | 2093 | 020 7340 0551 | FIXED_LINE | United Kingdom | London |
| https://www.gov.uk/anti-money-laundering-registration | +441234845777 | 2381 | 01234 845777 | FIXED_LINE | United Kingdom | Bedford |
| https://www.gov.uk/anti-money-laundering-registration | +442073400550 | 2990 | 020 7340 0550 | FIXED_LINE | United Kingdom | London |

nacnudus commented 9 months ago

There are three ways to obtain phone numbers:

There are 3432 phone numbers in pages with schema_name='contact' in the Content Store, and there are another 154 that aren’t in the Content Store because they don’t have a base_path, which means they are only visible (either online or in GovSearch) when they are embedded in other pages. That isn’t as many as I feared.

# Expensive query! 100GB+
CREATE OR REPLACE TABLE
  test.documents_now AS
WITH
  latest_edition_per_document AS (
  SELECT
    *
  FROM
    publishing.editions QUALIFY ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY updated_at DESC) = 1 )
SELECT
  documents.content_id,
  documents.locale,
  latest_edition_per_document.*
FROM
  latest_edition_per_document
INNER JOIN
  publishing.documents
ON
  documents.id = latest_edition_per_document.document_id
WHERE
  state <> 'draft' ;

CREATE OR REPLACE TABLE
  `govuk-knowledge-graph-dev.test.contact_phone_numbers` AS
SELECT
  title,
  base_path,
  content_id,
  JSON_EXTRACT_SCALAR(phone_numbers, '$.number') AS number
FROM
  `govuk-knowledge-graph-dev.test.documents_now`,
  UNNEST(JSON_EXTRACT_ARRAY(details, '$.phone_numbers')) AS phone_numbers
WHERE
  schema_name = 'contact' ;

SELECT
  base_path IS NOT NULL AS has_base_path,
  COUNT(*) AS n
FROM
  `govuk-knowledge-graph-dev.test.contact_phone_numbers`
GROUP BY
  has_base_path
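
For readers less familiar with BigQuery's JSON functions, the extraction and counting steps can be mirrored in plain Python. This is a sketch under assumptions: the three records are invented for illustration (the titles, base path and content IDs are hypothetical), and the shape of `details` (a JSON object with a `phone_numbers` array whose items carry a `number` field) is inferred from the JSON paths used in the query, not taken from the Content Store schema.

```python
import json
from collections import Counter

# Invented rows standing in for the test.documents_now table.
documents_now = [
    {
        "title": "Example contact with its own page",
        "base_path": "/example/contact/helpline",
        "content_id": "00000000-0000-0000-0000-000000000001",
        "schema_name": "contact",
        "details": json.dumps(
            {"phone_numbers": [{"number": "0300 200 1900"}, {"number": "0115 934 2777"}]}
        ),
    },
    {
        "title": "Example contact that is only embedded in other pages",
        "base_path": None,
        "content_id": "00000000-0000-0000-0000-000000000002",
        "schema_name": "contact",
        "details": json.dumps({"phone_numbers": [{"number": "020 7397 3008"}]}),
    },
    {
        "title": "Example page that is not a contact",
        "base_path": "/example/guide",
        "content_id": "00000000-0000-0000-0000-000000000003",
        "schema_name": "guide",
        "details": json.dumps({}),
    },
]


def contact_phone_numbers(documents):
    """One output row per phone number, like UNNEST(JSON_EXTRACT_ARRAY(...))."""
    rows = []
    for doc in documents:
        if doc["schema_name"] != "contact":
            continue
        details = json.loads(doc["details"])
        for phone in details.get("phone_numbers", []):
            rows.append(
                {
                    "title": doc["title"],
                    "base_path": doc["base_path"],
                    "content_id": doc["content_id"],
                    "number": phone.get("number"),
                }
            )
    return rows


rows = contact_phone_numbers(documents_now)

# Count phone numbers by whether their contact has a base_path.
counts = Counter(row["base_path"] is not None for row in rows)
print(dict(counts))
```

A contact with several phone numbers contributes several rows, so the counts are of phone numbers rather than of contacts, which matches the figures quoted above.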

An example number that isn’t on a page in its own right is the fax number for the British Embassy in Reykjavik.

It is embedded in each of the following pages.

But it doesn’t appear in the ‘body’ text of those pages.

It does appear in the ‘body’ text of at least four pages, from which we could extract it with libphonenumber. Alternatively, we could look it up among the named entities that have already been extracted:

WITH url_with_phones AS (
SELECT *,
REGEXP_REPLACE(name, ' ', '') as phone_no_spaces
FROM `cpto-content-metadata.named_entities.named_entities_all`
WHERE type = "PHONE"
)

SELECT *
FROM url_with_phones
WHERE phone_no_spaces IN ("+3545505105")

nacnudus commented 9 months ago

If this can be done in JavaScript instead of Ruby or Python, then we don't even need to create cloud functions: we can define the functions directly in BigQuery. Google's own JavaScript library isn't usable for this, because it's difficult to install, and in any case it doesn't support find() for finding numbers in general text. But Google themselves link to two forks that we could use instead.

nacnudus commented 8 months ago

Done by #571