Closed nacnudus closed 8 months ago
There are three ways to obtain phone numbers:
cpto-content-metadata.named_entities.named_entities_all
There are 3432 phone numbers in pages with schema_name='contact'
in the Content Store, and there are another 154 that aren’t in the Content Store because they don’t have a base_path
, which means they are only visible (either online or in GovSearch) when they are embedded in other pages. That isn’t as many as I feared.
# Expensive query! 100GB+
CREATE OR REPLACE TABLE
test.documents_now AS
WITH
latest_edition_per_document AS (
SELECT
\*
FROM
publishing.editions QUALIFY ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY updated_at DESC) = 1 )
SELECT
documents.content_id,
documents.locale,
latest_edition_per_document.*
FROM
latest_edition_per_document
INNER JOIN
publishing.documents
ON
documents.id = latest_edition_per_document.document_id
WHERE
state <> 'draft' ;
CREATE OR REPLACE TABLE
`govuk-knowledge-graph-dev.test.contact_phone_numbers` AS
SELECT
title,
base_path,
content_id,
JSON_EXTRACT_SCALAR(phone_numbers, '$.number') AS number
FROM
`govuk-knowledge-graph-dev.test.documents_now`,
UNNEST(JSON_EXTRACT_ARRAY(details, '$.phone_numbers')) AS phone_numbers
WHERE
schema_name = 'contact'
SELECT
base_path IS NULL AS has_base_path,
COUNT(*) AS n
FROM
`govuk-knowledge-graph-dev.test.contact_phone_numbers`
GROUP BY
has_base_path
An example number that isn’t on a page in it’s own right is the fax number for the British Embassy in Reykjavik.
It is embedded in each of the following pages.
But it doesn’t appear in the ‘body’ text of those pages.
It does appear in the ‘body’ text of at least four pages, from which we could extract it with libphonenumber, or we could use the entities.
WITH url_with_phones AS (
SELECT *,
REGEXP_REPLACE(name, ' ', '') as phone_no_spaces
FROM `cpto-content-metadata.named_entities.named_entities_all`
WHERE type = "PHONE"
)
SELECT *
FROM url_with_phones
WHERE phone_no_spaces IN ("+3545505105")
If this can be done in javascript instead of Ruby or Python, then we don't even need to create cloud functions. We can define them directly in BigSQL. Google's own javascript library isn't useable for this, because it's difficult to install, and anyway it doesn't support find()
to find numbers in general text. But Google themselves link to two forks that we could use instead.
find()
and is bundled in a way that we could host in a bucketDone by #571
https://trello.com/c/mSjX7Dq0/2353-think-about-phone-numbers
Users seem to often search for phone numbers, but they might not be getting a complete set of results. This could lead to outdated phone numbers remaining on GOV.UK.
We ought to consider mitigating this problem.
Evidence of user need: https://docs.google.com/document/d/1auqzEXTiwAgNPG6PfxDDG7rSt7ImShKxEOo_zBaUsAE/edit#heading=h.aowahbaamoq
One way would be to use Google's libphonenumber to detect phone numbers in GOV.UK content, and format them in a standard form. Similarly, format searched-for phone numbers in a standard form, and then do a lookup.
The script below uses a python implementation of libphonenumber. I wrote my own python bindings to Google's C++ implementation, and it had exactly the same performance :smiling_face_with_tear: