GoogleCloudPlatform / training-data-analyst

Labs and demos for courses for GCP Training (http://cloud.google.com/training).
Apache License 2.0
7.93k stars 5.89k forks source link

OFFSET SQL function yields `400 Array index 1 is out of bounds` while extracting info from `hacker_news` dataset #2434

Open MrCsabaToth opened 1 year ago

MrCsabaToth commented 1 year ago

Many SQL sections in various notebooks where the instructions explore the information in the dataset uses OFFSET(1) while trying to extract the domain name stem as the source. Three labs are mentioned in the #2432 issue (with their name and Cloud Skills Boost URL) along with their notebook, but there are many more notebooks. Example query cell:

%%bigquery --project $PROJECT

SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    COUNT(title) AS num_articles
FROM
    `bigquery-public-data.hacker_news.full`
WHERE
    REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
GROUP BY
    source
ORDER BY num_articles DESC
  LIMIT 100

Resulting error:

ERROR:
 400 Array index 1 is out of bounds (overflow)

Location: US
Job ID: 389a7292-2c3b-4f14-8129-af10d4270423

A workaround is to use SAFE_OFFSET instead of OFFSET. A few other notebooks use that, and all notebooks use that in the https://github.com/GoogleCloudPlatform/asl-ml-immersion/ repo. I'll amend the PR#2433 with this.

MrCsabaToth commented 1 year ago

The https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/text_classification/labs/reusable_embeddings.ipynb uses the safe_offset call already.