query="""
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM
(SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
title
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""
I created titles_full.csv using code from: https://datalab.office.datisan.com.au/notebooks/training-data-analyst/blogs/textclassification/txtcls.ipynb
as follows:
query=""" SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM (SELECT ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.'))[OFFSET(1)] AS source, title FROM
bigquery-public-data.hacker_news.stories
WHERE REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.com$') AND LENGTH(title) > 10 ) WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch') """from google.cloud import bigquery client = bigquery.Client() df = client.query(query).to_dataframe() df.to_csv('titles_full.csv', header=False, index=False, encoding='utf-8', sep=',')
I had to swap the column order to COLUMNS = ['source', 'title'], because the query selects source first and the CSV is written without a header row.
Without the swap, loss was minimised after 20.
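Why the order matters can be shown with a small sketch that only uses the standard csv module. The two sample rows and the read_rows helper are invented for illustration; the real file is titles_full.csv, and the training code presumably does its own parsing.

```python
import csv
import io

COLUMNS = ['source', 'title']  # must match the SELECT order in the query

# Two made-up rows laid out like titles_full.csv (no header row).
sample_csv = (
    "github,Show HN  A tiny web framework\n"
    "nytimes,A made-up headline for illustration\n"
)


def read_rows(text, columns):
    """Hypothetical helper: parse headerless CSV text, naming fields by position."""
    return list(csv.DictReader(io.StringIO(text), fieldnames=columns))


# Correct order: the 'source' field really holds the label.
rows = read_rows(sample_csv, COLUMNS)
labels = [row['source'] for row in rows]

# Swapped order: the 'source' field now silently holds the title text.
swapped = read_rows(sample_csv, ['title', 'source'])
wrong_labels = [row['source'] for row in swapped]

if __name__ == '__main__':
    print(labels)
    print(wrong_labels)
```

With a headerless file nothing fails when the names are in the wrong order; the label column just ends up holding titles, so the mistake only shows up in the training metrics.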
"some stuff here about setting up Eval jobs"