GoogleCloudPlatform / ml-design-patterns

Source code accompanying O'Reilly book: Machine Learning Design Patterns
Apache License 2.0

Chapter 5: Continued Evaluation: Dataset Access, EarlyStopping, Evaluation #10

Closed mshearer0 closed 4 years ago

mshearer0 commented 4 years ago
  1. The munn-sandbox is not publicly available, so txtcls cannot be accessed.

I created it myself using code from: https://datalab.office.datisan.com.au/notebooks/training-data-analyst/blogs/textclassification/txtcls.ipynb

as:

query = """
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title
FROM (
  SELECT
    ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
    title
  FROM `bigquery-public-data.hacker_news.stories`
  WHERE REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.*://(.[^/]+)/'), '.com$')
    AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""

from google.cloud import bigquery

client = bigquery.Client()
df = client.query(query).to_dataframe()
df.to_csv('titles_full.csv', header=False, index=False, encoding='utf-8', sep=',')

I had to swap the column order: COLUMNS = ['source', 'title']
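
The swap follows from how the file is written: the query selects source before title, and the CSV is saved with header=False, so the file itself carries no column names and the reader's list has to match the written order. A minimal sketch of that dependency, using only the standard library and two made-up rows (the sample titles are hypothetical, not real query output):

```python
import csv
import io

# Must match the SELECT order in the query: source first, then title.
COLUMNS = ['source', 'title']

# Hypothetical rows laid out like titles_full.csv (no header row).
sample = io.StringIO(
    "github,Show HN  a tiny example project\n"
    "nytimes,An example headline\n"
)

# Pair each value with its column name purely by position.
rows = [dict(zip(COLUMNS, record)) for record in csv.reader(sample)]

for row in rows:
    print(row['source'], '|', row['title'])
```

If COLUMNS were left as ['title', 'source'], the same file would load without any error but with labels and titles exchanged, which is why the mismatch is easy to miss.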

  2. With EarlyStopping enabled, training finished after just 2 epochs:

callbacks=[EarlyStopping(), TensorBoard(model_dir)],

Without it, loss was minimised after 20 epochs.
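
One possible explanation is that EarlyStopping() with no arguments monitors val_loss with patience=0, so a single epoch without improvement ends training. The sketch below is a simplified pure-Python model of patience-based stopping, not Keras's actual implementation; the function name epochs_run and the loss values are made up for illustration.

```python
def epochs_run(val_losses, patience=0):
    """Return how many epochs run before a patience-based early stop.

    Simplified model: training stops once the validation loss has gone
    `patience` epochs (at least one) without beating the best value so far.
    """
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
            continue
        wait += 1
        if wait >= patience:
            return epoch
    return len(val_losses)


# Hypothetical validation losses: one noisy early epoch, then steady progress.
losses = [0.70, 0.72, 0.65, 0.60, 0.58]

print(epochs_run(losses, patience=0))  # stops at the first bad epoch
print(epochs_run(losses, patience=3))  # rides out the noisy epoch
```

In Keras, the corresponding change would be passing a larger patience, for example EarlyStopping(monitor='val_loss', patience=5); the value 5 is an arbitrary choice here and would need tuning for this model.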

  3. The Evaluation job section is still a 'to-do':

"some stuff here about setting up Eval jobs"

mshearer0 commented 4 years ago

Closed, as the repository has been updated with a new version since I downloaded it.