GoogleCloudPlatform / cloudml-samples

Cloud ML Engine repo. Please visit the new Vertex AI samples repo at https://github.com/GoogleCloudPlatform/vertex-ai-samples
https://cloud.google.com/ai-platform/docs/
Apache License 2.0

collaborative filter example with ai-platform #452

Closed by victusfate 5 years ago

victusfate commented 5 years ago

I had modified a previous, now-deprecated gcloud example (https://github.com/GoogleCloudPlatform/tensorflow-recommendation-wals) to work with Python 3.x and custom data (GCS). I was unable to work out how the gcloud ml-engine arguments have been migrated to gcloud ai-platform.

I began working on this, and using the docs I was able to find how some of the arguments have evolved but it's not as transparent as I'd prefer.

Any chance there will be a new collaborative filter example (with movielens is fine) with gcloud ai-platform? Was this not a common use case?

Looking through the updated examples I found a reference where users are invited to migrate over the examples. https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/tensorflow/standard/legacy/movielens

Without a minimal example, it's hard to understand what has changed in the interface. I'm reading through the docs at https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction. I'm concerned that having to reverse engineer the API changes could be a shared adoption problem.

nnegrey commented 5 years ago

I believe the move from gcloud ml-engine to gcloud ai-platform should have just been a name swap, and the arguments should have remained the same. What sort of differences are you seeing?

gcloud ml-engine docs: https://cloud.google.com/sdk/gcloud/reference/ml-engine/

gcloud ai-platform docs: https://cloud.google.com/sdk/gcloud/reference/ai-platform/

victusfate commented 5 years ago

I ended up working through the differences locally. Thanks for commenting quickly @nnegrey !

I can check my local commit history and share what I altered. I'm still working on the remote (gcloud) training. Once that's working I'll do a side-by-side comparison and share which args changed.

victusfate commented 5 years ago

Just a few so far:

- --use-optimized is not recognized
- --verbose-logging is now --verbosity
- --train-files is now --train-file
- using --scale-tier custom requires --master-machine-type to be specified (the machine type used to be set up in the task code, but now needs to be specified up front)

I may have more changes to share after I get training to run remote again

nnegrey commented 5 years ago

What version of gcloud are you running?

victusfate commented 5 years ago

I'm away from my desk right now, but last week I installed the latest version referred to here:

https://github.com/Homebrew/formulae.brew.sh/blob/0b76ff9aa37d43ad4a0fce17d1eee2783459e376/_data/cask/google-cloud-sdk.json

Will get you a version number at home

nnegrey commented 5 years ago

So looking at the args from: https://github.com/GoogleCloudPlatform/tensorflow-recommendation-wals. Quite a few of those are not a part of gcloud itself.

Such as: --use-optimized, --train-files or --train-file, --verbose-logging. Those are user-defined args from the first example you linked.

Any args after the bare -- line are all passed to your trainer Python script, which has to do its own arg parsing. Example from the first link:

# NOTE: all args above the bare "--" are part of gcloud;
# all args below it are user defined.
gcloud ml-engine jobs submit training ${JOB_NAME} \
    --region ${REGION} \
    --scale-tier=CUSTOM \
    --job-dir ${BUCKET}/jobs/${JOB_NAME} \
    --module-name trainer.task \
    --package-path trainer \
    --config ${CONFIG_TUNE} \
    -- \
    --hypertune \
    ${ARGS}
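To make the split concrete, here is a minimal Python sketch of the rule described above. It is only an illustration, not how gcloud is implemented: the helper name `split_at_separator` and the bucket path are hypothetical, and the command is written as a plain list of tokens.

```python
def split_at_separator(argv):
    """Split a token list at the first bare "--".

    Tokens before the separator are treated as gcloud flags; tokens after it
    are the user-defined args that the trainer script must parse itself.
    """
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    # No separator: nothing is forwarded to the trainer.
    return list(argv), []


# Hypothetical token list modeled on the command above.
command = [
    "--region", "us-central1",
    "--scale-tier=CUSTOM",
    "--module-name", "trainer.task",
    "--package-path", "trainer",
    "--",
    "--hypertune",
    "--train-file", "gs://example-bucket/data.csv",
]

gcloud_args, user_args = split_at_separator(command)
print("gcloud flags:", gcloud_args)
print("trainer args:", user_args)
```

Under this reading, flags such as --train-file or --hypertune are never interpreted by gcloud at all, which would explain why renaming them is a change in the sample code rather than in the CLI.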

User defined args are used to help set things like epochs, learning-rate or other such fields as you define.
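As a rough sketch of what that parsing might look like on the trainer side, the snippet below uses Python's standard `argparse`. The flag names (`--train-file`, `--epochs`, `--learning-rate`, `--verbose-logging`, `--hypertune`) and the function name `parse_trainer_args` are assumptions chosen for illustration; the actual sample defines its own set.

```python
import argparse


def parse_trainer_args(argv=None):
    """Parse hypothetical user-defined flags for a trainer.task module."""
    parser = argparse.ArgumentParser(description="Illustrative trainer args")
    # --job-dir given to gcloud is also forwarded to the trainer, so accept it.
    parser.add_argument("--job-dir", default=None)
    parser.add_argument("--train-file", required=True)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--verbose-logging", action="store_true")
    parser.add_argument("--hypertune", action="store_true")
    # parse_known_args collects unrecognized flags instead of exiting,
    # which makes renamed or removed flags easy to spot.
    return parser.parse_known_args(argv)


args, unknown = parse_trainer_args(
    ["--train-file", "gs://example-bucket/data.csv", "--epochs", "5", "--use-optimized"]
)
print(args.train_file, args.epochs, args.learning_rate)
print("unrecognized:", unknown)
```

If the trainer used plain `parse_args` instead, an unexpected flag such as --use-optimized would abort the job with an "unrecognized arguments" error raised by the script itself, not by gcloud.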

For some more info: https://cloud.google.com/ml-engine/docs/training-jobs#submit-job

Let me know if you have any other questions.

victusfate commented 5 years ago

thanks @nnegrey I'll tackle those tomorrow

nnegrey commented 5 years ago

Yep! And ml-engine / ai-platform are interchangeable (assuming your training application code is the same, nothing should change if it works on ml-engine). ML Engine was rebranded to AI Platform, so we updated the gcloud command, which is why you see that warning to switch to ai-platform.