mlops_pipeline_tf_agents_bandits_movie_recommendation.ipynb

iampawanpoojary commented 2 years ago

Expected Behavior

In the "Author and run the RL pipeline" part, the training step should not error out once the pipeline is submitted.

Actual Behavior

No such object: gs://<bucket>/pipeline/<>/movielens-pipeline-startup-<>/train-reinforcement-learning-policy_<>/training_artifacts_dir; Failed to read GCS file: gs://<bucket>/pipeline/<>/movielens-pipeline-startup-<>/train-reinforcement-learning-policy_<>/training_artifacts_dir.; Failed to read output parameter training_artifacts_dir with spec type: STRING ; Failed to get and update task output.; Failed to refresh external task state

The pipeline fails during the training step.

I am wondering whether this might be related to https://github.com/GoogleCloudPlatform/vertex-ai-samples/pull/19#discussion_r683013115

@KathyFeiyang, if you are still maintaining this, is this something you have run into? After reading the discussion above, I tried using Str. That does not throw an error, but I don't think it works either, because I then get an error in the next step.

Steps to Reproduce the Problem

Ran the notebook as is, with the required parameters.

Specifications

Platform: Vertex AI workbench notebooks

kweinmeister commented 2 years ago

Hi @iampawanpoojary, per discussion in #19, is your bucket in the same region as the service (looks like us-central1 from the screenshot)? And it's a regional (not multi-regional) bucket?

iampawanpoojary commented 2 years ago

@kweinmeister Yes, it's us-central1 for both the service and the bucket. The bucket is also regional.

andrewferlitsch commented 2 years ago

I ran the notebook. Unrelated to the posted problem, it is missing pip installs for tf-agents and fastapi.

Also missing the service account setup needed for pipelines.

iampawanpoojary commented 2 years ago

Yes, I ran into those as well, but I just installed the missing packages. Did you run the training pipeline too? Any suggestions on what might be wrong with it, so I can look into it?

KathyFeiyang commented 2 years ago

Hi @iampawanpoojary, thank you for submitting this issue. I think a good first check is to directly access all the paths mentioned in the error message and see whether they are valid. For instance, it seems that the first bucket path doesn't contain the "gs://" prefix.
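As a rough illustration of that first check, here is a minimal sketch that validates the "gs://" prefix and splits a path into bucket and object name. The helper name split_gcs_uri and the example path are assumptions made for illustration; they are not part of the notebook.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, object_path); raise if it is malformed."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {uri!r}")
    # Everything up to the first "/" after the prefix is the bucket name.
    bucket, _, object_path = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {uri!r}")
    return bucket, object_path


# Hypothetical path, used only to show the expected shape.
example_uri = "gs://example-bucket/pipeline/run-1/training_artifacts_dir"
print(split_gcs_uri(example_uri))
```

Once a path parses cleanly, listing it with gsutil ls shows whether the object actually exists in the bucket.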

iampawanpoojary commented 2 years ago

Oh yes, the bucket path is correct. I accidentally removed the gs:// prefix while redacting my bucket name after copying the error. I have not changed any code in the notebook.

andrewferlitsch commented 2 years ago

Reassigning to the notebook contributor.

matthaley commented 2 years ago

I'm having the same issue. When I go to "View pipeline proto" for debugging info, here is the relevant logging for the error during train-reinforcement-learning-policy:

    "error": {
      "code": 5,
      "message": "No such object: tf_agents_example/pipeline/[project]/movielens-pipeline-startup-20220111010531/train-reinforcement-learning-policy_-9056912573479256064/training_artifacts_dir; Failed to read GCS file: gs://tf_agents_example/pipeline/[project]/movielens-pipeline-startup-20220111010531/train-reinforcement-learning-policy_-9056912573479256064/training_artifacts_dir.; Failed to read output parameter training_artifacts_dir with spec type: STRING\n; Failed to get and update task output.; Failed to refresh external task state. Task:Project number: [project], Job id: 7110378969106481152, Task id: -9056912573479256064, Task name: train-reinforcement-learning-policy, Task state: RUNNING_EXECUTOR, Execution name: projects/[project]/locations/us-central1/metadataStores/default/executions/16194131858485586851; Failed to handle the pipeline task. Task: Project number: [project], Job id: 7110378969106481152, Task id: -9056912573479256064, Task name: train-reinforcement-learning-policy, Task state: RUNNING_EXECUTOR, Execution name: projects/[project]/locations/us-central1/metadataStores/default/executions/16194131858485586851"
    },

My bucket has tf_agents_example/pipeline/[project]/movielens-pipeline-startup-20220111010531/, but not the rest of the path given in the error.

KathyFeiyang commented 2 years ago

Thank you for bringing this to our attention.

There's a quick fix that resolved this error on my end: It seems that the output parameter "training_artifacts_dir" of the Trainer component is causing issues. You may temporarily remove that output parameter with the following steps:

1. In src/trainer/trainer_component.py, edit the train_reinforcement_learning_policy() function to remove the output parameter "training_artifacts_dir": delete the output signature -> NamedTuple("Outputs", [("training_artifacts_dir", str),]) and the return statement return outputs(training_artifacts_dir). To confirm that the removal worked, check that no "Output Parameters" are listed for the training node on the pipeline visualization console page ("Pipeline run analysis" column).

2. In the notebook code cell that authors the pipeline, use the following code snippet to load a component op, train_reinforcement_learning_policy_op, from the Trainer code you just modified, and use train_reinforcement_learning_policy_op in place of train_op:


from kfp.components import create_component_from_func

from src.trainer.trainer_component import train_reinforcement_learning_policy

# Build the component op from the modified Trainer function.
train_reinforcement_learning_policy_op = create_component_from_func(
    func=train_reinforcement_learning_policy,
    base_image="tensorflow/tensorflow:2.5.0",
    output_component_file="component.yaml",
    packages_to_install=[
        "tensorflow==2.5.0",
        "tf-agents==0.8.0",
    ],
)

One caveat is that train_task.outputs["training_artifacts_dir"] will no longer exist because we have removed the output parameter. For downstream components that use this variable, directly use training_artifacts_dir instead.
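To illustrate the edit to train_reinforcement_learning_policy() described above, here is a minimal before/after sketch. The function names and bodies are simplified stand-ins (assumptions), not the actual Trainer code, which takes more arguments and runs the training loop. Only the output signature and the return statement mirror the ones quoted above, and the example path is hypothetical.

```python
from typing import NamedTuple


# Before: the function declares and returns training_artifacts_dir as an output.
def train_with_output(training_artifacts_dir: str) -> NamedTuple(
    "Outputs", [("training_artifacts_dir", str)]
):
    from collections import namedtuple

    # ... training logic elided in this sketch ...
    outputs = namedtuple("Outputs", ["training_artifacts_dir"])
    return outputs(training_artifacts_dir)


# After: no output signature and no return statement.
def train_without_output(training_artifacts_dir: str) -> None:
    """Stand-in for the modified function; training logic elided."""


# Hypothetical path, used only for illustration.
training_artifacts_dir = "gs://example-bucket/artifacts"

# Before the change, downstream code reads the value from the task output.
before = train_with_output(training_artifacts_dir)
print(before.training_artifacts_dir)

# After the change, nothing is returned, so downstream code uses the
# training_artifacts_dir variable directly.
train_without_output(training_artifacts_dir)
print(training_artifacts_dir)
```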

matthaley commented 2 years ago

@KathyFeiyang Thanks for your help. I found out that there is something wrong with the provisioning step for the train task. I can get the pipeline working by commenting out this code in the pipeline authoring step, thus using the default machine spec (e2-standard-4 instead of n1-standard-4):

    # worker_pool_specs = [
    #     {
    #         "containerSpec": {
    #             "imageUri": train_task.container.image,
    #         },
    #         "replicaCount": TRAINING_REPLICA_COUNT,
    #         "machineSpec": {
    #             "machineType": TRAINING_MACHINE_TYPE,
    #             "acceleratorType": TRAINING_ACCELERATOR_TYPE,
    #             "acceleratorCount": TRAINING_ACCELERATOR_COUNT,
    #         },
    #     },
    # ]
    # train_task.custom_job_spec = {
    #     "displayName": train_task.name,
    #     "jobSpec": {
    #         "workerPoolSpecs": worker_pool_specs,
    #     },
    # }

If I have a chance, I'll try to figure out how to get it working with the desired machine type and other parameters.
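In the meantime, rather than commenting the block out, the override could be guarded by a flag so it is easy to toggle while debugging. This is only a sketch: the helper build_custom_job_spec and its enabled flag are hypothetical, the dict keys simply mirror the snippet above, and the values in the example call are placeholders rather than the notebook's actual settings.

```python
from typing import Any, Dict, Optional


def build_custom_job_spec(
    display_name: str,
    image_uri: str,
    replica_count: int,
    machine_type: str,
    accelerator_type: str,
    accelerator_count: int,
    enabled: bool = True,
) -> Optional[Dict[str, Any]]:
    """Return the custom job spec dict, or None to keep the default machine spec."""
    if not enabled:
        return None
    worker_pool_specs = [
        {
            "containerSpec": {"imageUri": image_uri},
            "replicaCount": replica_count,
            "machineSpec": {
                "machineType": machine_type,
                "acceleratorType": accelerator_type,
                "acceleratorCount": accelerator_count,
            },
        },
    ]
    return {
        "displayName": display_name,
        "jobSpec": {"workerPoolSpecs": worker_pool_specs},
    }


# Placeholder values for illustration only.
spec = build_custom_job_spec(
    display_name="train-reinforcement-learning-policy",
    image_uri="tensorflow/tensorflow:2.5.0",
    replica_count=1,
    machine_type="n1-standard-4",
    accelerator_type="ACCELERATOR_TYPE_UNSPECIFIED",
    accelerator_count=0,
    enabled=False,
)

# In the pipeline function, the spec would only be applied when it is not None:
#     if spec is not None:
#         train_task.custom_job_spec = spec
print(spec)
```

With enabled=False the helper returns None, so the task keeps the default machine spec, which is the same effect as commenting the block out.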

FYI, I also had to add Pillow to the dependencies for the training container, as I was getting this error from an import in tf_agents:

2022-01-12T01:14:09.820946396Z from tf_agents.utils import example_encoding
2022-01-12T01:14:09.820952055Z File "/usr/local/lib/python3.6/dist-packages/tf_agents/utils/example_encoding.py", line 27, in <module>
2022-01-12T01:14:09.820985189Z from PIL import Image
2022-01-12T01:14:09.820992349Z ModuleNotFoundError: No module named 'PIL'
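A quick way to catch this kind of missing dependency before submitting the pipeline is to check which top-level modules can be found in the container's Python environment. The sketch below is an assumption-laden helper, not part of the notebook: find_missing_modules is a hypothetical name, and the module list is only an example. Note that Pillow is the package name to install, while PIL is the module name it provides.

```python
import importlib.util
from typing import Iterable, List


def find_missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found in this environment.

    Only top-level names (no dots) are supported, because find_spec imports
    parent packages when given a dotted name.
    """
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Example module list (an assumption); adjust it to the container's needs.
missing = find_missing_modules(["tensorflow", "tf_agents", "PIL"])
print("Missing modules:", missing)
```

If PIL shows up as missing, adding "Pillow" to the training component's packages_to_install list should pull it in.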

KathyFeiyang commented 2 years ago

@matthaley I'm glad you got the pipeline working. Also, thank you so much for sharing these insights. We've taken note of them and will work on updating the sample notebook.

yinghsienwu commented 2 years ago

See https://github.com/GoogleCloudPlatform/vertex-ai-samples/pull/270
