Closed iampawanpoojary closed 2 years ago
Hi @iampawanpoojary, per discussion in #19, is your bucket in the same region as the service (looks like us-central1
from the screenshot)? And it's a regional (not multi-regional) bucket?
@kweinmeister Yes, it's us-central1 for both the service and the bucket. And it's regional.
I ran the notebook. Unrelated to the posted problem, it is missing pip installs for tf-agents and fastapi.
Also missing the service account setup needed for pipelines.
Yes, I ran into those as well, but I just installed the missing packages. Did you run the training pipeline too? Any suggestions on what might be wrong with it, so I can look into it?
Hi @iampawanpoojary, thank you for submitting this issue. I think the first step of checking can be to directly access all the paths mentioned in the error message, and see if they are valid. For instance, it seems that the first bucket path doesn't contain the "gs://" prefix.
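As an illustration of that first check, a minimal sketch of guarding against a dropped "gs://" prefix (the bucket and path below are placeholders, not taken from the notebook):

```python
# Minimal sketch: sanity-check that a copied artifact path still carries
# the gs:// scheme before pasting it back into the notebook.
def normalize_gcs_path(path: str) -> str:
    """Add the gs:// prefix back if it was lost while copying/redacting."""
    return path if path.startswith("gs://") else "gs://" + path

# Placeholder path for illustration:
print(normalize_gcs_path("my-bucket/pipeline/training_artifacts_dir"))
# → gs://my-bucket/pipeline/training_artifacts_dir
```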
Oh yes, the path for the bucket is correct; I just removed the "gs://" prefix while redacting my bucket name after copying. I have not changed any code in the notebook.
Reassigning to the notebook contributor.
I'm having the same issue. When I go to "View pipeline proto" for debugging info, here is the relevant logging for the error during train-reinforcement-learning-policy:
"error": { "code": 5, "message": "No such object: tf_agents_example/pipeline/[project]/movielens-pipeline-startup-20220111010531/train-reinforcement-learning-policy_-9056912573479256064/training_artifacts_dir; Failed to read GCS file: gs://tf_agents_example/pipeline/[project]/movielens-pipeline-startup-20220111010531/train-reinforcement-learning-policy_-9056912573479256064/training_artifacts_dir.; Failed to read output parameter training_artifacts_dir with spec type: STRING\n; Failed to get and update task output.; Failed to refresh external task state. Task:Project number: [project], Job id: 7110378969106481152, Task id: -9056912573479256064, Task name: train-reinforcement-learning-policy, Task state: RUNNING_EXECUTOR, Execution name: projects/[project]/locations/us-central1/metadataStores/default/executions/16194131858485586851; Failed to handle the pipeline task. Task: Project number: [project], Job id: 7110378969106481152, Task id: -9056912573479256064, Task name: train-reinforcement-learning-policy, Task state: RUNNING_EXECUTOR, Execution name: projects/[project]/locations/us-central1/metadataStores/default/executions/16194131858485586851" },
My bucket has tf_agents_example/pipeline/[project]/movielens-pipeline-startup-20220111010531/, but not the rest of the path given in the error.
Thank you for bringing this to our attention.
There's a quick fix that resolved this error on my end. It seems that the output parameter "training_artifacts_dir" of the Trainer component is causing issues. You can temporarily remove that output parameter with the following steps:
First, in src/trainer/trainer_component.py's train_reinforcement_learning_policy() function, remove the output parameter "training_artifacts_dir" by deleting the output signature NamedTuple("Outputs", [("training_artifacts_dir", str),]) and the return statement return outputs(training_artifacts_dir). You can double-check that the removal succeeded by confirming that no "Output Parameters" are listed for the training node on the pipeline visualization console page ("Pipeline run analysis" column).
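To illustrate that first step, here is a hedged sketch of the edit; the real train_reinforcement_learning_policy takes many more parameters, and only the output-parameter removal is shown:

```python
# Sketch only: the parameter list is simplified for illustration.
#
# Before (shape of the original component):
#   from typing import NamedTuple
#   def train_reinforcement_learning_policy(training_artifacts_dir: str, ...
#       ) -> NamedTuple("Outputs", [("training_artifacts_dir", str),]):
#       ...
#       outputs = NamedTuple("Outputs", [("training_artifacts_dir", str),])
#       return outputs(training_artifacts_dir)

# After: no output signature and no return statement.
def train_reinforcement_learning_policy(training_artifacts_dir: str) -> None:
    # ... training logic unchanged; artifacts are still written under
    # training_artifacts_dir, it is just no longer declared as an output.
    pass
```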
Next, go to the notebook code cell that authors the pipeline. Use the following code snippet to load a component op train_reinforcement_learning_policy_op from the Trainer code you just modified, and use train_reinforcement_learning_policy_op in place of train_op:
from kfp.components import create_component_from_func
from src.trainer.trainer_component import train_reinforcement_learning_policy

train_reinforcement_learning_policy_op = create_component_from_func(
    func=train_reinforcement_learning_policy,
    base_image="tensorflow/tensorflow:2.5.0",
    output_component_file="component.yaml",
    packages_to_install=[
        "tensorflow==2.5.0",
        "tf-agents==0.8.0",
    ],
)
One caveat is that train_task.outputs["training_artifacts_dir"] will no longer exist because we have removed the output parameter. For downstream components that used this variable, pass training_artifacts_dir directly instead.
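For example, if a downstream step previously consumed the trainer's output, the rewiring might look like this sketch (deploy_policy_op is a hypothetical name standing in for whatever component consumed the output in the real pipeline):

```python
# Hypothetical downstream op for illustration; it stands in for any
# component that consumed train_task.outputs["training_artifacts_dir"].
def deploy_policy_op(training_artifacts_dir: str) -> dict:
    return {"artifacts": training_artifacts_dir}

# Before: deploy_policy_op(train_task.outputs["training_artifacts_dir"])
# After: pass the pipeline-level variable directly (placeholder path below).
training_artifacts_dir = "gs://example-bucket/pipeline/training_artifacts"
deploy_task = deploy_policy_op(training_artifacts_dir)
```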
@KathyFeiyang Thanks for your help. I found out that there is something wrong with the provisioning step for the train task. I can get the pipeline working by commenting out this code in the pipeline authoring step, thus using the default machine spec (e2-standard-4 instead of n1-standard-4):
# worker_pool_specs = [
# {
# "containerSpec": {
# "imageUri": train_task.container.image,
# },
# "replicaCount": TRAINING_REPLICA_COUNT,
# "machineSpec": {
# "machineType": TRAINING_MACHINE_TYPE,
# "acceleratorType": TRAINING_ACCELERATOR_TYPE,
# "acceleratorCount": TRAINING_ACCELERATOR_COUNT,
# },
# },
# ]
# train_task.custom_job_spec = {
# "displayName": train_task.name,
# "jobSpec": {
# "workerPoolSpecs": worker_pool_specs,
# },
# }
If I have a chance, I'll try to figure out how to get it working with the desired machine type and other parameters.
FYI, I also had to add Pillow to the dependencies for the training container, as I was getting this error from an import in tf_agents:
2022-01-12T01:14:09.820946396Z from tf_agents.utils import example_encoding
2022-01-12T01:14:09.820952055Z File "/usr/local/lib/python3.6/dist-packages/tf_agents/utils/example_encoding.py", line 27, in &lt;module&gt;
2022-01-12T01:14:09.820985189Z from PIL import Image
2022-01-12T01:14:09.820992349Z ModuleNotFoundError: No module named 'PIL'
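For reference, one way to pick up that dependency is to extend the component's pip dependency list, as in this sketch (no version pin is shown because the thread doesn't name one; in the real fix this list would be passed to create_component_from_func):

```python
# Sketch: extend the trainer component's pip dependencies with Pillow,
# which tf_agents.utils.example_encoding imports as PIL.
packages_to_install = [
    "tensorflow==2.5.0",
    "tf-agents==0.8.0",
    "Pillow",  # provides the PIL module
]
```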
@matthaley I'm glad you got the pipeline working. Also, thank you so much for sharing these insights. We've taken a note of them and will work on updating the sample notebook.
Expected Behavior
In the "Author and run the RL pipeline" part, training should not error out once we submit the pipeline.
Actual Behavior
No such object: gs://<bucket>/pipeline/<>/movielens-pipeline-startup-<>/train-reinforcement-learning-policy_<>/training_artifacts_dir; Failed to read GCS file: gs://<bucket>/pipeline/<>/movielens-pipeline-startup-<>/train-reinforcement-learning-policy_<>/training_artifacts_dir.; Failed to read output parameter training_artifacts_dir with spec type: STRING ; Failed to get and update task output.; Failed to refresh external task state
The pipeline fails during the training step.
I am wondering if this might be related: https://github.com/GoogleCloudPlatform/vertex-ai-samples/pull/19#discussion_r683013115
@KathyFeiyang, if you are still maintaining this, is this something you have run into? After checking the discussion above I tried using str; it does not throw an error, but I guess it still does not work, as I get an error on the next step.
Steps to Reproduce the Problem
Specifications