GoogleCloudPlatform / generative-ai

Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI
https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview
Apache License 2.0

[Bug]: Deploy the service (reasoning_engines.LangchainAgent) causes InternalServerError (in Argolis project) #915

Closed jeromemassot closed 3 months ago

jeromemassot commented 3 months ago

File Name

tutorial_alloydb_rag_agent.ipynb

What happened?

remote_app = reasoning_engines.ReasoningEngine.create(
    agent,
    requirements=[
        "google-cloud-aiplatform[reasoningengine,langchain]==1.57.0",
        "langchain-google-alloydb-pg==0.4.1",
        "langchain-google-vertexai==1.0.4",
    ],
    display_name="PrebuiltAgent",
)

returns an InternalServerError: 500 The user created Reasoning Engine failed to start and cannot serve traffic. 13: The user created Reasoning Engine failed to start and cannot serve traffic when running in an Argolis project.

However, the AlloyDB instance is set up correctly and can be accessed, and other LangChain agents have been deployed in this Argolis project without any issue.

Relevant log output

No response

koverholt commented 3 months ago

I attempted to reproduce the problem by specifying and deploying an agent using your sample code, and the deployment worked for me as expected.

This error can sometimes be caused by a mismatch between the Python library versions on the client side (i.e., the dev environment you are deploying the agent from) and the server side (i.e., the deployed Reasoning Engine service). Ensure that the library versions in your development environment match the versions listed in requirements.
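As a rough illustration of that client-side check, the sketch below compares the `==` pins in a requirements list against what is installed locally. The helper name `find_version_mismatches` and its `get_version` parameter are hypothetical, not part of the Vertex AI SDK; only the standard-library `importlib.metadata` module is used.

```python
from importlib import metadata


def find_version_mismatches(requirements, get_version=metadata.version):
    """Return {package: (pinned, installed)} for pins that differ locally.

    Hypothetical helper: `get_version` defaults to importlib.metadata.version
    and can be swapped out for testing.
    """
    mismatches = {}
    for requirement in requirements:
        if "==" not in requirement:
            continue  # unpinned entries cannot be compared
        name, pinned = requirement.split("==", 1)
        pinned = pinned.strip()
        # Drop extras such as [reasoningengine,langchain] from the name.
        base_name = name.split("[", 1)[0].strip()
        try:
            installed = get_version(base_name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != pinned:
            mismatches[base_name] = (pinned, installed)
    return mismatches
```

Calling it with the same list passed to requirements would return an empty dict when every pinned package matches the local environment.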

If you still run into the issue after checking that, view your deployment logs in the Logs Explorer in the Console. Look there for any error messages related to deployment; you can also filter the logs by searching for the Reasoning Engine ID that you see during deployment:

create ReasoningEngine backing LRO: projects/962751530XXX/locations/us-central1/reasoningEngines/3471395703400431616/operations/4369265454217166848

When viewing the logs, you can also verify that the Python version in Reasoning Engine is the one you expect based on your development environment. You can also look for package installation errors, dependency conflicts, or other issues in the logs. The Python version gets inferred on the client side when deploying, and you can specify sys_version as an argument to reasoning_engines.ReasoningEngine.create if it's failing to auto-detect the correct Python version.
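A minimal sketch of deriving that value from the local interpreter, assuming sys_version expects a "major.minor" string such as "3.10" (the helper name `local_python_version` is hypothetical):

```python
import sys


def local_python_version(version_info=sys.version_info):
    """Format an interpreter version as "major.minor", e.g. "3.10"."""
    return f"{version_info[0]}.{version_info[1]}"


# Hypothetical usage, mirroring the call from the original report:
# remote_app = reasoning_engines.ReasoningEngine.create(
#     agent,
#     requirements=[...],
#     sys_version=local_python_version(),
# )
```

Comparing this value with the Python path shown in the deployment logs (for example /usr/local/lib/python3.10/) is one way to spot an interpreter mismatch.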

draffensperger commented 3 months ago

I ran into this issue today as well and then found this GitHub issue. What I did was look for the logs in Cloud Logging. I found them via the logs query log_id(reasoning_engine%2Fstderr) OR log_id(reasoning_engine%2Fbuild). The logs took a little while to appear, and just searching for logs with resource.type="aiplatform.googleapis.com/ReasoningEngine" didn't seem to work (maybe a logging bug).
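For repeated debugging, the query above can be assembled in code. This is only a sketch: the function name `reasoning_engine_log_filter` is hypothetical, the log_id expressions are copied as written in this comment, and narrowing by a quoted Reasoning Engine ID as a free-text term is an assumption, not something confirmed in this thread.

```python
def reasoning_engine_log_filter(engine_id=None):
    """Build a Logs Explorer query string for Reasoning Engine logs."""
    # The two log IDs mentioned in this thread (stderr and build logs).
    log_clause = (
        "log_id(reasoning_engine%2Fstderr) OR log_id(reasoning_engine%2Fbuild)"
    )
    if engine_id is None:
        return log_clause
    # Assumption: a quoted term restricts results to entries containing the ID.
    return f'({log_clause}) AND "{engine_id}"'


print(reasoning_engine_log_filter())
```

The resulting string can be pasted into the Logs Explorer query box.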

Anyway, once I found logs for my reasoning engine, I saw an error like this:

DEFAULT 2024-08-01T00:37:37.523307Z AttributeError: Can't get attribute '_class_setstate' on <module 'cloudpickle.cloudpickle' from '/usr/local/lib/python3.11/site-packages/cloudpickle/cloudpickle.py'>
ERROR 2024-08-01T00:37:41.810830Z Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/code/app/__main__.py", line 28, in <module>
    main()
  File "/code/app/__main__.py", line 16, in main
    app=create_app()

I ran pip show to get the versions of google-cloud-aiplatform, langchain-google-vertexai and langchain-core, which were the three items in the requirements section of my reasoning engine creation call (I was following the example in these docs).

However, Gemini itself suggested that I might have version skew in the cloudpickle package itself. So I ran pip show on cloudpickle and added a locked version to my requirements parameter for reasoning_engines.ReasoningEngine.create and voila, I got past the error.

koverholt commented 3 months ago

@draffensperger, thanks for posting the details of your experience. You are exactly right: if you get a serialization / cloudpickle error like that, it can help to pin cloudpickle==3.0.0 to solve the issue.

I've included that version pin in notebooks such as https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/reasoning-engine/tutorial_vertex_ai_search_rag_agent.ipynb, both in the pip install lines at the top of the notebook and in the package versions that get specified in the reasoning_engines.ReasoningEngine.create options.

We also try to document frequently encountered issues like this one at https://cloud.google.com/vertex-ai/generative-ai/docs/reasoning-engine/troubleshooting/deploy#cloudpickle_version.

jeromemassot commented 3 months ago

Thank you both for the help with the cloudpickle issue :) I confirm that on my side I had no clear error log, but I did not make the same effort as @draffensperger to look for it. Best regards, Jerome

jeromemassot commented 3 months ago

I tried adding the locked cloudpickle version to the code, but I still have an issue.

remote_app = reasoning_engines.ReasoningEngine.create(
    agent,
    requirements=[
        "google-cloud-aiplatform[reasoningengine,langchain]==1.57.0",
        "langchain-google-alloydb-pg==0.4.1",
        "langchain-google-vertexai==1.0.4",
        "cloudpickle==3.0.0"
    ],
    display_name="PrebuiltAgent",
)

It seems that a LangChain module is missing; the error is triggered by the cloudpickle.loads() call.

{
insertId: "66ab939b0003bc8d2a8876e6"
logName: "projects/education-and-tests-422020/logs/reasoning_engine%2Fstderr"
receiveTimestamp: "2024-08-01T13:54:35.541482941Z"
resource: {2}
severity: "ERROR"
textPayload: "Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/code/app/__main__.py", line 28, in <module>
    main()
  File "/code/app/__main__.py", line 16, in main
    app=create_app(),
  File "/code/app/api/app.py", line 54, in create_app
    router=python_file_api_builder.PythonFileApiBuilder(
  File "/code/app/api/factory/python_file_api_builder.py", line 43, in __init__
    self.obj = utils.get_object(self.file_name)
  File "/code/app/api/factory/utils.py", line 35, in get_object
    obj = get_local_object(obj_filename)
  File "/code/app/api/factory/utils.py", line 54, in get_local_object
    return cloudpickle.loads(f.read())
ModuleNotFoundError: No module named 'langchain_google_alloydb_pg.engine'"
timestamp: "2024-08-01T13:54:35.244877Z"
}

jeromemassot commented 3 months ago

In fact, I removed all the version locks except the one on cloudpickle, and it worked.

remote_app = reasoning_engines.ReasoningEngine.create(
    agent,
    requirements=[
        "google-cloud-aiplatform[reasoningengine,langchain]",
        "langchain-google-alloydb-pg",
        "langchain-google-vertexai",
        "cloudpickle==3.0.0"
    ],
    display_name="PrebuiltAgent",
)

koverholt commented 3 months ago

@jeromemassot, thanks for the update and letting us know the resolution. It might be the case that there were some transitive dependencies that were conflicting during the pip install. Glad to know that you were able to get it working!