RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

OpenTelemetry (OTel) logging in ARAX #2186

Closed saramsey closed 10 months ago

saramsey commented 12 months ago

Hi all,

The latest three-month development milestones for Translator are calling for all ARAs and KPs to implement OpenTelemetry logging of web API calls, by end of December. We will need to do this for ARAX and (I suppose, insofar as it does call PloverDB via a web API) RTX-KG2.

saramsey commented 11 months ago

I've reached out to Yaphet to better understand the requirements, and I have learned the following:

  1. Both ARAs and KPs are expected to implement OpenTelemetry logging
  2. Both incoming and outgoing web requests are to be logged via OpenTelemetry
  3. There will be one Jaeger server per ITRB maturity level, which I guess (from within ITRB) we can address like this:
    jaeger_host: "jaeger-otel-agent.sri"
    jaeger_port: "6831"

    Unclear to which Jaeger service we can log, from services running on arax.ncats.io (I need to find out).

saramsey commented 11 months ago

Still discussing with Yaphet how to get a Jaeger server (addressable on the Internet) spun up that we can use in development work.

saramsey commented 11 months ago

Latest info from Yaphet is:

Hi @Chris Bizon (SRI, Ranking Agent), @ramseyst , we could use the docker compose file here https://github.com/TranslatorSRI/Jaeger-demo/blob/main/jaeger-docker-compose.yaml and point the code to port 4318 as seen here (https://github.com/TranslatorSRI/Jaeger-demo/blob/main/service-C/server.py#L41); that would work. Alternatively, the docker image itself can be stood up without the docker compose file, as outlined here (https://www.jaegertracing.io/docs/1.50/getting-started/#all-in-one)

dkoslicki commented 11 months ago

@kvnthomas98 This is something that NCATS wants quite soon. Is this something you can work on, pausing your other MVP2 work?

saramsey commented 11 months ago

Thank you @dkoslicki

kvnthomas98 commented 11 months ago

Sure @dkoslicki I will look into this.

kvnthomas98 commented 11 months ago

Hi @saramsey ,

Do you have the ITRB jaeger endpoints for each maturity level?

saramsey commented 10 months ago

We don't yet have access to an already-running (as in, provided for us by the SRI team) Jaeger endpoint that is on the Internet (though I understand that Yaphet is researching how to set that up).

However, within an ITRB-deployed container, I am told that the following OpenTelemetry configuration should work, and the hostname should resolve:

  jaeger_host: "jaeger-otel-agent.sri"
  jaeger_port: "6831"

I have not tested that, however. And it seems (to me) not ideal if our only way to test it out is to deploy to ITRB CI.
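
One way to keep the same code usable both inside ITRB and in development is to read the Jaeger agent address from the environment and fall back to the ITRB values quoted above. This is only a minimal sketch: the environment variable names `JAEGER_HOST` and `JAEGER_PORT` and the function name `resolve_jaeger_agent` are hypothetical, not something ARAX or ITRB is known to define.

```python
import os
from typing import Mapping, Optional, Tuple

# Values quoted in this thread for ITRB-deployed containers (not yet tested there).
ITRB_JAEGER_HOST = "jaeger-otel-agent.sri"
ITRB_JAEGER_PORT = 6831


def resolve_jaeger_agent(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Return (host, port) for the Jaeger agent.

    JAEGER_HOST and JAEGER_PORT are hypothetical variable names used only
    for this sketch; when they are unset, the ITRB values are returned.
    """
    if env is None:
        env = os.environ
    host = env.get("JAEGER_HOST", ITRB_JAEGER_HOST)
    port = int(env.get("JAEGER_PORT", str(ITRB_JAEGER_PORT)))
    return host, port
```

With this arrangement, a development container could set the two variables to point at a local or test Jaeger, while an ITRB deployment would need no extra configuration.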

edeutsch commented 10 months ago

Note that they are working on some documentation here: https://github.com/NCATSTranslator/TranslatorTechnicalDocumentation/pull/53/files

It would likely be useful to read that and provide feedback/comments. It is an open PR.

saramsey commented 10 months ago

@kvnthomas98 I think the documentation that @edeutsch linked is helpful; it explains how to run a local Jaeger, which we can use in development and testing. I was hoping that SRI would provide us with an Internet-addressable Jaeger endpoint that we could use in testing, but apparently there are some issues with that (so it is still pending). In the meantime, I think maybe we should try moving forward with using a "local Jaeger" for development and testing on arax.ncats.io. See this section of the documentation that Eric linked: https://github.com/NCATSTranslator/TranslatorTechnicalDocumentation/blob/214dcfef8465c95c1f68b0f62549b43442c23a30/docs/deployment-guide/monitoring.md?plain=1#L25-L47

edeutsch commented 10 months ago

Easier to read here: https://github.com/NCATSTranslator/TranslatorTechnicalDocumentation/blob/telemetry-FAQ/docs/deployment-guide/monitoring.md

saramsey commented 10 months ago

@kvnthomas98 what kind of EC2 instance would you need for hosting a Jaeger collector? Can you describe the hardware requirements? And storage requirements? Also what version of Ubuntu? I think we typically use Ubuntu 22.04?

saramsey commented 10 months ago

Hi all, from the Translator Release Schedule Timeline Google sheet, it's looking like we have two weeks to code this issue and get it into CI; I think the opportunity to push these updates to TEST will be on Dec. 15.

dkoslicki commented 10 months ago

Oof, Kevin is currently working on an ordering and organizing ask that also has the same deadline.

saramsey commented 10 months ago

Looks like the previous commit was to the issue2186 branch (thank you @kvnthomas98 )

kvnthomas98 commented 10 months ago

Hi @saramsey, sorry I missed the message. For hardware requirements, an m5.large should do, since the collector is lightweight and we don't have a crazy load; if we want to be cautious, an m5.xlarge should do. Please do share your thoughts. For storage, I was thinking we could use Elasticsearch. Regarding storage volume, I have no idea how much storage we need.

Once you've brought up the instance, please do let me know. I can work on setting up Docker, the Jaeger collector, and Elasticsearch, and on testing our ARAX code.

saramsey commented 10 months ago

Hi @kvnthomas98 I have created an m5.large instance jaeger.rtx.ai in the us-east-1 region, with 64 GiB of EBS storage. I set up the AWS security group policy for the instance to allow ingress packets to ports 16686/tcp (Jaeger front-end) and 4318/tcp (Jaeger OTel via HTTP) from the CIDR block 35.81.149.105/32 (i.e., from arax.ncats.io). I installed your SSH RSA public key into the instance so you should be able to log into it from the command-line via

ssh -o StrictHostKeyChecking=no ubuntu@jaeger.rtx.ai

I've installed docker (from docker.io) into the instance, and I've already pulled the Docker image jaegertracing/all-in-one from DockerHub via

sudo docker pull jaegertracing/all-in-one

You can run Jaeger locally via the command (which is adapted from the one in the installation instructions on the Jaeger website):

sudo docker run --rm --name jaeger \
    -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
    -p 6831:6831/udp \
    -p 6832:6832/udp \
    -p 5778:5778 \
    -p 16686:16686 \
    -p 4317:4317 \
    -p 4318:4318 \
    -p 14250:14250 \
    -p 14268:14268 \
    -p 14269:14269 \
    -p 9411:9411 \
    jaegertracing/all-in-one:latest

I've put that command in a shell-script /home/ubuntu/run-jaeger.sh. So, when I run the aforementioned shell script, the Jaeger front-end is reachable and works, as shown here: Screenshot 2023-12-02 at 7 47 40 AM

and I can make a TCP connection from arax.ncats.io to port 16686 on the Jaeger server in us-east-1,

stephenr@ip-172-31-53-16:~$ nc -v jaeger.rtx.ai 16686
Connection to jaeger.rtx.ai 16686 port [tcp/*] succeeded!

and to port 4318 as well:

stephenr@ip-172-31-53-16:~$ nc -v jaeger.rtx.ai 4318
Connection to jaeger.rtx.ai 4318 port [tcp/*] succeeded!
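
The same reachability check can be scripted, for example before a test run. Below is a minimal sketch using only the Python standard library; the function name `port_is_open` is hypothetical. It tests TCP ports only, so it applies to 16686 and 4318 but not to the UDP agent ports such as 6831.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the hostname and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For instance, `port_is_open("jaeger.rtx.ai", 4318)` would be expected to return True from arax.ncats.io while the instance is running and the security group allows that source address.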

Just to minimize cost, I've opted to stop the instance until we are ready to test it.

So whenever you want to test out OpenTelemetry, simply do the following four steps:

  1. In the AWS Console, go to EC2 and start the jaeger.rtx.ai instance
  2. From your local computer, run ssh ubuntu@jaeger.rtx.ai ./run-jaeger.sh, which should start Jaeger
  3. Test OpenTelemetry as you like, pointing the OTel data stream at jaeger.rtx.ai:4318.
  4. When you are done, I suppose we can stop the jaeger.rtx.ai instance (until such time as we deploy telemetry to arax.ncats.io and we need jaeger.rtx.ai to be running all the time).

saramsey commented 10 months ago

To be clear, for security reasons, I have locked down the security group policy on jaeger.rtx.ai, though we can allow other IPs to connect as well, if need be:

Screenshot 2023-12-02 at 4 13 22 PM

saramsey commented 10 months ago

The following simple demonstration python code snippet, run in python3.9 inside the rtx1 container on arax.ncats.io, successfully logs a message to our Jaeger server on jaeger.rtx.ai:

from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
#from opentelemetry.sdk.resources import SERVICE_NAME as telemetery_service_name_key
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
jaeger_host = 'jaeger.rtx.ai'
jaeger_port = 6831
trace.set_tracer_provider(TracerProvider(resource=Resource.create({'bar': 'foo'})))
jaeger_exporter = JaegerExporter(agent_host_name=jaeger_host, agent_port=jaeger_port)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(jaeger_exporter))
tracer = trace.get_tracer("test_otel.py")
with tracer.start_as_current_span("span-name") as span:
    # do some work that 'span' will track
    print("doing some work...")

The above python code (which was adapted from a demo program from the ARAGORN team) requires the following packages to be installed:

(venv) rt@d1fd345478a0:~$ pip freeze
annotated-types==0.6.0
anyio==3.7.1
asgiref==3.7.2
backoff==2.2.1
certifi==2023.11.17
charset-normalizer==3.3.2
Deprecated==1.2.14
exceptiongroup==1.2.0
fastapi==0.104.1
googleapis-common-protos==1.59.1
grpcio==1.59.3
h11==0.14.0
httpcore==1.0.2
httpx==0.25.2
idna==3.6
importlib-metadata==6.11.0
opentelemetry-api==1.21.0
opentelemetry-exporter-jaeger==1.21.0
opentelemetry-exporter-jaeger-proto-grpc==1.21.0
opentelemetry-exporter-jaeger-thrift==1.21.0
opentelemetry-exporter-otlp==1.21.0
opentelemetry-exporter-otlp-proto-common==1.21.0
opentelemetry-exporter-otlp-proto-grpc==1.21.0
opentelemetry-exporter-otlp-proto-http==1.21.0
opentelemetry-instrumentation==0.42b0
opentelemetry-instrumentation-asgi==0.42b0
opentelemetry-instrumentation-fastapi==0.42b0
opentelemetry-instrumentation-httpx==0.42b0
opentelemetry-proto==1.21.0
opentelemetry-sdk==1.21.0
opentelemetry-semantic-conventions==0.42b0
opentelemetry-util-http==0.42b0
protobuf==4.25.1
pydantic==2.5.2
pydantic_core==2.14.5
requests==2.31.0
six==1.16.0
sniffio==1.3.0
starlette==0.27.0
thrift==0.16.0
typing_extensions==4.8.0
urllib3==2.1.0
wrapt==1.16.0
zipp==3.17.0

Note, not all of the imported packages are used in the code snippet; so the code and the required packages could be simplified somewhat, and also will likely change for us in any event because we use python-requests instead of httpx. But, it illustrates that the opentelemetry SDK is working for sending spans (or messages or whatever they are called) to our Jaeger collector on jaeger.rtx.ai:

(venv) rt@d1fd345478a0:~$ python3 test_otel.py
/home/rt/test_otel.py:12: DeprecationWarning: Call to deprecated method __init__. (Since v1.35, the Jaeger supports OTLP natively. Please use the OTLP exporter instead. Support for this exporter will end July 2023.) -- Deprecated since version 1.16.0.
  jaeger_exporter = JaegerExporter(agent_host_name=jaeger_host, agent_port=jaeger_port)
doing some work...

And the view from the Jaeger frontend:

Screenshot 2023-12-05 at 5 01 11 PM
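
The deprecation warning in the output above recommends the OTLP exporter over the Jaeger Thrift exporter. With OTLP over HTTP, traces are posted to the `/v1/traces` path on the collector's OTLP HTTP port (4318 in the `docker run` command earlier in this thread). Here is a small sketch of building that endpoint URL; the function name `otlp_http_traces_endpoint` is hypothetical, and the resulting string is the kind of value one would pass as the `endpoint` argument of the OTLP HTTP span exporter from the `opentelemetry-exporter-otlp-proto-http` package listed in the `pip freeze` output.

```python
from urllib.parse import urlunsplit

# Default traces path defined by the OTLP/HTTP specification.
OTLP_HTTP_TRACES_PATH = "/v1/traces"


def otlp_http_traces_endpoint(host: str, port: int = 4318, scheme: str = "http") -> str:
    """Build the OTLP/HTTP traces URL for a collector at host:port."""
    netloc = f"{host}:{port}"
    # urlunsplit takes (scheme, netloc, path, query, fragment).
    return urlunsplit((scheme, netloc, OTLP_HTTP_TRACES_PATH, "", ""))
```

For the test server in this thread, `otlp_http_traces_endpoint("jaeger.rtx.ai")` gives `http://jaeger.rtx.ai:4318/v1/traces`. Whether ITRB's Jaeger accepts OTLP on the same port is an open question that the thread does not answer.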

edeutsch commented 10 months ago

wow, that's really adding a lot of... complexity

saramsey commented 10 months ago

Thank you @kvnthomas98 for putting together this PR.

kvnthomas98 commented 10 months ago

Telemetry instrumentation code has been added and merged to master. The Jaeger UI on both ITRB CI and jaeger.rtx.ai shows traces from the telemetry sent over. Thanks @saramsey and @edeutsch for the help!

kvnthomas98 commented 10 months ago

code pushed to ITRB-Test! closing!