NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

Architecture request: design a "tool monitoring" architecture to warn of downtime before the user encounters it. #108

Closed sierra-moxon closed 4 months ago

sierra-moxon commented 1 year ago

From the TAQA meeting, in response to #104: can we implement something like "uptime robot" or "monit" to regularly survey the uptime of the tools that serve the UI?

For architecture discussion.

gaurav commented 1 year ago

This is something I was planning to set up as an hourly GitHub Action (https://github.com/TranslatorSRI/NodeNormalization/issues/161), but if there's an easier/more sophisticated way of doing this, I'm all ears!

cbizon commented 1 year ago

We have implemented some uptimerobot pages for nodenorm. It seems to be very helpful & I think we should expand this solution.

edeutsch commented 1 year ago

FWIW, we have been using the free version of uptimerobot to monitor our systems with reasonably good results

image

cbizon commented 1 year ago

I think we know what we need to do here: create public Uptime Robot pages like this for

  1. Nodenorm
  2. NameRes in each environment.

Are there other tools that should be included?

MarkDWilliams commented 1 year ago

The /actors endpoint on the ARS is a fast way to check if it is up and running. https://ars-prod.transltr.io/ars/api/actors

Can notify Shervin and/or myself. Slack preferred. We have an existing uptimebot on our internal Slack.

edeutsch commented 1 year ago

For ARAX and RTX-KG2, you can use these for a fast basic functioning check:

https://arax.transltr.io/api/arax/v1.3/entity?q=metformin
https://kg2.transltr.io/api/rtxkg2/v1.3/entity?q=metformin

Note that when TRAPI 1.4 is fully deployed to prod, you'll need to replace v1.3 with v1.4 in the URLs.
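One way to make the version bump painless is to keep the TRAPI version as a parameter instead of hard-coding it in each monitor URL. The following is only a sketch: `entity_check_url` and its parameter names are hypothetical, and only the URL pattern itself comes from the comment above.

```python
from urllib.parse import urlencode


def entity_check_url(host: str, service: str, trapi_version: str,
                     term: str = "metformin") -> str:
    """Build an /entity check URL from a host, a service path segment,
    and a TRAPI version (hypothetical helper, not part of ARAX or KG2)."""
    query = urlencode({"q": term})
    return f"https://{host}/api/{service}/v{trapi_version}/entity?{query}"


# The two checks from the comment above, built for TRAPI 1.3.
ARAX_CHECK = entity_check_url("arax.transltr.io", "arax", "1.3")
KG2_CHECK = entity_check_url("kg2.transltr.io", "rtxkg2", "1.3")
```

Switching to a later TRAPI version then only means changing the version argument.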

CaseyTa commented 1 year ago

https://cohd-api.transltr.io/health - notify Casey
https://openpredict.transltr.io/health - notify Vincent
https://collaboratory-api.transltr.io/health - notify Vincent

gaurav commented 1 year ago

NodeNorm can be tested with a GET request to https://nodenorm.transltr.io/1.3/get_normalized_nodes?curie=MESH%3AD014867&curie=NCIT%3AC34373&conflate=true -- the top-level URL will change to /1.4/ in the next release.
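For anyone scripting this check, the request URL above can be assembled from its parts so that the version segment and the CURIEs are easy to change. This is a rough sketch using only the Python standard library; `nodenorm_check_url` is a hypothetical helper name, not something NodeNorm provides.

```python
from urllib.parse import urlencode


def nodenorm_check_url(version: str, curies: list[str],
                       conflate: bool = True) -> str:
    """Build a get_normalized_nodes GET URL (hypothetical helper).

    Each CURIE becomes its own repeated `curie` query parameter, and
    urlencode percent-encodes the colon in each CURIE as %3A.
    """
    params = [("curie", c) for c in curies]
    params.append(("conflate", "true" if conflate else "false"))
    return (f"https://nodenorm.transltr.io/{version}"
            f"/get_normalized_nodes?{urlencode(params)}")


# Same request as in the comment above, with the version kept separate.
url = nodenorm_check_url("1.3", ["MESH:D014867", "NCIT:C34373"])
```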

NameRes will get a GET endpoint in the next release (e.g. http://name-resolution-sri-dev.apps.renci.org/lookup?string=ath%2A&offset=0&limit=10&biolink_type=Disease&only_prefixes=MONDO%7CHP), but this hasn't been pushed to RENCI Dev or ITRB yet.

You can notify me (@gaurav) about both NodeNorm and NameRes downtime. I believe @cbizon is already using Uptime Robot to track NodeNorm.

newgene commented 1 year ago

Annotator

https://biothings.transltr.io/annotator/NCBIGene:1017

can notify me on slack or email

newgene commented 1 year ago

Service Provider

https://biothings.transltr.io/idisk/status

can notify me on slack or email

newgene commented 1 year ago

BTE ARA

https://bte.transltr.io/v1/smartapi/59dce17363dce279d389100834e43648/meta_knowledge_graph

can notify Jackson (@tokebe ) on slack or email

edgargaticaCU commented 1 year ago

TextMiningKP has UptimeRobot set up to check the cooccurrence API at https://cooccurrence.transltr.io/meta_knowledge_graph and the dev instance of DocumentMetadataAPI at https://3md2qwxrrk.us-east-1.awsapprunner.com/. We are still working to set up ITRB environment instances of DocumentMetadataAPI, but the '/' endpoint will be the health check for all of them. You can notify me (@edgargaticaCU) via Slack or email.

tokebe commented 1 year ago

Adding a note that BTE has an existing monitor that we use internally, available here: https://stats.uptimerobot.com/7y2AFz1gv

marcdubybroad commented 1 year ago

Genetics KP:

https://genetics-kp.transltr.io/genetics_provider/trapi/v1.4/meta_knowledge_graph

Can notify me on Slack or email

webyrd commented 1 year ago

Unsecret Agent/mediKanren

Monitoring of these endpoints would be very helpful.

Prod:

https://medikanren-trapi.transltr.io/health

Test:

https://medikanren-trapi.test.transltr.io/health

CI:

https://medikanren-trapi.ci.transltr.io/health

You can notify me by Slack or email.

gglusman commented 1 year ago

I assume ours are covered by Service Provider. Or should we test each endpoint separately?

EvanDietzMorris commented 1 year ago

We have uptime robot set up already for Aragorn (https://stats.uptimerobot.com/rl5E2iwr7K), and are planning to add monitors for the platers/automat. If we add one for every plater on every maturity level of deployment we will surpass the free tier limit of 50 monitors.

Example: https://automat.renci.org/biolink/1.4/meta_knowledge_graph - notify Evan Morris

We also have our own uptime monitoring solution (https://github.com/ranking-agent/api-watchdog-translator-tests) which has no limits and can support more complex health checks (ie query results) than the free tier of uptime robot. We plan on adding more checks to this, regardless of uptime robot, but haven't for all of our services yet.
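To illustrate what a "more complex" check could look like (this is not how api-watchdog-translator-tests is implemented), here is a minimal sketch that treats a service as up only if it returns HTTP 200 and the body passes a content check. All names here (`http_fetch`, `has_nodes_and_edges`, `check`) are hypothetical, and the assumption that a healthy meta_knowledge_graph response has non-empty `nodes` and `edges` is mine.

```python
import json
from typing import Callable, Tuple
from urllib.request import urlopen

# A fetcher takes a URL and returns (HTTP status code, response body).
Fetch = Callable[[str], Tuple[int, bytes]]


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Default fetcher; urlopen raises on network errors and HTTP errors."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()


def has_nodes_and_edges(body: bytes) -> bool:
    """Assumed content check: JSON object with non-empty nodes and edges."""
    try:
        doc = json.loads(body)
    except ValueError:  # invalid JSON or undecodable bytes
        return False
    return isinstance(doc, dict) and bool(doc.get("nodes")) and bool(doc.get("edges"))


def check(url: str, validate: Callable[[bytes], bool],
          fetch: Fetch = http_fetch) -> bool:
    """Return True only if the fetch succeeds with 200 and the body validates."""
    try:
        status, body = fetch(url)
    except Exception:  # any fetch failure counts as "down"
        return False
    return status == 200 and validate(body)
```

Passing the fetcher in as an argument keeps the check easy to exercise without touching the network, e.g. `check(url, has_nodes_and_edges, fetch=my_fake_fetch)`.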

capasfield commented 1 year ago

Please use these updated URLs for ARAX/KG2:

https://arax.transltr.io/api/arax/v1.4/entity?q=metformin
https://kg2.transltr.io/api/rtxkg2/v1.4/entity?q=metformin

uhbrar commented 1 year ago

ARAGORN

https://aragorn.transltr.io/1.4/docs

Notify Abrar Mesbah or Chris Bizon by slack or email

uhbrar commented 1 year ago

Workflow Runner

https://translator-workflow-runner.transltr.io/services

Notify Abrar Mesbah by slack or email

EvanDietzMorris commented 1 year ago

This is the full list for Automat; notify Evan Morris through Slack or email.

https://automat.renci.org/biolink/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/biolink/1.4/meta_knowledge_graph
https://automat.test.transltr.io/biolink/1.4/meta_knowledge_graph
https://automat.transltr.io/biolink/1.4/meta_knowledge_graph
https://automat.renci.org/cam-kp/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/cam-kp/1.4/meta_knowledge_graph
https://automat.test.transltr.io/cam-kp/1.4/meta_knowledge_graph
https://automat.transltr.io/cam-kp/1.4/meta_knowledge_graph
https://automat.renci.org/ctd/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/ctd/1.4/meta_knowledge_graph
https://automat.test.transltr.io/ctd/1.4/meta_knowledge_graph
https://automat.transltr.io/ctd/1.4/meta_knowledge_graph
https://automat.renci.org/drugcentral/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/drugcentral/1.4/meta_knowledge_graph
https://automat.test.transltr.io/drugcentral/1.4/meta_knowledge_graph
https://automat.transltr.io/drugcentral/1.4/meta_knowledge_graph
https://automat.renci.org/genome-alliance/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/genome-alliance/1.4/meta_knowledge_graph
https://automat.test.transltr.io/genome-alliance/1.4/meta_knowledge_graph
https://automat.transltr.io/genome-alliance/1.4/meta_knowledge_graph
https://automat.renci.org/gtex/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/gtex/1.4/meta_knowledge_graph
https://automat.test.transltr.io/gtex/1.4/meta_knowledge_graph
https://automat.transltr.io/gtex/1.4/meta_knowledge_graph
https://automat.renci.org/gtopdb/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/gtopdb/1.4/meta_knowledge_graph
https://automat.test.transltr.io/gtopdb/1.4/meta_knowledge_graph
https://automat.transltr.io/gtopdb/1.4/meta_knowledge_graph
https://automat.renci.org/gwas-catalog/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/gwas-catalog/1.4/meta_knowledge_graph
https://automat.test.transltr.io/gwas-catalog/1.4/meta_knowledge_graph
https://automat.transltr.io/gwas-catalog/1.4/meta_knowledge_graph
https://automat.renci.org/hetio/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/hetio/1.4/meta_knowledge_graph
https://automat.test.transltr.io/hetio/1.4/meta_knowledge_graph
https://automat.transltr.io/hetio/1.4/meta_knowledge_graph
https://automat.renci.org/hgnc/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/hgnc/1.4/meta_knowledge_graph
https://automat.test.transltr.io/hgnc/1.4/meta_knowledge_graph
https://automat.transltr.io/hgnc/1.4/meta_knowledge_graph
https://automat.renci.org/hmdb/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/hmdb/1.4/meta_knowledge_graph
https://automat.test.transltr.io/hmdb/1.4/meta_knowledge_graph
https://automat.transltr.io/hmdb/1.4/meta_knowledge_graph
https://automat.renci.org/human-goa/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/human-goa/1.4/meta_knowledge_graph
https://automat.test.transltr.io/human-goa/1.4/meta_knowledge_graph
https://automat.transltr.io/human-goa/1.4/meta_knowledge_graph
https://automat.renci.org/icees-kg/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/icees-kg/1.4/meta_knowledge_graph
https://automat.test.transltr.io/icees-kg/1.4/meta_knowledge_graph
https://automat.transltr.io/icees-kg/1.4/meta_knowledge_graph
https://automat.renci.org/intact/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/intact/1.4/meta_knowledge_graph
https://automat.test.transltr.io/intact/1.4/meta_knowledge_graph
https://automat.transltr.io/intact/1.4/meta_knowledge_graph
https://automat.renci.org/panther/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/panther/1.4/meta_knowledge_graph
https://automat.test.transltr.io/panther/1.4/meta_knowledge_graph
https://automat.transltr.io/panther/1.4/meta_knowledge_graph
https://automat.renci.org/pharos/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/pharos/1.4/meta_knowledge_graph
https://automat.test.transltr.io/pharos/1.4/meta_knowledge_graph
https://automat.transltr.io/pharos/1.4/meta_knowledge_graph
https://automat.renci.org/robokopkg/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/robokopkg/1.4/meta_knowledge_graph
https://automat.test.transltr.io/robokopkg/1.4/meta_knowledge_graph
https://automat.transltr.io/robokopkg/1.4/meta_knowledge_graph
https://automat.renci.org/sri-reference-kg/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/sri-reference-kg/1.4/meta_knowledge_graph
https://automat.test.transltr.io/sri-reference-kg/1.4/meta_knowledge_graph
https://automat.transltr.io/sri-reference-kg/1.4/meta_knowledge_graph
https://automat.renci.org/string-db/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/string-db/1.4/meta_knowledge_graph
https://automat.test.transltr.io/string-db/1.4/meta_knowledge_graph
https://automat.transltr.io/string-db/1.4/meta_knowledge_graph
https://automat.renci.org/ubergraph/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/ubergraph/1.4/meta_knowledge_graph
https://automat.test.transltr.io/ubergraph/1.4/meta_knowledge_graph
https://automat.transltr.io/ubergraph/1.4/meta_knowledge_graph
https://automat.renci.org/ubergraph-nonredundant/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/ubergraph-nonredundant/1.4/meta_knowledge_graph
https://automat.test.transltr.io/ubergraph-nonredundant/1.4/meta_knowledge_graph
https://automat.transltr.io/ubergraph-nonredundant/1.4/meta_knowledge_graph
https://automat.renci.org/viral-proteome/1.4/meta_knowledge_graph
https://automat.ci.transltr.io/viral-proteome/1.4/meta_knowledge_graph
https://automat.test.transltr.io/viral-proteome/1.4/meta_knowledge_graph
https://automat.transltr.io/viral-proteome/1.4/meta_knowledge_graph
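Since the list above is every plater crossed with every host, it could also be generated rather than maintained by hand. The sketch below assumes the naming pattern stays as regular as it is in this list; `automat_urls`, `HOSTS`, `PLATERS` and `FREE_TIER_LIMIT` are hypothetical names, and the limit of 50 is the free-tier figure quoted earlier in this thread.

```python
from itertools import product

# Hosts and plater names copied from the list above.
HOSTS = [
    "automat.renci.org",
    "automat.ci.transltr.io",
    "automat.test.transltr.io",
    "automat.transltr.io",
]
PLATERS = [
    "biolink", "cam-kp", "ctd", "drugcentral", "genome-alliance", "gtex",
    "gtopdb", "gwas-catalog", "hetio", "hgnc", "hmdb", "human-goa",
    "icees-kg", "intact", "panther", "pharos", "robokopkg",
    "sri-reference-kg", "string-db", "ubergraph", "ubergraph-nonredundant",
    "viral-proteome",
]
FREE_TIER_LIMIT = 50  # monitor limit quoted earlier in this thread


def automat_urls(version: str = "1.4") -> list[str]:
    """One meta_knowledge_graph URL per (plater, host) pair, plater-major order."""
    return [
        f"https://{host}/{plater}/{version}/meta_knowledge_graph"
        for plater, host in product(PLATERS, HOSTS)
    ]


urls = automat_urls()
over_limit = len(urls) > FREE_TIER_LIMIT  # 22 platers x 4 hosts = 88 monitors
```

At 88 URLs, this list alone is above the 50-monitor free tier, which matches the concern raised earlier about monitoring every plater at every maturity level.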

putmantime commented 1 year ago

I put all the URLs and team info listed in this thread in a Google Sheet. Please add any additional monitors that you would like implemented there.

I have set up Uptime Robot monitors for all of them. They are integrated with Slack channels, and the contact for each has been invited to the relevant channel.

Automat currently has one channel for all its services. I realize this is not ideal, but setting up the integrations is tedious work; I will eventually move them all into their own channels if that would be preferred.

Please let me know if you have any issues with alerts on your slack channel.

In addition to the slack integrations I will be setting up an overall dashboard with a persistent URL soon.