DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Grafana and Prometheus use in Toil are not well documented #4083

Open jessebrennan opened 2 years ago

jessebrennan commented 2 years ago

Issue is synchronized with this Jira Story. friendlyId: TOIL-1159

boyangzhao commented 2 years ago

Hi @jessebrennan Can you provide some more instructions/guidance on how to view the dashboards when the run is done via AWS? The docs say "The dashboard can then be viewed in your browser at localhost:3000 while connected to the leader node through toil ssh-cluster". If I do toil ssh-cluster, that just lets me SSH into the leader node, which gives me a terminal session. If, on my local machine, I go to localhost:3000, obviously this doesn't go anywhere. I've also tried to set up a SOCKS proxy with the proxy server as 127.0.0.1:3000; with that I can see the dashboard, but it doesn't show any metrics, even though there is a job running.

adamnovak commented 2 years ago

We forward local port 3000 to the cluster whenever you toil ssh-cluster, here: https://github.com/DataBiosphere/toil/blob/9e73170ba180e76f93b967da2271e74c64d72f53/src/toil/utils/toilSshCluster.py#L51-L55

So you should be able to just toil ssh-cluster, and then on the same machine run a browser and go to http://localhost:3000/ and see the Grafana dashboard.
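
For example, a minimal check would look like this (a sketch; the zone and cluster name are placeholders for whatever you used with toil launch-cluster):

toil ssh-cluster --zone us-west-2a <cluster-name>
# keep that session open, then on the same local machine:
curl http://localhost:3000/
# or just open http://localhost:3000/ in a browser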

boyangzhao commented 2 years ago

Thanks. When I do toil ssh-cluster --zone us-east-1a <cluster-name>, going to localhost:3000 doesn't work. When I go to localhost:5050, the dashboard loads but I get the error Failed to connect to localhost:5050. I think this is what you were referring to after your recent Mesos dashboard upgrade. See below:

[Screenshot attached: Screen Shot 2022-07-08 at 7 32 37 PM]

When I tried toil ssh-cluster --sshOption="-D8080" --zone us-east-1a <cluster-name>, or with --sshOption="-L5050:localhost:5050", neither made any additional difference, and the Grafana dashboard at localhost:3000 still doesn't show.

adamnovak commented 1 year ago

The Mesos web UI on the leader at port 5050 is provided by Mesos, and at some point after we adopted Mesos its code changed so that it needs to be able to access the leader and workers directly from the browser at their cluster-internal URLs. It can't content itself with just talking to the host and port it appears to be served from.

So to make it work, you need to set up the SSH dynamic port forward (-D8080 or similar), and you need to set the browser's proxy settings to use localhost:8080 or whatever port you picked as a SOCKS proxy; I think you also might need to set it to run DNS requests through the proxy.
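
Roughly, that setup would look like this (an untested sketch; port 8080 is arbitrary, the zone and cluster name are placeholders, and the exact proxy settings depend on your browser):

toil ssh-cluster --sshOption="-D8080" --zone us-west-2a <cluster-name>
# in the browser's connection settings:
#   SOCKS host: localhost, port: 8080 (SOCKS v5)
#   enable the option to send DNS requests through the SOCKS proxy
# then browsing to http://localhost:5050/ should reach the Mesos UI, since with
# remote DNS the hostname gets resolved on the leader end of the SSH tunnel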

This isn't really very convenient, but we're not sure how to improve it without rewriting the Mesos web UI.

The Grafana dashboard should work at http://localhost:3000, if toil ssh-cluster is active and the Grafana dashboard is running on the cluster. We last touched this around https://github.com/DataBiosphere/toil/pull/4123, and at that point I saw the Grafana dashboard working for @jonathanxu18, but since then I haven't tried to use it and it isn't really covered by CI, so it is possible it is broken.

One test would be to run curl http://localhost:3000 both on the cluster and on the local machine; if the dashboard is working, both would be expected to give the same non-error output.

adamnovak commented 1 year ago

OK, this doesn't work:

toil launch-cluster -T mesos -z us-west-2a --leaderNodeType t2.medium --keyPairName anovak@kolossus adamnovak-testcluster

toil ssh-cluster -z us-west-2a adamnovak-testcluster

curl http://localhost:3000

exit

toil destroy-cluster -z us-west-2a adamnovak-testcluster

And it is because Grafana (at least on Mesos?) is started by the workflow, not as part of the cluster: https://github.com/DataBiosphere/toil/blob/49d818b448d6d6da8a4f6fecfd603b394c0c626e/src/toil/common.py#L646-L648

So we need to note that in the docs; Grafana only exists while a workflow with --metrics is running.
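
So the flow that should actually give a live dashboard on an AWS/Mesos cluster looks roughly like this (a sketch; the zone, cluster name, and workflow invocation are placeholders, and the dashboard only exists while the workflow runs):

# terminal 1: open the tunnel and a shell on the leader
toil ssh-cluster -z us-west-2a <cluster-name>
# inside that session, start a workflow with --metrics
# (the exact command depends on your workflow and job store)

# meanwhile, on the local machine, while the workflow is still running:
curl http://localhost:3000
# or open http://localhost:3000/ in a browser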

boyangzhao commented 1 year ago

When I tried to run this locally,

toil-cwl-runner --metrics helloworld.cwl helloworld.job.yaml

It'll finish the run, but the output includes:

...
[2023-02-12T10:57:38+0100] [MainThread] [I] [toil] Using default docker registry of quay.io/ucsc_cgl as TOIL_DOCKER_REGISTRY is not set.
toil_prometheus
WARNING: Published ports are discarded when using host network mode
3e7423f6a8872b2cacf1a7355ecb73cc22ca10d8c1805232eac7d62ede4662e6
toil_grafana
c0a9935e510081e6684b2084c3f4a3778b20e268f91be24373b51fdc19857f21
[2023-02-12T10:57:40+0100] [MainThread] [W] [toil.lib.retry] Error in <function ToilMetrics.add_prometheus_data_source at 0x7f7b5923c9d0>: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')). Retrying after 1 s...
...

I can see the dashboard at localhost:3000 but no stats are recorded.

And if I run this with AWS (using toil launch-cluster to launch the AWS cluster), with toil ssh-cluster, I can see localhost:5050 for Mesos, but localhost:3000 cannot be reached (neither with curl on the leader node, nor by pointing my local browser at it). This is the case whether I try to reach Grafana while a workflow is running or after it is finished. I don't know if it is related, but in this same run the stderr shows the following:

...
[2023-02-12T09:51:47+0000] [scaler ] [D] [toil.provisioners.clusterScaler] Cluster already at desired size of 0. Nothing to do.
[2023-02-12T09:51:47+0000] [scaler ] [D] [toil.provisioners.aws.awsProvisioner] All nodes in cluster: [Instance:i-08c3d21bce52151b9]
[2023-02-12T09:51:47+0000] [scaler ] [D] [toil.provisioners.aws.awsProvisioner] All workers found in cluster: []
[2023-02-12T09:51:47+0000] [scaler ] [D] [toil.provisioners.aws.awsProvisioner] preemptible workers found in cluster: []
[2023-02-12T09:51:47+0000] [scaler ] [D] [toil.provisioners.clusterScaler] Cluster contains 0 instances
[2023-02-12T09:51:47+0000] [scaler ] [D] [toil.provisioners.clusterScaler] Cluster contains 0 instances of type r5.8xlarge (0 ignored and draining jobs until they can be safely terminated)
[2023-02-12T09:51:47+0000] [scaler ] [D] [toil.provisioners.clusterScaler] Cluster already at desired size of 0. Nothing to do.
[2023-02-12T09:51:47+0000] [MainThread] [D] [toil.leader] Worker shutdown complete in 58.98966884613037 seconds.
[2023-02-12T09:51:47+0000] [MainThread] [D] [toil.serviceManager] Waiting for service manager thread to finish ...
[2023-02-12T09:51:47+0000] [Thread-11 ] [D] [toil.serviceManager] Received signal to quit starting services.
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.serviceManager] ... finished shutting down the service manager. Took 1.0488615036010742 seconds
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.statsAndLogging] Waiting for stats and logging collator thread to finish ...
[2023-02-12T09:51:48+0000] [Thread-15 ] [D] [toil.jobStores.aws.jobStore] Inlining content of 72 bytes
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.statsAndLogging] ... finished collating stats and logs. Took 0.45523929595947266 seconds
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.common] Stopping mtail
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.common] Stopped mtail
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.common] Shutting down batch system ...
[2023-02-12T09:51:48+0000] [Thread-5 ] [D] [toil.batchSystems.singleMachine] Daddy thread cleaning up 0 remaining children for batch system 140648140382112...
[2023-02-12T09:51:48+0000] [Thread-5 ] [D] [toil.batchSystems.singleMachine] Daddy thread for batch system 140648140382112 finishing because no children should now exist
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.deferred] Cleaning up deferred functions system
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.deferred] Opened with own state file /run/lock/304a160fa6c552ec9d63b819e1f02736/deferred/funce_ujoiij
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.deferred] Running orphaned deferred functions
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.deferred] Ran orphaned deferred functions from 0 abandoned state files
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.deferred] Removing own state file /run/lock/304a160fa6c552ec9d63b819e1f02736/deferred/funce_ujoiij
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.batchSystems.mesos.batchSystem] Stopping Mesos driver
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.batchSystems.mesos.batchSystem] Joining Mesos driver
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.batchSystems.mesos.batchSystem] Joined Mesos driver
[2023-02-12T09:51:48+0000] [MainThread] [D] [toil.common] ... finished shutting down the batch system in 0.010252952575683594 seconds.
Traceback (most recent call last):
  File "/usr/local/bin/toil-cwl-runner", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/toil/cwl/cwltoil.py", line 3794, in main
    outobj = toil.start(wf1)
  File "/usr/local/lib/python3.8/dist-packages/toil/common.py", line 1030, in start
    return self._runMainLoop(rootJobDescription)
  File "/usr/local/lib/python3.8/dist-packages/toil/common.py", line 1470, in _runMainLoop
    return Leader(config=self.config,
  File "/usr/local/lib/python3.8/dist-packages/toil/leader.py", line 292, in run
    self.innerLoop()
  File "/usr/local/lib/python3.8/dist-packages/toil/leader.py", line 773, in innerLoop
    while self._messages.count(JobUpdatedMessage) > 0 or \
  File "/usr/local/lib/python3.8/dist-packages/toil/bus.py", line 533, in count
    self._check_bus()
  File "/usr/local/lib/python3.8/dist-packages/toil/bus.py", line 527, in _check_bus
    self._bus.check()
  File "/usr/local/lib/python3.8/dist-packages/toil/bus.py", line 322, in check
    self._deliver(message)
  File "/usr/local/lib/python3.8/dist-packages/toil/bus.py", line 330, in _deliver
    self._pubsub.sendMessage(topic, message=message)
  File "/usr/local/lib/python3.8/dist-packages/pubsub/core/publisher.py", line 216, in sendMessage
    topicObj.publish(msgData)
  File "/usr/local/lib/python3.8/dist-packages/pubsub/core/topicobj.py", line 452, in publish
    self.sendMessage(msgData, topicObj, msgDataSubset)
  File "/usr/local/lib/python3.8/dist-packages/pubsub/core/topicobj.py", line 482, in sendMessage
    listener(data, self, allData)
  File "/usr/local/lib/python3.8/dist-packages/pubsub/core/listener.py", line 237, in call
    cb(kwargs)
  File "/usr/local/lib/python3.8/dist-packages/toil/bus.py", line 348, in handler_wraper
    handler(message)
  File "/usr/local/lib/python3.8/dist-packages/toil/common.py", line 1656, in logClusterSize
    self.log("current_size '%s' %i" % (m.instance_type, m.current_size))
  File "/usr/local/lib/python3.8/dist-packages/toil/common.py", line 1648, in log
    self.mtailProc.stdin.flush() # type: ignore[union-attr]
BrokenPipeError: [Errno 32] Broken pipe

Do I need to manually start the Prometheus, Grafana, and mtail Docker containers locally? Or do I need to manually start Grafana on the cluster (and if so, how)?

adamnovak commented 10 months ago

That error message might be related, but I think it is happening because we stop mtail and then afterwards try to log the cluster size to it, at the end of the workflow.

It does sound like the involved servers are not coming up as necessary for ToilMetrics.add_prometheus_data_source to succeed, even when running on a Toil-configured Mesos cluster and with --metrics used on the workflow. It might be that Prometheus has broken, somehow. We need to try and replicate that, but I think that's a separate issue from the lack of documentation.
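
One way to check whether the metrics containers actually came up on the machine where the leader is running (a diagnostic sketch; the container names come from the log output above, and 9090 is Prometheus's default port, so adjust if your deployment differs):

docker ps | grep toil_          # should list toil_prometheus and toil_grafana while the workflow runs
curl -s http://localhost:9090/-/healthy   # Prometheus liveness endpoint
curl -s http://localhost:3000/api/health  # Grafana health endpoint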