haixuanTao opened 1 month ago
- Remove the dora coordinator. The coordinator is still a source of many issues, while most of our use cases are local single dataflows, making it very difficult to justify using it. We had 15 issues with the coordinator so far (coordinator).
Unfortunately we need the coordinator for deploying dora to multiple machines. It is responsible for connecting daemons to each other and for handling CLI commands. How would you do these things without the coordinator?
- Make daemons connect to other daemons when a dataflow is spawned.
But how are the daemons finding each other? We could use some multicast messages for local networks or use some existing communication layer such as zenoh.
- Make them based on IP, not machine ID, as it's going to be more transparent to users and easier to understand connection issues.
The reason we're using machine IDs instead of IPs is that the IPs might change or even be dynamic. If you want to commit your dataflow.yml to git, you need some kind of abstraction to hide the actual IP addresses. Even in a local network with fixed IPs, the actual prefix will probably differ between environments.
- Make the dora-daemon print to stdout what is happening to the dataflow, as it is simply too obscure at the moment; things like hardware issues and network connections would be a lot easier to track on a single process's stdout. We have refactored our logging mechanism a couple of times, but it is still too difficult for beginners who might not be familiar with terminal commands. This will also make it simpler to integrate with things like systemctl.
We're already doing that, aren't we? We can of course lower the default logging level to also print INFO and DEBUG messages instead of only warnings and errors.
- Refactor our CLI so that `dora daemon --start-dataflow` becomes the default `dora start` behaviour, and make the dora-daemon the default process that is started when running the CLI. This will remove a layer of complexity of having the daemon running in the background.
So you want a separate, single-process `dora` command that does not require launching any additional executables? I'm fine with adding such a command in addition, but I don't think that we should change our core architecture. After all, distributed deployment is an explicit goal of dora, so this should still be possible.
But how are the daemons finding each other? We could use some multicast messages for local networks or use some existing communication layer such as zenoh.
I really think that we should let users make explicit which daemon address they want to connect to, and that we should not abstract the networking layer, as it is out of our scope. There are many things out of our control:

- It's hard to do. If we have 2 daemons in different networks: LAN and Internet, there simply is no simple way for dora to make sure those two can communicate. We had the problem at the GOSIM Hackathon that we could not get an external IP and so we could not connect a cloud machine with our local machine.
- It's also hard to debug those issues: by abstracting away the networking part, if there is a networking issue on any of the connections CLI <-> Coordinator <-> Daemon <-> Daemon, there is no simple way to debug it, and we might not be able to cover all error cases.
- It's hard to automate. I dare you to try to connect 10 local machines at the same time. Meaning:
  - ssh to each individual machine
  - run the dora daemon connection line, explicitly specifying the individual machine name (knowing that this might not be a typical Linux machine)
  - run the dora coordinator (making sure only 1 is running, as multiple will make it impossible to recover from connecting)
  - run the dora cli
  - and destroy everything and restart if there is any failure.

This is extremely tiring and inefficient.
The reason we're using machine IDs instead of IPs is that the IPs might change or even be dynamic. If you want to commit your dataflow.yml to git, you need some kind of abstraction to hide the actual IP addresses. Even in a local network with fixed IPs, the actual prefix will probably differ between environments.
I would largely prefer using environment variables to hide IPs, as is most of the time done for secrets in Docker, Kubernetes, GitHub Actions, package managers, ... As you point out, the IP might change, and the prefix might not be the same depending on the environment. Abstracting away the IP address means that we need to make sure that ANY daemon connects to the daemon in question based only on an ID, which hides a registered IP that might be local. I can only encourage you to try it between a local daemon and a remote daemon.
We're already doing that, aren't we? We can of course lower the default logging level to also print INFO and DEBUG messages instead of only warnings and errors.
No, currently the only errors that are reported are fatal errors from the daemon, but not errors from the actual nodes. And sporadically, errors are not reported at all.
But it would be better to just let the stdout of each node go to the `dora start` stdout. There is too much stdout output that is extremely important within the Python and robotics ecosystem (connection errors, downloading models, ..., hardware issues, ...) that we can't expect to catch in dora.
So you want a separate, single-process dora command that does not require launching any additional executables? I'm fine with adding such a command in addition, but I don't think that we should change our core architecture. After all, distributed deployment is an explicit goal of dora, so this should still be possible.
I think that the current setup makes distributed deployment just about impossible, and I can only encourage you to try it for yourself, as it's just going to be an uphill battle of fixing issues instead of working on meaningful features...
What I think we should do is:
```yaml
nodes:
  - id: rust-node
    build: cargo build -p multiple-daemons-example-node
    path: ../../target/debug/multiple-daemons-example-node
    inputs:
      tick: dora/timer/millis/10
    outputs:
      - random
  - id: runtime-node
    _unstable_deploy:
      machine: 10.14.0.2
    operators:
      - id: rust-operator
        build: cargo build -p multiple-daemons-example-operator
        shared-library: ../../target/debug/multiple_daemons_example_operator
        inputs:
          tick: dora/timer/millis/100
          random: rust-node/random
        outputs:
          - status
  - id: rust-sink
    _unstable_deploy:
      machine: 10.14.0.1
    build: cargo build -p multiple-daemons-example-sink
    path: ../../target/debug/multiple-daemons-example-sink
    inputs:
      message: runtime-node/rust-operator/status
```
```bash
# machine 10.14.0.1
dora daemon --address 0.0.0.0
# > Stdout of this machine here

# machine 10.14.0.2
dora daemon --address 0.0.0.0
# > Stdout of this machine here

# machine 10.14.0.3
dora daemon start dataflow.yaml
# > Stdout of this machine here
```
And no other processes; the inter-daemon connections happen on daemon start.
We can then put `dora daemon` in either systemctl or the OS-specific service manager to have the daemons automatically spawn and restart on failure.
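A minimal sketch of that, with an assumed unit name, binary path, and daemon flags:

```bash
# Create a systemd unit that keeps the dora daemon running and restarts it on failure.
sudo tee /etc/systemd/system/dora-daemon.service > /dev/null <<'EOF'
[Unit]
Description=dora daemon
After=network-online.target

[Service]
# Binary path and flags are assumptions; adjust to your install.
ExecStart=/usr/local/bin/dora daemon --address 0.0.0.0
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now dora-daemon
```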
I totally agree with those ideas:
Using Zenoh for inter-daemon connection could be great, and I can help!
We should definitely make this single-process daemon. However, what's the behavior with multiple dataflows? Your new architecture seems to be "1 daemon for 1 dataflow", can you explain?
For multiple dataflows, I would be more in favor of having multiple daemons.
I think it would make things simpler. It would also avoid conflicts between multiple dataflows. Imagine building a dataflow and breaking another one running in parallel.
The only machines that would run multiple dataflows are cloud machines, and I think that it would be better if each dataflow had a separate daemon with its own address.
I fully agree with you that the current setup for multi-machine deployments is cumbersome to use and very difficult to get right. It is just a first prototype without any convenience features.
You mention multiple things, so let me try to reply to them one by one:
Regarding "specify IP instead of machine ID in dataflow.yml": You propose that we specify the machine by IP address, instead of ID:
```yaml
_unstable_deploy:
  machine: 10.14.0.2
```
I don't think that it's a good idea to specify the machines like that, because the `dataflow.yml` file is often committed to git. So if someone else wants to check out your project, they need to manually edit the file to replace all of these IPs with their local IPs. Then their git working directory is dirty, so they have to decide whether they want to commit the IP changes or keep the file around as dirty.
Also, the IP of the target machine might change. For example, DHCP might assign a new IP to your remote machine after it's restarted. Then you need to update all of your dataflow.yml files to change the old IP to the new. You can also have this situation for cloud machines that are assigned a public IP from a pool.
Regarding "connecting cloud and local machines:
- If we have 2 daemons in different networks: LAN and Internet, there simply is no simple way for dora to make sure those two can communicate. We had the problem at the GOSIM Hackathon that we could not get an external IP and so we could not connect a cloud machine with our local machine.

I'm not sure how specifying an IP address in the dataflow would help with that? If the machine has no public IP, how can we connect to it?
In my understanding, the current approach based on machine IDs should make this easier compared to defining IPs. The idea is that you don't need to know the IP addresses of each daemon (they might not even have a public IP). The only requirement is that the coordinator has a public IP and is reachable by the daemons. Then the coordinator can communicate back to the daemons through that connection. The only remaining challenge is inter-daemon messages in such a situation, which would require some sort of tunneling or custom routing.
I think using zenoh could help to make this simpler, but for that we also need some kind of identifier for each daemon because it abstracts the IP address away.
Regarding "automation":
It's hard to automate. I dare you to try to connect 10 local machines at the same time. Meaning:

- ssh to each individual machine
- run the dora daemon connection line, explicitly specifying the individual machine name (knowing that this might not be a typical Linux machine)
- run the dora coordinator (making sure only 1 is running, as multiple will make it impossible to recover from connecting)
- run the dora cli
- and destroy everything and restart if there is any failure.

I'm not sure how specifying IP addresses would help with this? Sure, you avoid the machine ID argument, but you still have to record the IP addresses for each machine and assign the nodes to machines.
My intention was that the process should look like this (see the command sketch after the list):

- SSH to the machine where the coordinator should run and start it there.
- Remember the IP of the coordinator machine.
- SSH to each machine where you want to run a daemon.
  - Start the daemon, assigning some unique ID (could be as simple as machine-1, machine-2, etc).
  - Pass the coordinator IP as argument so that the daemon can connect.
- Run the dora CLI to start and stop dataflows.
  - If the coordinator is running on a remote machine, we need to specify the coordinator IP as argument.
  - In the future, it would be nice to remember the coordinator IP in some way, e.g. through a config file or by having a separate `dora connect` command.
- If a dataflow fails: fix your files, then do another `dora start`.
  - You should never need to restart any daemon or coordinator.
  - `dora destroy` would be only needed if you want to shut down your machines and stop everything that is currently running.
  - (If there are any instances where we need to restart a daemon, we should fix those bugs.)
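A command-level sketch of that process under the current design (the machine IDs and addresses are placeholders, and the exact CLI flag for passing the coordinator address is an assumption):

```bash
# On the coordinator machine (must be reachable by all daemons, e.g. 203.0.113.10)
dora coordinator

# On each machine that should run nodes
dora daemon --machine-id machine-1 --coordinator-addr 203.0.113.10
dora daemon --machine-id machine-2 --coordinator-addr 203.0.113.10

# From any machine that can reach the coordinator
# (pass the coordinator address as an argument if it is remote)
dora start dataflow.yml
```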
Regarding "using ENV variables to specify IP addresses":

I would largely prefer using environment variables to hide IPs, as is most of the time done for secrets in Docker, Kubernetes, GitHub Actions, package managers, ...

Do I understand you correctly that you are thinking of something like this?

```yaml
_unstable_deploy:
  machine: IP_MACHINE_1
```

In that case, we still need to define some mapping from `IP_MACHINE_1` to the actual daemon, right?
Regarding "one daemon per dataflow":
For multiple dataflows, I would be more in favor of having multiple daemons. I think it would make things simpler. It would also avoid conflicts between multiple dataflows. Imagine building a dataflow and breaking another one running in parallel.

I'm not sure how it would make things simpler? With separate daemons per dataflow you would need to do the whole setup routine (ssh to the machines, etc) again and again for every dataflow you want to start, no? Also, you would need to specify which coordinator/daemon you want to connect to for every `dora` CLI command if there are multiple running in parallel. I fully agree that dataflows should not be able to interfere with each other. We already have all the messages namespaced to their dataflow UUID, so there is no way that nodes are receiving messages from different dataflows. There are still robustness issues of course, which can bring the whole daemon down. This is something that we have to work on and improve, but I don't think that changing the whole design of dora brings us to this goal any faster.
In my understanding, the current approach based on machine IDs should make this easier compared to defining IPs. The idea is that you don't need to know the IP addresses of each daemon (they might not even have a public IP). The only requirement is that the coordinator has a public IP and is reachable by the daemons. Then the coordinator can communicate back to the daemons through that connection. The only remaining challenge is inter-daemon messages in such a situation, which would require some sort of tunneling or custom routing.
Yes, exactly, the hard part (independently from dora) is the inter-daemon connection.
As mentioned above, I genuinely don't think that we should abstract away the network stack, in the near future at least.
Note that tunneling and custom routing would have to happen on every inter-daemon connection and also be resilient to disconnections.
If we try to use something like ssh, it is extremely hard to keep the connection up all the time with systemctl, as you can have recursive failures.
It really sounds like having some uncommitted env file and/or having to commit IP addresses that need to be somehow protected is easier to deal with than trying to abstract away the network layer.
We tried doing ssh tunneling during the GOSIM hackathon and this is way too hard for the common roboticist who just wants to connect their model from the cloud or LAN to their robot.
Also, the IP of the target machine might change. For example, DHCP might assign a new IP to your remote machine after it's restarted. Then you need to update all of your dataflow.yml files to change the old IP to the new. You can also have this situation for cloud machines that are assigned a public IP from a pool.
Basically my problem is that we have to run some imperative command:

```bash
dora daemon --coordinator-addr COORDINATOR_ADDR --machine-id abc
```

to be able to connect to the coordinator and run a process.

This step means that you need to have some way to either connect to the computer (ssh, ...) or restart this step on launch (systemctl, ...), and modify either machine-id or coordinator-addr by HAND if you want to change the coordinator or the name. This step is really not intuitive to me and I don't see how we can scale this.
Having a simple `dora daemon` on a robot means that anyone can connect to the robot without having to ssh into it, and it needs zero hand configuration.

Yes, IP addresses are dynamic, but there are plenty of tools to fix them, and I would rather people use DNS/NAT to solve this than have them ssh into the robot computer.
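For instance, a sketch assuming the deploy field accepted any resolvable address, such as a DNS or mDNS name, instead of a raw IP (hostname support here is only an illustration, not a confirmed feature):

```yaml
_unstable_deploy:
  # Name resolution is left to the network setup (DNS, mDNS, /etc/hosts, ...)
  machine: robot-1.local
```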
My intention was that the process should look like this:
- SSH to the machine where the coordinator should run and start it there.
- Remember the IP of the coordinator machine.
- SSH to each machine where you want to run a daemon.
  - Start the daemon, assigning some unique ID (could be as simple as machine-1, machine-2, etc).
  - Pass the coordinator IP as argument so that the daemon can connect.
- Run the dora CLI to start and stop dataflows.
  - If the coordinator is running on a remote machine, we need to specify the coordinator IP as argument.
  - In the future, it would be nice to remember the coordinator IP in some way, e.g. through a config file or by having a separate `dora connect` command.
- If a dataflow fails: fix your files, then do another `dora start`.
  - You should never need to restart any daemon or coordinator.
  - `dora destroy` would be only needed if you want to shut down your machines and stop everything that is currently running.
  - (If there are any instances where we need to restart a daemon, we should fix those bugs.)
This genuinely takes an hour to do on 10 robots, where it could have been instant with IP addresses, and it needs to be done on every start. It's also super easy to get wrong, and super annoying to deal with ssh passwords...
```yaml
_unstable_deploy:
  machine: IP_MACHINE_1
```

More like:

```yaml
_unstable_deploy:
  machine: $IP_MACHINE_1
```
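A usage sketch, assuming the YAML loader expands environment variables as proposed here (this expansion is part of the proposal, not existing behaviour):

```bash
# Set the machine address for this environment, then start the dataflow.
# Variable name and IP are examples only.
export IP_MACHINE_1=192.168.1.42
dora start dataflow.yml
```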
I truly believe that dora should be ssh-free for the most part otherwise the barrier to entry is going to be very high.
It really sounds like having some uncommitted env file and/or having to commit IP addresses that need to be somehow protected is easier to deal with than trying to abstract away the network layer.
But how does this avoid tunneling/custom routing? If the machine has no public IP, how can you reach it?
Having a simple `dora daemon` on a robot means that anyone can connect to the robot without having to ssh into it, and it needs zero hand configuration.
Thanks for clarifying your use case. I understand that you're using a robot that you want to reboot repeatedly and you want to avoid doing manual work on every reboot, right?
I don't think that we have to change the whole design of dora for this. It would probably be enough to have some kind of "remote configuration" feature. For example, something like this:
* Add a `dora daemon --listen-for-remote-config <port>` argument (the arg name is only a placeholder)
* When started with this argument, the daemon will listen on the specified port for connections from the coordinator
* We add a coordinator config that could look something like this:
  ```yaml
  machines:
    - machine_1: dynamic
    - machine_2: 192.168.0.57:8080
    - machine_3: 192.168.0.64:1234
  ```
* When starting the coordinator, we pass this config file as argument
* Machines set to `dynamic` are treated like before (i.e. daemon initiates the connection to the coordinator)
* For machines set to IP addresses, the coordinator sends an `init_with_config` message to the specified IP/port.
* This message contains the machine ID.
* Upon receiving this config message, the daemon uses the received values as `--machine-id` and `--coordinator-address` arguments.

This way, you could add the `dora daemon --listen-for-remote-config <port>` command to your startup commands on the robot and you would never need to touch it on a reboot. The dataflow.yml file also remains unchanged and independent of the local network setup. The daemon IP addresses are specified in a new coordinator config file that you can apply to multiple dataflows. And the changes to dora are minimal, so we can implement this quickly. What do you think?
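A usage sketch of this proposal (the `--listen-for-remote-config` flag is the placeholder from the list above, and the coordinator flag and config file name are assumptions):

```bash
# On each robot, as a startup command: wait for the coordinator to push the configuration
dora daemon --listen-for-remote-config 9000

# On the coordinator machine: pass the machines config file
dora coordinator --machines machines.yaml
```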
This genuinely takes an hour to do on 10 robots, where it could have been instant with IP addresses, and it needs to be done on every start.

I think the part that confused me was that it needs to be "done on every start". That's because you want to completely reboot your robot in between runs, am I understanding this right? Because for normal systems, you could just leave the coordinator and daemons running and reuse them for the next `dora start`.
Another possible alternative:
Use zenoh for the daemon<->coordinator connection and rely on multicast messages for discovery. This would allow the daemon to send some kind of register message to the whole local network, which the coordinator could listen for. Then the coordinator could assign a machine ID to the daemon. This way, you would not need to specify the `--machine-id` and `--coordinator-addr` arguments either, without requiring the extra `--listen-for-remote-config` argument.
For multi-network deployments (e.g. cloud), you would still need to define some zenoh router when starting the `dora daemon`. However, this could be part of an env variable that you set only once when you set up your cluster.
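A sketch of that, using a hypothetical variable name and router endpoint (neither exists today):

```bash
# Set once when the cluster is set up; the daemon would read it on every start.
export DORA_ZENOH_ROUTER=tcp/203.0.113.10:7447
dora daemon
```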
It really sounds like having some uncommitted env file and/or having to commit IP addresses that need to be somehow protected is easier to deal with than trying to abstract away the network layer.
But how does this avoid tunneling/custom routing? If the machine has no public IP, how can you reach it?
The idea is that we don't want to build an abstraction layer that connects daemons and exposes something that might not work.
If the IP is something like 127.0.0.1, it is explicit that this is not going to work.
The thing is that we have to let the user figure out how they are going to route the IPs, and not "put some machine ID, connect to a public coordinator, and hope it works" when we actually have no idea whether it can work or not.
I don't think that we have to change the whole design of dora for this. It would probably be enough to have some kind of "remote configuration" feature. For example, something like this:
* Add a `dora daemon --listen-for-remote-config <port>` argument (the arg name is only a placeholder)
* When started with this argument, the daemon will listen on the specified port for connections from the coordinator
* We add a coordinator config that could look something like this:
  ```yaml
  machines:
    - machine_1: dynamic
    - machine_2: 192.168.0.57:8080
    - machine_3: 192.168.0.64:1234
  ```
* When starting the coordinator, we pass this config file as argument
* Machines set to `dynamic` are treated like before (i.e. daemon initiates the connection to the coordinator)
* For machines set to IP addresses, the coordinator sends an `init_with_config` message to the specified IP/port.
* This message contains the machine ID.
* Upon receiving this config message, the daemon uses the received values as `--machine-id` and `--coordinator-address` arguments.
This way, you could add the `dora daemon --listen-for-remote-config <port>` command to your startup commands on the robot and you would never need to touch it on a reboot. The dataflow.yml file also remains unchanged and independent of the local network setup. The daemon IP addresses are specified in a new coordinator config file that you can apply to multiple dataflows. And the changes to dora are minimal, so we can implement this quickly. What do you think?
I'm sorry, but there are already so many steps, and we want to add an additional 4.
So the workflow is going to be:
It is simply impossible for me to see this being consistently reliable, while we could have just:

- dora daemon connect to other daemons using direct addresses
- dora daemon start dataflow

And this does not even resolve the problem that we're hiding the risk of daemons not connecting.
Another possible alternative:
Use zenoh for the daemon<->coordinator connection and rely on multicast messages for discovery. This would allow the daemon to send some kind of register message to the whole local network, which the coordinator could listen for. Then the coordinator could assign a machine ID to the daemon. This way, you would not need to specify the `--machine-id` and `--coordinator-addr` arguments either, without requiring the extra `--listen-for-remote-config` argument.

For multi-network deployments (e.g. cloud), you would still need to define some zenoh router when starting the `dora daemon`. However, this could be part of an env variable that you set only once when you set up your cluster.
This sounds really complicated, while most of the time you can easily find the IP address of the robot you want to connect to. I genuinely don't think that finding an IP address is hard, compared to setting up a whole zenoh cluster.
This is something that we have to work on and improve, but I don't think that changing the whole design of dora brings us to this goal any faster.
I mean, we have two well-identified issues:
Both have been opened for close to 2 months.
I really think that there is a limit to the complexity we can handle, and I don't see in our discussion how we can make it work and improve given the development speed we are able to sustain.
It is simply impossible for me to see this being consistently reliable, while we could have just:
- dora daemon connect to other daemons using direct addresses
But how would that work in detail? The daemons don't know the IP addresses of each other. The only way I see is that we use the IPs specified in the `dataflow.yml` file, which is only known when a dataflow is started. So each of the daemons would need to listen on some public IP/port for incoming dataflow YAML files and then do the following:
- dora daemon start dataflow
This sounds like you want to remove both the coordinator and the CLI? And that the `dora daemon start` command then performs all the coordinator tasks of coordinating the other daemons and collecting logs?
I fear that such a drastic change of the design would result in a lot of additional work. I think that there are faster and easier ways to solve the mentioned issues.
This is something that we have to work on and improve, but I don't think that changing the whole design of dora brings us to this goal any faster.
I mean, we have two well-identified issues: [...] Both have been opened for close to 2 months.
I really think that there is a limit to the complexity we can handle, and I don't see in our discussion how we can make it work and improve given the development speed we are able to sustain.
I'm not sure how these issues are related?
I'm aware that we have many many things on our plate. That's why I think that we don't have the capacity to rearchitect dora completely. Redesigning a daemon communication mechanism that works in a distributed way without a coordinator sounds like a lot of work and like an additional source of complexity. A centralized coordinator that has full control of all the daemons makes the design much less complex in my opinion.
Let's maybe take a step back. One of our initial design goals was that the dataflows could be controlled from computers that are not part of the dataflow. For example, that you could use the dora CLI on your laptop to control a dataflow that is running on some cloud machines. To be able to support this use case, we need some entity that the CLI can connect to. That was the motivation for creating the dora coordinator.
Assuming a simple network, all the nodes could directly communicate with each other using TCP messages or shared memory. This doesn't require a daemon, but the daemon makes things easier for the nodes. Without it, each node would need to be aware of the whole dataflow topology and maintain its own connections to other nodes.
The difficult part is to create all of these network connections, especially if the network topology is more complex. It doesn't really matter which entity creates these connections. So I'm not sure how removing the coordinator would simplify things.
We can of course simplify things if we require simple network topologies. If we assume that the CLI always runs on the same machine as the coordinator and one of the daemons, and that the remote daemons are in the same network and reachable by everyone through the same IP, we can of course remove a lot of complexity. But we also lose functionality and are no longer able to support certain use cases.
I feel like there are a lot of valid points and important things in this issue thread, but it's becoming difficult to follow. I think it would be a good idea to first collect the different problems, pain points, and use cases we want to improve. Ideally, we then split them into separate discussions. It's probably a good idea to avoid suggesting specific changes in the initial post.
Then we can propose potential solutions as comments and discuss them. I think that we would achieve a more productive discussion this way.
Edit: I started creating some discussions for the problems and usability issues mentioned in this thread:
I also added proposals for solutions to each discussion. Of course feel free to add alternative proposals!
The coordinator and the daemon running as background processes are extremely difficult to maintain and create a lot of hanging issues, while bringing hard-to-quantify value, as we're almost always running a single dataflow.
Background processes are also difficult to integrate with systemctl, and therefore dora is nearly impossible to start on boot. Having background processes makes it nearly impossible to keep up with environment variables, which can change a lot in a distributed setup.
I think the only moment we need a background process is for remote daemons, which need to be able to connect to other daemons when spawning a dataflow.
What needs to change
- Refactor our CLI so that `dora daemon --start-dataflow` becomes the default `dora start` behaviour, and make the dora-daemon the default process that is started when running the CLI (sketched below). This will remove a layer of complexity of having the daemon running in the background.
- Remote daemons can then be put in systemctl (or the OS-specific service manager) to spawn automatically and restart on failure.
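A sketch of the day-to-day usage this proposal is aiming for (the behaviour described here is the proposal, not the current CLI):

```bash
# Proposed default: the daemon runs in-process, no background coordinator/daemon, no `dora up`
dora start dataflow.yml
# > node stdout, hardware errors, and network errors stream directly to this terminal
```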
What this will bring
This is going to make it a lot easier to embed dora in other applications such as Python and Rust, as well as make the daemon more easily configurable as a single web server.
Changelog
This should not be a breaking change, except for the fact that we will not use `dora up` anymore.