dora-rs / dora

DORA (Dataflow-Oriented Robotic Architecture) is middleware designed to streamline and simplify the creation of AI-based robotic applications. It offers low latency, composable, and distributed dataflow capabilities. Applications are modeled as directed graphs, also referred to as pipelines.
https://dora-rs.ai
Apache License 2.0

Dora-daemon and dora-coordinator are background processes that do not see CLI environment variables #363

Open haixuanTao opened 1 year ago

haixuanTao commented 1 year ago

Describe the bug Sometimes an environment variable is updated before using the dora CLI. But because the coordinator and the daemon are background processes, they do not see the updated environment variable. This can cause confusion.

To Reproduce Steps to reproduce the behavior:

  1. Start the dora daemon: dora up
  2. Change an env variable
  3. Start a new dataflow: dora start dataflow.yaml
  4. Expect the env variable to be the latest one, but see the previous one or an empty one
  5. Stop the dataflow: dora stop
  6. Tear everything down: dora destroy
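The behavior described here is standard POSIX semantics, not specific to dora: a background process snapshots the parent's environment at spawn time, and later changes in the shell are invisible to it. A minimal sketch, independent of dora:

```shell
#!/bin/sh
export MY_VAR=old
# Spawn a daemon-like background process; it inherits the environment *now*.
( sleep 1; echo "background process sees: $MY_VAR" ) &
export MY_VAR=new          # update the variable after the spawn
wait                       # prints "background process sees: old"
```

The background subshell keeps `MY_VAR=old` even though the parent shell has moved on to `new`, which is exactly what happens to a daemon started by `dora up` before the variable changed.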

Expected behavior I would expect the CLI to pass the latest value to the daemon and coordinator.

phil-opp commented 1 year ago

I would expect the cli to pass the latest value to the daemon and coordinator.

We need to keep deployment to remote machines in mind. We probably don't want the CLI to pass its local env variables to the remote machine.

phil-opp commented 1 year ago

We already support setting environment variables through the dataflow yaml file, right? Would this be a possible alternative for your use case?
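For reference, per-node environment variables in the dataflow YAML could look like the sketch below. The `env` key name and value format are my assumption; check the dataflow specification for the exact syntax:

```yaml
nodes:
  - id: camera
    path: camera.py
    env:                 # assumed per-node env key — verify against the dataflow spec
      CAMERA_INDEX: "0"
      LOG_LEVEL: debug
```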

haixuanTao commented 1 year ago

So this is the thing: when we use environment variables within the YAML graph, it uses the old env variables.

If we decide that environment variables should not be shared in a distributed environment, which I can understand, maybe we should change our approach to using variables.

haixuanTao commented 1 year ago

One alternative would be to add a --env argument that enables users to pass env variables when running dataflows, in the manner of Docker
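For comparison, Docker's pattern is `docker run --env KEY=value …`; a dora equivalent might look like the following. Note this flag is the proposal being discussed here, not an existing dora option:

```console
# Hypothetical syntax — the --env flag does not exist yet
dora start dataflow.yaml --env MODEL_PATH=/models/detector.onnx --env LOG_LEVEL=debug
```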

phil-opp commented 1 year ago

Ah, understood! So I guess the main issue is the env expansion that we introduced in https://github.com/dora-rs/dora/commit/933dadc68bccc9e3fddd8030ee7c9fc68260ef7c. For some cases (such as the $HOME example in the commit description) we want the expansion to happen on the target machine, but in other cases we want it to happen on the CLI machine.

I'm fine with changing the behavior to do the env expansion on the CLI machine instead, but I fear that there is always some chance of confusion, depending on what you expect. Not sure what we can do to avoid this, though, aside from documenting it clearly.
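To make the ambiguity concrete, consider a node path that references `$HOME` (a hypothetical example): with expansion on the target machine it resolves against the daemon's home directory, while with expansion on the CLI machine it resolves against the local user's:

```yaml
nodes:
  - id: recorder
    path: $HOME/dora/recorder.py  # expanded where: CLI machine or daemon machine?
```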

Hennzau commented 2 months ago

🚀 Proposal: Enhancing Dora for Better Usability

Hello! I believe this unresolved issue is crucial for the better adoption of Dora, especially among less experienced users. At first glance, one might expect that our nodes and the Daemon would have access to the CLI's environment variables. However, it's completely understandable that this could lead to a lot of issues. I think there's a way to find an alternative solution.

🛠️ Current build Process

We already have a build process that can be assigned to nodes, which runs when we execute dora build. This procedure is particularly useful when we need to install requirements or compile a program, as those steps are usually slow.

💡 New Proposal: preprocess Procedure

It could be interesting to introduce another procedure called preprocess, which would contain a list of scripts to execute when running dora start, before trying to resolve and launch nodes. This would help enhance the environment of our dataflow or specific nodes. For example:

preprocess:
  - id: activate_python
    path: .venv/bin/activate.sh  # Will activate a Python environment
  - id: load_env_var_ext
    path: .my_env_var.sh  # Will export useful custom environment variables

nodes:
  - id: camera
    path: camera.py # will be launched in the Python env, regardless of where the daemon or CLI was spawned

  - id: plot
    path: plot.py # same

🔄 Node-Specific Environment Variables

If we need different environment variables for a particular node, we could add them like this:

preprocess:
  - id: activate_python
    path: .venv/bin/activate.sh
  - id: load_env_var_ext
    path: .my_env_var.sh

nodes:
  - id: camera
    path: camera.py

  - id: plot
    path: plot.py

  - id: weird_script
    preprocess:
      - id: activate_another_python
        path: .another_venv/bin/activate.sh
        path: weird_script.py # will be launched in this other Python env

🌐 Distributed Dataflow Consideration

In the case of a distributed dataflow, absolute paths will need to be specified instead of relative ones, just like with everything else.

Let me know what you think! 😊

haixuanTao commented 2 months ago

So I definitely feel there is something to be done with running commands before launching the full set of nodes.

I think the preprocess idea is really interesting!

I would maybe make a couple of changes.

Maybe what we could do is be slightly more GitHub Actions-like and make it possible to have a multiline shell script, in the likes of:

        run: |
          source /opt/ros/humble/setup.bash 
          cargo run --example cxx-ros2-dataflow --features="ros2-examples"

It would make it a bit more readable, as people don't need to search for the shell script, and also easier to get started with for anyone who has used GitHub Actions before.

Currently we should be able to (or could manage to) do something like this:

      - id: terminal-print
        build: cargo build -p terminal-print
        path: shell
        args: |
          source /opt/ros/humble/setup.bash 
          cargo run --example cxx-ros2-dataflow --features="ros2-examples"

What do you think of having something like this?

P.S.: the preprocess should probably be per node, as that makes dependent steps easier to track and avoids conflicts.

Hennzau commented 2 months ago

I'm okay with using multi-line commands inside the run parameter, since filling both the path and args parameters may not be easy :)

haixuanTao commented 2 months ago

In all honesty, I'm drawn to using run as well, as it is closer to GitHub Actions, but I think it makes it harder to maintain since it means adding another key, and this will not be backward compatible. Also, there might be errors when both path and run are set.

Hennzau commented 2 months ago

Ok well, that's a good point. So we should stick with this pattern (path + args) and document it, until there is a real need for something different.

haixuanTao commented 2 months ago

I think one of the issues is that if we only have run, we don't really have a way to control the environment in which the shell itself is spawned. Is it using bash, zsh, PowerShell, or cmd?

We could manage those things with something like:

      - id: terminal-print
        build: cargo build -p terminal-print
        path: /bin/zsh
        args: |
          source /opt/ros/humble/setup.bash 
          cargo run --example cxx-ros2-dataflow --features="ros2-examples"