SwissDataScienceCenter / renku-python

A Python library for the Renku collaborative data science platform.
https://renku-python.readthedocs.io/
Apache License 2.0

renku run fails when using pipes in commands #1348

Closed robinengler closed 3 years ago

robinengler commented 5 years ago

Hi renku team, I tried a couple of very simple bash commands with renku to test it. It seems that whenever a command contains a bash pipe, renku run fails and complains that the repository is "dirty", even though git status reports a clean working tree.

Here is an example:

# first we create a new directory to store the output:
mkdir data/outputs && git add "data/outputs/" && git commit -m "created outputs directory."
# make sure the repo is clean (no uncommitted changes)
git status
On branch master
Your branch is ahead of 'origin/master' by 4 commits.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean
# Try to run a simple bash command with renku run:
renku run cat data/testDataset/continentSize.txt | sort -rn -k2 > data/outputs/continentSize_sorted_again.txt
Error: The repository is dirty. Please use the "git" command to clean it.

On branch master
Your branch is ahead of 'origin/master' by 4 commits.
  (use "git push" to publish your local commits)

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        data/outputs/

nothing added to commit but untracked files present (use "git add" to track)

Once you have added the untracked files, commit them with "git commit".

If I now try an almost identical command, but without a pipe in it, it works as expected.

# almost same command without pipe:
renku run sort -nr -k2 data/testDataset/continentSize.txt > data/outputs/continentSize_sorted_again.txt
# ^ this works.

I tried a couple of other commands, and each time there is a pipe in the command, it fails with the same message saying "Error: The repository is dirty." when in fact it's clean.

Here is a very simple reproducible example showing the problem:

renku run echo "foo bar" | sed 's@bar@foo@' > newFile.txt

Am I doing something wrong in my commands, or is this behaviour expected? Thanks, Robin

P.S. From my limited testing, brackets in commands also have deleterious effects on renku run: the two commands below fail (with different errors).

renku run ( echo "foo bar" ) > newFile.txt
renku run cat <(echo "foo bar") > newFile.txt

rokroskar commented 5 years ago

Hi @robinengler thanks for reporting this -- unfortunately, pipes are currently not supported in the renku CLI: https://renku-python.readthedocs.io/en/latest/cli.html#detecting-standard-streams

Perhaps @jirikuncar can elaborate/comment on the problems with brackets?

fgeorgatos commented 5 years ago

@robinengler : as a workaround, please tell us if any of the following works:

The point is that the exec of the process must capture the execution in full, from start to end.
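
To illustrate why a pipe escapes that capture: the calling shell splits the line at | and > before renku run ever starts, so renku only receives the words in front of the first pipe. Below is a minimal sketch of this, where fake_run is a hypothetical stand-in for renku run (it is not part of renku) that simply logs the arguments it receives and then executes them:

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical stand-in for `renku run`: log the arguments this process
# actually receives, then execute them as a command.
fake_run() {
    echo "args: $*" > "$workdir/seen.log"
    "$@"
}

# The shell, not fake_run, sets up the pipe and the redirection.
fake_run echo "foo bar" | sed 's@bar@foo@' > "$workdir/out.txt"

cat "$workdir/seen.log"   # prints: args: echo foo bar
cat "$workdir/out.txt"    # prints: foo foo
```

The log shows that the sed step and the output file never reach the wrapped command, which is consistent with renku not being able to record them.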

robinengler commented 5 years ago

Thanks @rokroskar and @fgeorgatos for the reply and the suggested workarounds. I have tried the second option, where I echo the command and then pipe the whole thing into renku run. Here is an example of what I tried:

echo "sort -k2 -n data/processed/joined_batch1.txt | grep -v FAIL" | renku run bash > data/processed/sorted_batch1.txt

The command now works, but I think that this still breaks renku's tracking of the workflow, because if I now look at the .cwl file produced for this step, it's kind of empty:

cat .renku/workflow/9e4848d238dd4433a5ea60be6915a7a2_bash.cwl

arguments: []
baseCommand:
- bash
class: CommandLineTool
cwlVersion: v1.0
hints: []
inputs: {}
outputs:
  output_stdout:
    streamable: false
    type: stdout
permanentFailCodes: []
requirements: []
stdout: data/processed/sorted_batch1.txt
successCodes: []
temporaryFailCodes: []

So it looks like the actual command used to process the file was not recorded. This seems to be confirmed by the fact that when I try to regenerate the output file with renku rerun data/processed/sorted_batch1.txt, the output file is now empty (so renku knows it should create this output file, but doesn't know how to create it).
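
A likely explanation, assuming renku rerun replays only what the .cwl file records (baseCommand bash, no arguments, no stdin): the command text reached bash through stdin, and stdin is not part of the recorded step. A bash that starts with nothing to read exits immediately and leaves an empty file behind. A small sketch of that behaviour, independent of renku and using a made-up command:

```shell
#!/bin/sh
out=$(mktemp)

# Original run: the command text arrives on bash's stdin.
echo "echo hello | tr a-z A-Z" | bash > "$out"
first_run=$(cat "$out")

# Replay using only what was recorded: `bash`, no arguments, empty stdin.
bash < /dev/null > "$out"
if [ -s "$out" ]; then replay="non-empty"; else replay="empty"; fi

echo "first run: $first_run"   # prints: first run: HELLO
echo "replay: $replay"         # prints: replay: empty
```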

From this I think that if one wants to use pipes in renku, then embedding the bash commands in a shell script is the way to go :-)

robinengler commented 5 years ago

I now also tried the other option, where the command (that contains a pipe) is embedded into a shell script. It seems best to write the shell script so that both the input and output files are passed to it as arguments (rather than, say, auto-detected by the script), so that renku can properly recognize the input and output files.

It is then possible to run the script with renku: renku run notebooks/sortFile.sh data/processed/joined_batch1.txt data/processed/sorted_batch1.txt
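
The script itself is not shown in this thread. A minimal sketch of what such a notebooks/sortFile.sh could look like, reusing the sort | grep command from the earlier comment (the function name and the error handling are illustrative assumptions, not the original script):

```shell
#!/bin/sh
# Hypothetical sketch of notebooks/sortFile.sh.
# Usage: notebooks/sortFile.sh INPUT_FILE OUTPUT_FILE
set -eu

sort_and_filter() {
    # $1 = input file, $2 = output file. Both are explicit arguments so
    # that they appear on the command line that renku records.
    if [ ! -r "$1" ]; then
        echo "cannot read input file: $1" >&2
        return 1
    fi
    # The pipe lives inside the script. `|| true` keeps grep's exit
    # status 1 (no lines left after filtering) from counting as a failure.
    sort -k2 -n "$1" | { grep -v FAIL || true; } > "$2"
}

# Only run when called with exactly two arguments.
if [ "$#" -eq 2 ]; then
    sort_and_filter "$1" "$2"
fi
```

The script needs to be executable (chmod +x notebooks/sortFile.sh) before it can be called as above.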

The command works well, and looking at the .cwl file, the input and output files are properly identified. There is still a problem, though, when I try to do a "renku rerun". For some reason, renku does not find the script anymore:

renku rerun data/processed/sorted_batch1.txt

/home/jovyan/.local/pipx/venvs/renku/lib/python3.6/site-packages/renku/models/provenance/activities.py:597: YAMLLoadWarning:
  *** Calling yaml.load() without Loader=... is deprecated.
  *** The default Loader is unsafe.
  *** Please read https://msg.pyyaml.org/load for full details.
  process = CWLClass.from_cwl(yaml.load(data))
/home/jovyan/.local/pipx/venvs/renku/lib/python3.6/site-packages/renku/models/provenance/activities.py:597: YAMLLoadWarning:
  *** Calling yaml.load() without Loader=... is deprecated.
  *** The default Loader is unsafe.
  *** Please read https://msg.pyyaml.org/load for full details.
  process = CWLClass.from_cwl(yaml.load(data))
Resolved '.renku/workflow/b09b7815a5694042b5b9464913a6ed2c.cwl' to 'file:///home/jovyan/testproject2/.renku/workflow/b09b7815a5694042b5b9464913a6ed2c.cwl'
[workflow ] start
[workflow ] starting step step_1
[step step_1] start
[job step_1] /tmp/tmpl6mx8lzt$ join \
    --check-order \
    --header \
    '-t ' \
    -1 \
    1 \
    -2 \
    1 \
    /tmp/tmpdgol84z4/stgfe8f6c0e-4777-4adb-af18-a33ce08ad594/testData_batch1_a.txt \
    /tmp/tmpdgol84z4/stg7c22c132-c9a5-4a0e-b0e0-8daaee717837/testData_batch1_b.txt > /tmp/tmpl6mx8lzt/data/processed/joined_batch1.txt
[job step_1] completed success
[step step_1] completed success
[workflow ] starting step step_2
[step step_2] start
[job step_2] /tmp/tmpccu9tqsg$ notebooks/sortFile.sh \
    /tmp/tmpiwiys_qz/stg3ab66fd2-e02f-40e0-8bc0-b925d28c7e1d/joined_batch1.txt \
    data/processed/sorted_batch1.txt
'notebooks/sortFile.sh' not found
[job step_2] completed permanentFail
[step step_2] Output is missing expected field file:///home/jovyan/testproject2/.renku/workflow/b09b7815a5694042b5b9464913a6ed2c.cwl#step_2/output_0
[step step_2] completed permanentFail
[workflow ] completed permanentFail
Ahhhhhhhh! You have found a bug. 🐞

rokroskar commented 5 years ago

@robinengler apologies for letting this issue sit idle for so long - the problem you describe (where the executable is also a dependency) has recently been resolved (see https://github.com/SwissDataScienceCenter/renku-python/issues/495) - could you try updating your renku version and rerun the command?

If you are running this inside a Jupyter notebook on renkulab, you can update it either by building the image from renku/singleuser:latest or by upgrading it inside the running notebook server (see https://renku.readthedocs.io/en/latest/user/cli-installation.html#upgrading). If you are using it on your own machine, the upgrade process depends on how you installed it; let me know if you need help.

Panaetius commented 3 years ago

Closed due to inactivity, and because we don't plan on supporting pipes (there is also no clear direction on how we could support them).