SwissDataScienceCenter / renku-python

A Python library for the Renku collaborative data science platform.
https://renku-python.readthedocs.io/
Apache License 2.0

Support for Cloud Storage #3260

Open m-alisafaee opened 1 year ago

m-alisafaee commented 1 year ago

An epic to group stories for supporting cloud storage providers in Renku. The goal is to make data in cloud storage (e.g. AWS S3, Azure, ...) more accessible from the renku CLI. See https://github.com/SwissDataScienceCenter/renku-design-docs/blob/main/rfcs/007-external-data-handling/007-external-data-handling.md for a complete description.

m-alisafaee commented 1 year ago

Rok's feedback regarding S3 UX:


I kicked the tires a bit on the S3 dataset functionality - it’s starting to look quite nice. I found a few rough edges - here’s a summary:

What I did: created a bucket with a subdirectory and added a bunch of files to it. I wanted to make a renku dataset out of this to track data usage.

Traceback (most recent call last):
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/infrastructure/repository.py", line 1757, in _run_git_command
    return getattr(repository.git, command)(*args, **kwargs)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/git/cmd.py", line 639, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/git/cmd.py", line 1184, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/git/cmd.py", line 984, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git add data/isolated-mw-models/12M_hr/12M_hr.00010.gz
  stderr: 'fatal: pathspec 'data/isolated-mw-models/12M_hr/12M_hr.00010.gz' did not match any files'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/ui/cli/exception_handler.py", line 92, in main
    return super().main(*args, **kwargs)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/ui/cli/dataset.py", line 901, in unlink
    file_unlink_command().with_communicator(communicator).build().execute(
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/command/command_builder/command.py", line 262, in execute
    hook(self, context, result, *args, **kwargs)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/command/command_builder/command.py", line 198, in _post_hook
    raise result.error
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/command/command_builder/command.py", line 248, in execute
    output = self._operation(*args, **kwargs)  # type: ignore
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/core/dataset/dataset.py", line 359, in file_unlink
    repository.add(path_file)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/infrastructure/repository.py", line 214, in add
    self.run_git_command("add", batch, force=force)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/infrastructure/repository.py", line 391, in run_git_command
    return _run_git_command(self._repository, command, *args, **kwargs)
  File "/Users/rok/Projects/renku-test-projects/isolated-disks/.venv/lib/python3.10/site-packages/renku/infrastructure/repository.py", line 1759, in _run_git_command
    raise errors.GitCommandError(
renku.core.errors.GitCommandError: Git command failed: Cmd('git') failed due to: exit code(128)
  cmdline: git add data/isolated-mw-models/12M_hr/12M_hr.00010.gz
  stderr: 'fatal: pathspec 'data/isolated-mw-models/12M_hr/12M_hr.00010.gz' did not match any files'



FYI this is the dataset: https://renkulab.io/projects/rok.roskar/isolated-disks/datasets/isolated-mw-models/

You can’t really pull the data, because the SWITCH S3 service doesn’t make it easy to make all objects in a public bucket… public :smile:
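
For reference, the second traceback above shows `file_unlink` calling `repository.add(path_file)` on a path that git cannot find, which is plausible for a dataset whose files live in S3 and were never pulled into the working tree. Below is a minimal sketch of one possible guard, not renku's actual code or fix: `split_by_presence` is a hypothetical helper that separates repo-relative paths that exist locally from those that do not, so that only the former would be handed to `git add`.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_by_presence(root: Path, paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split repo-relative paths into (present, missing) relative to `root`.

    Hypothetical helper for illustration only; it is not part of renku.
    """
    present: List[str] = []
    missing: List[str] = []
    for relative in paths:
        candidate = root / relative
        # is_symlink() keeps dangling symlinks, which git can still stage.
        if candidate.exists() or candidate.is_symlink():
            present.append(relative)
        else:
            missing.append(relative)
    return present, missing
```

A caller would stage only the `present` list and report or skip the `missing` one. Note that this is deliberately conservative: a tracked file that was deleted locally would also land in `missing`, so staging deletions would need separate handling.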