DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Bucket mount names ("Duplicate action name") #154

Open carbocation opened 5 years ago

carbocation commented 5 years ago

I'm writing a script that reads from two (potentially) different gcsfuse-mounted sources. In my testbed, they both happen to be in the same bucket, but in reality they won't be. So I tried to --mount the same bucket twice under different names. However, it seems that the mount's name is derived from the bucket rather than from the alias, so doing this fails. Maybe this is intended, but it doesn't seem desirable.

2019-04-29 09:24:17.941173: Exception HttpError: <HttpError 400 when requesting https://genomics.googleapis.com/v2alpha1/pipelines:run?alt=json returned "Error: validating pipeline: duplicate action name "mount-ukbb_v2"">
Traceback (most recent call last):
  File "/home/james/anaconda2/bin/dsub", line 11, in <module>
    load_entry_point('dsub==0.3.1', 'console_scripts', 'dsub')()
  File "/home/james/anaconda2/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 956, in main
    dsub_main(prog, argv)
  File "/home/james/anaconda2/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 945, in dsub_main
    launched_job = run_main(args)
  File "/home/james/anaconda2/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 1028, in run_main
    unique_job_id=args.unique_job_id)
  File "/home/james/anaconda2/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/commands/dsub.py", line 1117, in run
    launched_job = provider.submit_job(job_descriptor, skip)
  File "/home/james/anaconda2/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/providers/google_v2.py", line 915, in submit_job
    task_id = self._submit_pipeline(request)
  File "/home/james/anaconda2/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/providers/google_v2.py", line 866, in _submit_pipeline
    self._service.pipelines().run(body=request))
  File "build/bdist.linux-x86_64/egg/retrying.py", line 49, in wrapped_f
  File "build/bdist.linux-x86_64/egg/retrying.py", line 206, in call
  File "build/bdist.linux-x86_64/egg/retrying.py", line 247, in get
  File "build/bdist.linux-x86_64/egg/retrying.py", line 200, in call
  File "build/bdist.linux-x86_64/egg/retrying.py", line 49, in wrapped_f
  File "build/bdist.linux-x86_64/egg/retrying.py", line 206, in call
  File "build/bdist.linux-x86_64/egg/retrying.py", line 247, in get
  File "build/bdist.linux-x86_64/egg/retrying.py", line 200, in call
  File "/home/james/anaconda2/lib/python2.7/site-packages/dsub-0.3.1-py2.7.egg/dsub/providers/google_base.py", line 593, in execute
    raise exception
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://genomics.googleapis.com/v2alpha1/pipelines:run?alt=json returned "Error: validating pipeline: duplicate action name "mount-ukbb_v2"">
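
Until the API accepts this, a pre-flight check can catch the collision before the job is submitted. The sketch below assumes, based only on the "mount-ukbb_v2" name in the error above, that the action name is built from the bucket name; the function name and the SRC_A/SRC_B aliases are hypothetical and not part of dsub.

```python
from collections import defaultdict
from urllib.parse import urlparse


def find_duplicate_bucket_mounts(mounts):
    """Return {bucket: [aliases]} for buckets requested under more than one alias.

    `mounts` is a list of "ALIAS=gs://bucket" strings, the form passed to --mount.
    """
    aliases_by_bucket = defaultdict(list)
    for spec in mounts:
        alias, _, url = spec.partition("=")
        parsed = urlparse(url)
        if parsed.scheme != "gs" or not parsed.netloc:
            raise ValueError(f"not a gs:// bucket URL: {spec!r}")
        # The bucket is the host part of the URL; any path after it is ignored.
        aliases_by_bucket[parsed.netloc].append(alias)
    # Keep only the buckets that were requested more than once.
    return {b: a for b, a in aliases_by_bucket.items() if len(a) > 1}


# Hypothetical aliases; the bucket name is the one from the error message.
duplicates = find_duplicate_bucket_mounts(
    ["SRC_A=gs://ukbb_v2", "SRC_B=gs://ukbb_v2"]
)
for bucket, aliases in duplicates.items():
    print(f"bucket {bucket} is mounted more than once: {', '.join(aliases)}")
```

A non-empty result means the request would likely be rejected with the "duplicate action name" error shown above.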

mbookman commented 5 years ago

Hi @carbocation !

What is the use case for requesting that the same bucket be mounted twice?

I'm concerned that gcsfuse is already a fragile enough solution that having a bucket mounted twice within a single dsub task may be setting yourself up for a bad day.

Is the typical Input and Output File Handling insufficient for your use case?

Thanks.

carbocation commented 5 years ago

I tried to describe the use case in the first post, but I can add more color if I did not convey the use case very well. Basically, this is not a "need," it just seems like something that should be possible and it was surprising, as a user, that it didn't work. If it's not possible, or increases risk, then no worries.

mbookman commented 5 years ago

Got it. Let's leave this open, and we will document that buckets should only be mounted once and that people should use --env variables to point to specific locations inside of a mount, should bucket mounting be the actual solution they need.

Thanks!
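
A minimal sketch of that suggestion: mount the bucket once and pass each location as an --env variable holding a path relative to the bucket root. The helper name, the UKBB alias, and the GENOTYPE_DIR/PHENOTYPE_DIR variables are hypothetical; only the --mount and --env flags and the ukbb_v2 bucket come from this thread. It also assumes dsub sets the mount alias as an environment variable containing the mount path inside the container, so a task script could join the two as "${UKBB}/${GENOTYPE_DIR}".

```python
import shlex


def build_mount_args(mount_alias, bucket, locations):
    """Build dsub arguments that mount `bucket` once and pass sub-paths via --env.

    `locations` maps an environment variable name to a path relative to the
    bucket root.
    """
    args = ["--mount", f"{mount_alias}=gs://{bucket}"]
    # Sort for a stable argument order; strip slashes so the paths stay relative.
    for name, rel_path in sorted(locations.items()):
        args += ["--env", f"{name}={rel_path.strip('/')}"]
    return args


# Hypothetical alias and variable names, for illustration only.
args = build_mount_args(
    "UKBB",
    "ukbb_v2",
    {"GENOTYPE_DIR": "genotypes", "PHENOTYPE_DIR": "phenotypes/"},
)
# Print the arguments in a form that can be appended to a dsub command line.
print(" ".join(shlex.quote(a) for a in args))
```

If the two sources later live in different buckets, the second source becomes its own --mount with a different alias, and the task script only needs to change which mount variable it prefixes.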