DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.

Using Hail with dsub #265

Open · buutrg opened this issue 1 year ago

buutrg commented 1 year ago

Hi all, I am trying to use Hail via dsub to extract a subset of variants on the All of Us server. I think this is the most relevant image I can use: https://github.com/DataBiosphere/terra-docker/tree/master/terra-jupyter-hail
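For context, the kind of dsub invocation involved looks roughly like this (the image tag, project, bucket paths, and script name are placeholders, not the exact values I'm using):

```sh
# Sketch of running a Hail script through dsub with the terra-jupyter-hail image.
# Image tag, project, buckets, and script name are placeholders.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --regions us-central1 \
  --image us.gcr.io/broad-dsp-gcr-public/terra-jupyter-hail:latest \
  --logging gs://my-bucket/logs/ \
  --output OUT=gs://my-bucket/subset/variants.tsv \
  --script extract_variants.py \
  --wait
```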

But it results in an error that pyspark is not found. I tried to install pyspark from https://dlcdn.apache.org/spark/spark-3.1.3/spark-3.1.3-bin-hadoop3.tgz. Now it says `No FileSystem for scheme "gs"`.

Do you have any idea how to use Hail via dsub? Your help is really appreciated!

wnojopra commented 1 year ago

It sounds like you're running dsub on the AoU platform. From this AoU support article: "Within the Researcher Workbench, internet access is restricted from batch VMs. With the exception of Google APIs, VMs are unable to send or receive network traffic, including files, APIs, or packages/code." This isn't specifically a dsub issue; please reach out to AoU support for help installing pyspark in that image.
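As a side note (a sketch, not tested on AoU): the `No FileSystem for scheme "gs"` error from a plain Apache Spark download usually means the Hadoop GCS connector isn't on Spark's classpath. Since the connector jar is published in a public GCS bucket and GCS is reachable from batch VMs, something like the following may help once pyspark itself is installed (the `SPARK_HOME` path below is an assumption about where you unpacked Spark):

```sh
# Sketch: add the Hadoop GCS connector to a manually unpacked Spark so that
# gs:// paths resolve. SPARK_HOME and the connector jar version are assumptions.
export SPARK_HOME=/opt/spark-3.1.3-bin-hadoop3
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar "${SPARK_HOME}/jars/"
cat >> "${SPARK_HOME}/conf/spark-defaults.conf" <<'EOF'
spark.hadoop.fs.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.AbstractFileSystem.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
EOF
```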