adonoho closed this issue 1 year ago.
Hi!
You can use send_private_envs to handle environment variables, and possibly the Dask upload_file method to get your license file onto the workers. In our default configuration, all workers have access to the public internet; would that be enough to talk to the license server?
Cheers!
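P.S. A minimal sketch of those two calls, in case it helps (the cluster size, environment variable name, and license-file path below are placeholders, not taken from your setup):

import coiled
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=4)
client = Client(cluster)

# Ship a secret environment variable to the scheduler and workers
# without its value being logged.
cluster.send_private_envs({"SOLVER_LICENSE_SERVER": "port@license.example.com"})

# Copy a local license file to every worker.
client.upload_file("solver.lic")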
Hi, Ms. Hughes, I presume,
Translating systems-engineer speak to scientific-programmer speak is sometimes difficult. Allow me to compare notes. The known good Dask-driven service account that writes to GBQ is (from the console):
pandas-gbq-datasource@hs-deep-lab-donoho.iam.gserviceaccount.com
BigQuery Data Editor
BigQuery Data Owner
BigQuery Data Viewer
BigQuery Job User
BigQuery Metadata Viewer
Your default data-role YAML file is:
title: coiled-data
description: service account attached to cluster instances for data access
stage: GA
includedPermissions:
- logging.logEntries.create
- storage.buckets.get
- storage.buckets.create
- storage.objects.create
- storage.objects.get
- storage.objects.list
- storage.objects.update
Based upon your YAML, I need to add the following to your service account:
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.dataViewer
- roles/bigquery.jobUser
- roles/bigquery.metadataViewer
Then the next issue is: where is Coiled hiding the default credentials at runtime? The Pandas-GBQ library wants me to pass in credentials.
Anon, Andrew
Hi there,
The permissions you are adding look correct based on what you've said.
The credentials question is a bit puzzling: is pandas-gbq prompting you for credentials? When using a Coiled cluster, it should be able to automatically detect and use the credentials of the service account attached to the cluster.
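For example, something along these lines should work on the cluster without passing credentials explicitly, since pandas-gbq falls back to Google's application default credentials, which resolve to the attached service account (the dataset, table, and project names below are placeholders):

import pandas as pd
import pandas_gbq

df = pd.DataFrame({"trial": [1, 2, 3], "loss": [0.9, 0.5, 0.3]})

# No credentials argument: on the cluster, application default credentials
# resolve to the service account attached to the instances.
pandas_gbq.to_gbq(df, "my_dataset.results", project_id="my-project", if_exists="append")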
Cheers
Ms. Hughes,
Thank you.
As a GCP noob, I've done all my Dask work on local servers; I use the credentials as an indicator to use GBQ as my remote storage system. I suspect there is a way for me to query GCP and get the credentials so I can continue using them as my indicator.
Thank you and Anon, Andrew
@adonoho very soon (maybe later today) we'll be adding the ability for Coiled to make use of your locally configured GCP credentials by shipping an OAuth2 token that you can then use in your code. There will be a doc with code samples. We can share a link once that goes public, and maybe that will be an approach that's easier for you to use.
BTW, one of the main reasons to use Coiled is to remain as 'noobish' as possible with respect to GCP. Thank you for helping.
Also, I attended one of your classes about how to use Dask about a year ago. It was quite useful. I suspect a GCP as well as an AWS class might be helpful too.
BTW, one of the main reasons to use Coiled is to remain as 'noobish' as possible with respect to GCP.
Totally agree! Hopefully the new approach will be straightforward enough that if you have code that works locally, you can get it working with the same permissions on the cluster. And if that's not true, I'm happy to work with you so we can get closer to that goal.
I've now built my first pip package because you require it. Coiled's documentation was helpful but incomplete. Primarily, what's missing is documentation for someone moving a module into a package for the first time, like me. I am a research engineer; I've never had to make a PyPI package before, and I suspect most noob users are like me. Hence, spelling out exactly which commands to use would be helpful. Finally, I am a conda/mamba user and have been conditioned by that community to NEVER use pip. Guidance on how I need to change my environment.yml file would be helpful.
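For the record, the change I ended up making is a pip: subsection inside the otherwise conda-managed environment.yml, roughly like this trimmed sketch (the names here are placeholders; my real file appears later in this thread):

name: example-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - git+https://github.com/example/example-repo.git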
The permissions you are adding look correct based on what you've said.
Even though I got these role names from the gcloud docs, it doesn't like them:
gcloud iam roles create coiled_data --project project-id --file coiled-data-role.yaml
Produces:
ERROR: (gcloud.iam.roles.create) INVALID_ARGUMENT: Permission roles/bigquery.dataOwner is not valid.
For the record, here is the YAML file:
title: coiled-data
description: Service account attached to cluster instances for data and BigQuery access.
stage: GA
includedPermissions:
- logging.logEntries.create
- storage.buckets.get
- storage.buckets.create
- storage.objects.create
- storage.objects.get
- storage.objects.list
- storage.objects.update
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.dataViewer
- roles/bigquery.jobUser
- roles/bigquery.metadataViewer
Note: roles/bigquery.dataOwner is the second permission that uses the roles/ prefix.
Second Note: Security is always fun!
As a repackager of GCP services, Coiled is likely to have more leverage in getting Google to support us than this lowly research engineer. Also, this is BigQuery we are talking about, Google's premier noSQL database/data warehouse product. This should not be this hard. Thank you for your help.
Anon, Andrew
On the good news front, I have successfully created the PyPI stuff to incorporate my tiny module into Coiled's package scanning process.
It looks like you're trying to edit the coiled-data role. What you want to do is add those additional roles to the coiled-data@<your-project>.iam.gserviceaccount.com service account. You should be able to find service accounts at https://console.cloud.google.com/iam-admin/serviceaccounts
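For example, a command along these lines (with your project substituted) should grant one of those roles to the service account; repeat it for each role you need:

gcloud projects add-iam-policy-binding <your-project> \
    --member="serviceAccount:coiled-data@<your-project>.iam.gserviceaccount.com" \
    --role="roles/bigquery.jobUser"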
Apologies that this is all a bit confusing. We're hoping that the new functionality to ship user credentials will simplify things.
If you want to try using your local credentials rather than configuring the service account, take a look at the (just released) explanation of using a personal OAuth2 token under https://docs.coiled.io/user_guide/remote-data-access.html#gcp
This requires coiled 0.9.14.
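In case it helps once that's set up: with the shipped OAuth2 token in hand, the Google client libraries can consume it roughly like this (a sketch; how the token is exposed to your code is described in the doc above, and the table and project names are placeholders):

import pandas as pd
import pandas_gbq
from google.oauth2.credentials import Credentials

token = "..."  # the OAuth2 token shipped to the cluster (see the doc linked above)
creds = Credentials(token=token)

df = pd.DataFrame({"x": [1, 2, 3]})
pandas_gbq.to_gbq(df, "my_dataset.results", project_id="my-project", credentials=creds)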
On the good news front, I have successfully created the PyPI stuff to incorporate my tiny module into Coiled's package scanning process.
You actually should not need to follow those steps anymore, unless your project is a bit more complex, as we scan for any importable Python packages and zip those up and put them on the cluster. I'll make a note that we should update those docs.
It looks like you're trying to edit the coiled-data role. What you want to do is add those additional roles to the coiled-data@<your-project>.iam.gserviceaccount.com service account. You should be able to find service accounts at https://console.cloud.google.com/iam-admin/serviceaccounts Apologies that this is all a bit confusing. We're hoping that the new functionality to ship user credentials will simplify things.
I have successfully added my required permissions to the coiled-data service account. In the past, I would just get the JSON data and install it in a secret place. (Yes, I know this isn't terribly secure. This is also a prototype tested on a locally controlled server.) Of course, this is a bad idea in a cloud environment. This brings me back to: how and where do I store secrets for Coiled?
Edit: I've figured out how service account names are used instead of the project_id in GCP.
And how do I get access to those secrets? If you have special access, what APIs do I call to get the coiled-data role credential from coiled?
This brings me back to: how and where do I store secrets for Coiled, and how do I get access to those secrets? What APIs do I call to get the coiled-data role credential from coiled?
The libraries you are using to access GCP resources should automatically be able to use the attached service account. You won't need to add the GCP secrets.
As for other secrets, you can refer to my first reply in this issue on how to handle them.
Ms. Hughes,
Thank you for engaging.
I have "discovered" that GCP uses the project_id to hold the credential name. While this is a poor API choice, IMO, it is what it is.
Thank you for bearing with this noob trying to migrate to your service. If it is any consolation, this Experiment Management System is going to be used this fall for a class on massively parallel computing for statistics. It needs to be simple and easy for statisticians/data scientists to use. That is why we are evaluating Coiled.io.
If you want to try using your local credentials rather than configuring the service account, take a look at the (just released) explanation of using a personal OAuth2 token under https://docs.coiled.io/user_guide/remote-data-access.html#gcp
This requires coiled 0.9.14.
Reading the above, I find that the following command, suggested in the linked doc, downgrades many of my Google modules:
mamba install google-cloud-iam
I'm bringing this up because it is very clear that GCP has quirks. They may not be visible to your sophisticated developers, but to us noobs, they are extremely visible.
Regardless, the command line results in:
Pinned packages:
- python 3.10.*
Transaction
Prefix: /Users/awd/mambaforge/envs/MatrixRecovery
Updating specs:
- google-cloud-iam
- ca-certificates
- certifi
- openssl
Package Version Build Channel Size
───────────────────────────────────────────────────────────────────────────────────────────────
Install:
───────────────────────────────────────────────────────────────────────────────────────────────
+ google-cloud-iam 2.12.1 pyhd8ed1ab_0 conda-forge/noarch 44kB
+ libcst 1.0.1 py310h896817c_0 conda-forge/osx-64 2MB
+ mypy_extensions 1.0.0 pyha770c72_0 conda-forge/noarch 10kB
+ typing_inspect 0.9.0 pyhd8ed1ab_0 conda-forge/noarch 15kB
Change:
───────────────────────────────────────────────────────────────────────────────────────────────
- google-auth-oauthlib 1.0.0 pyhd8ed1ab_1 conda-forge
+ google-auth-oauthlib 1.0.0 pyhd8ed1ab_0 conda-forge/noarch 21kB
Downgrade:
───────────────────────────────────────────────────────────────────────────────────────────────
- cachetools 5.3.1 pyhd8ed1ab_0 conda-forge
+ cachetools 4.2.4 pyhd8ed1ab_0 conda-forge/noarch 13kB
- google-api-core 2.11.1 pyhd8ed1ab_0 conda-forge
+ google-api-core 1.31.5 pyhd8ed1ab_0 conda-forge/noarch 61kB
- google-api-core-grpc 2.11.1 hd8ed1ab_0 conda-forge
+ google-api-core-grpc 1.31.5 hd8ed1ab_0 conda-forge/noarch 4kB
- google-auth 2.22.0 pyh1a96a4e_0 conda-forge
+ google-auth 1.35.0 pyh6c4a22f_0 conda-forge/noarch 83kB
- google-cloud-bigquery-storage 2.18.0 pyh1a96a4e_0 conda-forge
+ google-cloud-bigquery-storage 2.11.0 pyh6c4a22f_0 conda-forge/noarch 8kB
- google-cloud-bigquery-storage-core 2.18.0 pyh1a96a4e_0 conda-forge
+ google-cloud-bigquery-storage-core 2.11.0 pyh6c4a22f_0 conda-forge/noarch 62kB
- google-cloud-core 2.3.3 pyhd8ed1ab_0 conda-forge
+ google-cloud-core 2.3.1 pyhd8ed1ab_0 conda-forge/noarch 28kB
- pandas-gbq 0.19.2 pyh1a96a4e_0 conda-forge
+ pandas-gbq 0.13.2 pyh9f0ad1d_0 conda-forge/noarch 23kB
Summary:
Install: 4 packages
Change: 1 packages
Downgrade: 8 packages
Total download: 2MB
The pandas-gbq downgrade is particularly difficult to embrace.
You might want to try recreating your conda environment, because when I create one from scratch with pandas-gbq, cachetools, google-cloud-iam, coiled, and python~=3.10.0, I get a much newer version of pandas-gbq. Granted, this may be because I'm on an M1 Mac instead of an Intel one.
I end up with (many packages elided so this comment isn't enormous):
+ cachetools 5.3.1 pyhd8ed1ab_0 conda-forge/noarch 15kB
+ coiled 0.9.14 pyhd8ed1ab_0 conda-forge/noarch 137kB
+ google-api-core 1.34.0 pyhd8ed1ab_0 conda-forge/noarch 77kB
+ google-api-core-grpc 1.34.0 hd8ed1ab_0 conda-forge/noarch 6kB
+ google-auth 2.22.0 pyh1a96a4e_0 conda-forge/noarch 102kB
+ google-auth-oauthlib 1.0.0 pyhd8ed1ab_1 conda-forge/noarch 21kB
+ google-cloud-bigquery 3.1.0 pyhd8ed1ab_0 conda-forge/noarch 8kB
+ google-cloud-bigquery-core 3.1.0 pyhd8ed1ab_0 conda-forge/noarch 138kB
+ google-cloud-bigquery-storage 2.18.0 pyh1a96a4e_0 conda-forge/noarch 10kB
+ google-cloud-bigquery-storage-core 2.18.0 pyh1a96a4e_0 conda-forge/noarch 62kB
+ google-cloud-core 2.3.3 pyhd8ed1ab_0 conda-forge/noarch 29kB
+ google-cloud-iam 2.12.1 pyhd8ed1ab_0 conda-forge/noarch 44kB
+ google-crc32c 1.1.2 py310he58995c_4 conda-forge/osx-arm64 24kB
+ google-resumable-media 2.5.0 pyhd8ed1ab_0 conda-forge/noarch 44kB
+ googleapis-common-protos 1.60.0 pyhd8ed1ab_0 conda-forge/noarch 121kB
+ pandas 2.0.3 py310h1cdf563_1 conda-forge/osx-arm64 12MB
+ pandas-gbq 0.17.9 pyh1a96a4e_0 conda-forge/noarch 25kB
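For reference, an environment along those lines can be created from scratch with something like this (the environment name is just an example):

mamba create -n scratch-env -c conda-forge python=3.10 pandas-gbq cachetools google-cloud-iam coiled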
You might want to try recreating your conda environment, because when I create one from scratch with pandas-gbq, cachetools, google-cloud-iam, coiled, and python~=3.10.0, I get a much newer version of pandas-gbq. Granted, this may be because I'm on an M1 Mac instead of an Intel one.
Mr. Blanchard,
Thank you for your suggestion.
The reason I have an environment.yml is so I can destroy the environment often … and I've done so. We'll see if I need the google-cloud-iam feature. As we all know, security is a maze of twisty little passages, all alike.
In other news, as I've started down the include-modules-as-pip-packages path, allow me to request an update to your documentation re: private repositories. Please specify exactly which GitHub Personal Access Token permissions Coiled needs to read/use a GitHub private repository. I'm starting with read-only access to Contents. (BTW, a single-page tutorial that integrates all of this stuff would also be helpful.)
Also, I tried to revert my changes to see if Coiled's fancy automatic package selection worked. It had problems. Happy to share info if Coiled wants. I returned to using my environment.yml file.
Anon, Andrew
P.S. I'm leaving this thread open until I successfully run our code on a Coiled managed cluster.
Well, this has been fun and educational.
Let's discuss environment.yml issues.
Here's a known good and working environment.yml file for an embarrassingly parallel task. It has literally been used to calculate millions of values, and we want to scale it up to use multiple servers. A perfect example of iterative development:
name: MatrixRecovery
channels:
  - conda-forge
  - defaults
dependencies:
  - blas[build=mkl]
  - numpy
  - python=3.10
  - pandas-gbq
  - cvxpy
  - dask
  - coiled
  - sqlalchemy
  - pg8000
  - cloud-sql-python-connector
  - pip
  - pip:
      - git+https://${GITHUB_USER}:${GITHUB_TOKEN}@github.com/adonoho/EMS.git
variables:
  MKL_NUM_THREADS: '1'
  OPENBLAS_NUM_THREADS: '1'
prefix: /Users/awd/opt/anaconda3/envs/MatrixRecovery
It has been updated to be a private GitHub repo implementing a PyPI package. Please note the environment variables; these are a strong suggestion from Matt Rocklin when using Dask to manage parallelism. I doubt automatic package scanning is going to pick these up. All this repackaging as a PyPI package was done to support including a simple Python module consisting of two files: __init__.py and manager.py. While not difficult, it was tedious and was not something I needed to do to use Dask on my laptop and local servers. But that wasn't enough: to use Coiled, I needed to refactor it again. (Yes, I'm grumpy.) First, my laptop/server code:
from dask.distributed import Client, LocalCluster

def do_local_experiment():
    exp = test_experiment()
    with LocalCluster(dashboard_address='localhost:8787') as cluster:
        with Client(cluster) as client:
            do_on_cluster(exp, block_bp_instance_df, client, credentials=get_gbq_credentials())
My Coiled code:
import coiled
from dask.distributed import Client

def do_coiled_experiment():
    exp = test_experiment()
    coiled.create_software_environment(
        name="adonoho/matrix_recovery",
        conda="environment-coiled.yml",
        pip=[
            "git+https://GIT_TOKEN@github.com/adonoho/EMS.git"
        ]
    )
    with coiled.Cluster(n_workers=16) as cluster:
        with Client(cluster) as client:
            do_on_cluster(exp, block_bp_instance_df, client, project_id='coiled-data@xxxx-xxxx-xxxx.iam.gserviceaccount.com')
And the pip section is now elided from the environment.yml file:
name: MatrixRecovery
channels:
  - conda-forge
  - defaults
dependencies:
  - blas[build=mkl]
  - numpy
  - python=3.10
  - pandas-gbq
  - cvxpy
  - dask
  - coiled
  - sqlalchemy
  - pg8000
  - cloud-sql-python-connector
variables:
  MKL_NUM_THREADS: '1'
  OPENBLAS_NUM_THREADS: '1'
prefix: /Users/awd/opt/anaconda3/envs/MatrixRecovery
From a software engineering perspective, this is really suboptimal. (Yes, a little Googling reveals that YAML supports including external files.) Regardless, you are making the activation energy to embrace Coiled pretty high. I will note that I have also used SaturnCloud.io. While they have their own issues, they were certainly easier to scale up.
Finally, the above builds the environment, installs all of the code, and then the 16 workers all fail to start:
INFO:/Users/awd/mambaforge/envs/MatrixRecovery/lib/python3.10/site-packages/coiled/software.py:Attempting to load environment file environment-coiled.yml
INFO:coiled:Creating software environment
INFO:coiled:Software environment already built
INFO:coiled:Software environment created
INFO:coiled:Resolving your local Python environment...
INFO:coiled:Creating Cluster (name: adonoho-a324f55e-c, https://cloud.coiled.io/clusters/259775?account=adonoho ). This usually takes 1-2 minutes...
ERROR:coiled: | Worker Process | adonoho-a324f55e-c-worker-a9f56794d6 | error at 17:42:49 (CDT) | Software build failed -> Conda package install failed with the following errors:
package cairo-1.12.18-7 requires icu 56.*, but none of the providers can be installed
Consider creating a new environment.
By specifying your packages at once, you're more likely to get a consistent set of versions.
The irony is pretty rich that I'm being told to create a new environment. Sigh.
I feel that I am quite close to successfully invoking Coiled. I hope you can help.
Anon, Andrew
Hi there, it looks like you've successfully started a cluster using package sync. Just so you know, your create_software_environment call is actually redundant, as you don't pass the software environment name to the Cluster object, which means it defaults to package sync and does not use your software environment.
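If you do want the cluster to use that software environment instead of package sync, passing its name to the Cluster should do it; a rough sketch:

import coiled
from dask.distributed import Client

# Pass the software environment name so the cluster uses it instead of package sync.
with coiled.Cluster(n_workers=16, software="adonoho/matrix_recovery") as cluster:
    with Client(cluster) as client:
        ...  # run your experiment here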
Cheers
OK, I do not actually want it to use automatic package sync. I want it to use the provided environment-coiled.yml file. Regardless, the system is not resolving packages properly; it is specifically NOT using the specification I want it to use. This is a problem.
Gentlefolk,
I have a small system that manages embarrassingly parallel tasks and writes the resulting dataframes to Google BigQuery. I have a few questions about what I need to do to use Coiled as my cluster creator/manager.
My local GCP credentials live in the ~/.config/gcloud directory. How do I manage secrets in a Coiled-managed environment? Managing secrets is not particularly visible in the Coiled documentation. As I am using GCP for my Coiled cluster, is this a moot question, or am I stumbling into a world of hurt? I also have a student who wants to use a commercial solver, which requires a license file, perhaps even access to a license server.
Thanks for your time perusing my questions.
Anon, Andrew