coiled / feedback

A place to provide Coiled feedback

Initial User Questions … #256

Closed adonoho closed 10 months ago

adonoho commented 10 months ago

Gentlefolk,

I have a small system that manages embarrassingly parallel tasks and writes the resulting dataframes to Google Big Query. I have a few questions about what I need to do to use Coiled as my cluster creator/manager.

  1. Currently, I have Google BigQuery service account credentials in my root node's ~/.config/gcloud directory. How do I manage secrets in a Coiled managed environment? Managing secrets is not particularly visible in the Coiled documentation. As I am using GCP for my coiled cluster, is this a moot question or am I stumbling into a world of hurt? I also have a student that wants to use a commercial solver, which requires a license file, perhaps even access to a license server.
  2. It seems a little extreme that I have to package an imported module as a private pip-installable package. As we are a conda/mamba shop, do you have instructions on how to package private code up for it?

Thanks for your time perusing my questions.

Anon, Andrew

shughes-uk commented 10 months ago

Hi!

  1. For data access in GCP, I believe the Google SDK is smart enough to automatically use the service account we attach to every cluster you make. Documentation on customizing it can be found here if you want to add extra permissions. As for general secrets management, you can use send_private_envs to handle environment variables, and possibly the Dask upload_file method to get your license file onto the workers (see the sketch after this list). In our default configuration, all workers have access to the public internet; would that be enough to talk to the license server?
  2. If you are able to use the package sync feature, this is all handled for you. Any importable python code on your local machine should be automatically available on the cluster. Our docs might be slightly out of date as this is a newer feature.
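
A minimal sketch of combining those two, assuming send_private_envs takes a dict of environment variables and the license file can simply live in the workers' local directory (the variable and file names are placeholders):

import coiled
from dask.distributed import Client

cluster = coiled.Cluster(n_workers=4)
# Send secrets to the running cluster as environment variables.
cluster.send_private_envs({"SOLVER_LICENSE_SERVER": "https://license.example.com:27000"})

client = Client(cluster)
# Ship the license file to every worker's local directory.
client.upload_file("solver.lic")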

Cheers!

adonoho commented 10 months ago

Hi, Ms. Hughes, I presume,

Translating systems-engineer speak to scientific-programmer speak is sometimes difficult. Allow me to compare notes. The known-good, Dask-driven service account that writes to GBQ is (from the console):

pandas-gbq-datasource@hs-deep-lab-donoho.iam.gserviceaccount.com
BigQuery Data Editor
BigQuery Data Owner
BigQuery Data Viewer
BigQuery Job User
BigQuery Metadata Viewer

Your default data-role YAML file is:

title: coiled-data
description: service account attached to cluster instances for data access
stage: GA
includedPermissions:
- logging.logEntries.create
- storage.buckets.get
- storage.buckets.create
- storage.objects.create
- storage.objects.get
- storage.objects.list
- storage.objects.update

Based upon your YAML, I need to add the following to your service account:

- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.dataViewer
- roles/bigquery.jobUser
- roles/bigquery.metadataViewer

Then the next issue is: where is Coiled hiding the default credentials at runtime? The pandas-gbq library wants me to pass in credentials.

Anon, Andrew

shughes-uk commented 10 months ago

Hi there,

The permissions you are adding look correct based on what you've said.

The credentials question is a bit puzzling; is pandas-gbq prompting you for credentials? When using a Coiled cluster, it should be capable of automatically detecting and using the credentials of the service account attached to the cluster.
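
A minimal sketch of a check you could run on a worker (for example via client.submit) to confirm the attached service account is picked up through Application Default Credentials; the query is just a placeholder:

import google.auth
import pandas_gbq

def check_default_credentials():
    # On a Coiled worker in GCP this should resolve to the attached service account.
    credentials, project = google.auth.default()
    # No credentials= argument: pandas-gbq falls back to those default credentials.
    return pandas_gbq.read_gbq("SELECT 1 AS ok", project_id=project)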

Cheers

adonoho commented 10 months ago

Ms. Hughes,

Thank you.

As a noob to using GCP (I've done all my Dask work on local servers), I use the credentials as an indicator to use GBQ as my remote storage system. I suspect there is a way for me to query GCP and get the credentials to continue using them as my indicator.

Thank you and Anon, Andrew

ntabris commented 10 months ago

@adonoho very soon (maybe later today) we'll be adding the ability for Coiled to make use of your locally configured GCP credentials by shipping an OAuth2 token that you can then use in your code. There will be a doc with code samples. We can share a link once that goes public, and maybe that will be an approach that's easier for you to use.

adonoho commented 10 months ago

BTW, one of the main reasons to use Coiled is to remain as 'noobish' as possible with respect to GCP. Thank you for helping.

adonoho commented 10 months ago

Also, I attended one of your classes about how to use Dask about a year ago. It was quite useful. I suspect a GCP as well as an AWS class might be helpful too.

ntabris commented 10 months ago

BTW, one of the main reasons to use Coiled is to remain as 'noobish' as possible with respect to GCP.

Totally agree! Hopefully the new approach will be pretty straightforward so that if you have code that works locally, you can get it working with the same permission on the cluster. And if that's not true, happy to work with you so we can get closer to that goal.

adonoho commented 10 months ago

I've now built my first pip package because you require it. Coiled's documentation was helpful but incomplete. Primarily, it needs to be documentation for someone moving a module into a package for the first time, like me. Also, I am a research engineer; I've never had to make a PyPI package before, and I suspect most noob users are like me. Hence, spelling out exactly which commands to use would be helpful. Finally, I am a conda/mamba user and have been conditioned by that community to NEVER use pip, so guidance on how I need to change my environment.yml file would also be helpful.

adonoho commented 10 months ago

The permissions you are adding look correct based on what you've said.

Even though I got these symbols from the gcloud docs, gcloud doesn't like them:

gcloud iam roles create coiled_data --project project-id --file coiled-data-role.yaml

Produces:

ERROR: (gcloud.iam.roles.create) INVALID_ARGUMENT: Permission roles/bigquery.dataOwner is not valid.

For the record, here is the YAML file:

title: coiled-data
description: Service account attached to cluster instances for data and BigQuery access.
stage: GA
includedPermissions:
- logging.logEntries.create
- storage.buckets.get
- storage.buckets.create
- storage.objects.create
- storage.objects.get
- storage.objects.list
- storage.objects.update
- roles/bigquery.dataEditor
- roles/bigquery.dataOwner
- roles/bigquery.dataViewer
- roles/bigquery.jobUser
- roles/bigquery.metadataViewer

Note: roles/bigquery.dataOwner is the second permission that uses the roles/ prefix.

Second Note: Security is always fun!

As a repackager of GCP services, Coiled is likely to have more leverage in getting Google to support us than this lowly research engineer. Also, this is BigQuery we are talking about, Google's premier noSQL database/data warehouse product. This should not be this hard. Thank you for your help.

Anon, Andrew

adonoho commented 10 months ago

On the good news front, I have successfully created the PyPI stuff to incorporate my tiny module into Coiled's package scanning process.

ntabris commented 10 months ago

It looks like you're trying to edit the coiled-data role. What you want to do is add those additional roles to the coiled-data@<your-project>.iam.gserviceaccount.com Service Account. You should be able to find service accounts at https://console.cloud.google.com/iam-admin/serviceaccounts
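
For example, something along these lines (run once per role; <your-project> is a placeholder) grants a role to that service account:

gcloud projects add-iam-policy-binding <your-project> \
  --member="serviceAccount:coiled-data@<your-project>.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"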

Apologies that this is all a bit confusing. We're hoping that the new functionality to ship user credentials will simplify things.

ntabris commented 10 months ago

If you want to try using your local credentials rather than configuring the service account, take a look at the (just released) explanation of using a personal OAuth2 token under https://docs.coiled.io/user_guide/remote-data-access.html#gcp

This requires coiled 0.9.14.

dan-blanchard commented 10 months ago

On the good news front, I have successfully created the PyPI stuff to incorporate my tiny module into Coiled's package scanning process.

You actually should not need to follow those steps anymore, unless your project is a bit more complex, as we scan for any importable Python packages and zip those up and put them on the cluster. I'll make a note that we should update those docs.
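
As a rough sketch of what that means in practice (assuming your EMS package, with its manager module, is importable locally; run_experiment is a placeholder name):

import coiled
from dask.distributed import Client
from EMS import manager  # local, unpublished code; package sync ships it to the cluster

cluster = coiled.Cluster(n_workers=4)  # no software= argument, so package sync is used
client = Client(cluster)

# Functions from the local module can be submitted directly.
future = client.submit(manager.run_experiment, 42)
print(future.result())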

adonoho commented 10 months ago

It looks like you're trying to edit the coiled-data role. What you want to do is add those additional roles to the coiled-data@<your-project>.iam.gserviceaccount.com Service Account. You should be able to find service accounts at https://console.cloud.google.com/iam-admin/serviceaccounts

Apologies that this is all a bit confusing. We're hoping that the new functionality to ship user credentials will simplify things.

I have successfully added my required permissions to the coiled-data service account. In the past, I would just get the JSON credentials and install them in a secret place. (Yes, I know this isn't terribly secure. This is also a prototype tested on a locally controlled server.) Of course, this is a bad idea in a cloud environment. This brings me back to: how and where do I store secrets for Coiled?

Edit: I've figured out how service account names are used instead of the project_id in GCP.

And how do I get access to those secrets? If you have special access, what APIs do I call to get the coiled-data role credential from Coiled?

shughes-uk commented 10 months ago

It looks like you're trying to edit the coiled-data role. What you want to do is add those additional roles to the coiled-data@<your-project>.iam.gserviceaccount.com Service Account. You should be able to find service accounts at https://console.cloud.google.com/iam-admin/serviceaccounts Apologies that this is all a bit confusing. We're hoping that the new functionality to ship user credentials will simplify things.

I have successfully added my required permissions to the coiled-data service account. In the past, I would just get the json data and install it in a secret place. (Yes, I know this isn't terribly secure. This is also a prototype tested on a locally controlled server.) Of course, this is a bad idea in a cloud environment. This brings me back to how and where do I store secrets for coiled? And how do I get access to those secrets. If you have special access, what APIs do I call to get the coiled-data role credential from coiled?

The libraries you are using to access GCP resources should automatically be able to use the attached service account. You won't need to add the GCP secrets.

As for other secrets, you can refer to my first reply to this issue on how to handle them.

adonoho commented 10 months ago

Ms. Hughes,

Thank you for engaging.

I have "discovered" that GCP uses the project_id to hold the credential name. While this is a poor API choice, IMO, it is what it is.

Thank you for bearing with this noob in trying to migrate to using your service. If it is any consolation, this Experiment Management System is going to be used this fall in a class on massively parallel computing for statistics. It needs to be simple and easy for statisticians/data scientists to use. That is why we are evaluating Coiled.io.

adonoho commented 10 months ago

If you want to try using your local credentials rather than configuring the service account, take a look at the (just released) explanation of using person oauth2 token under https://docs.coiled.io/user_guide/remote-data-access.html#gcp

This requires coiled 0.9.14.

Reading the above, I find that the following command, suggested in the above link, downgrades many of my google modules:

mamba install google-cloud-iam

I'm bringing this up because it is very clear that GCP has quirks. They may not be visible to your sophisticated developers. But to us noobs, they are extremely visible.

Regardless, the command line results in:

Pinned packages:
  - python 3.10.*

Transaction

  Prefix: /Users/awd/mambaforge/envs/MatrixRecovery

  Updating specs:

   - google-cloud-iam
   - ca-certificates
   - certifi
   - openssl

  Package                               Version  Build            Channel                Size
───────────────────────────────────────────────────────────────────────────────────────────────
  Install:
───────────────────────────────────────────────────────────────────────────────────────────────

  + google-cloud-iam                     2.12.1  pyhd8ed1ab_0     conda-forge/noarch     44kB
  + libcst                                1.0.1  py310h896817c_0  conda-forge/osx-64      2MB
  + mypy_extensions                       1.0.0  pyha770c72_0     conda-forge/noarch     10kB
  + typing_inspect                        0.9.0  pyhd8ed1ab_0     conda-forge/noarch     15kB

  Change:
───────────────────────────────────────────────────────────────────────────────────────────────

  - google-auth-oauthlib                  1.0.0  pyhd8ed1ab_1     conda-forge                
  + google-auth-oauthlib                  1.0.0  pyhd8ed1ab_0     conda-forge/noarch     21kB

  Downgrade:
───────────────────────────────────────────────────────────────────────────────────────────────

  - cachetools                            5.3.1  pyhd8ed1ab_0     conda-forge                
  + cachetools                            4.2.4  pyhd8ed1ab_0     conda-forge/noarch     13kB
  - google-api-core                      2.11.1  pyhd8ed1ab_0     conda-forge                
  + google-api-core                      1.31.5  pyhd8ed1ab_0     conda-forge/noarch     61kB
  - google-api-core-grpc                 2.11.1  hd8ed1ab_0       conda-forge                
  + google-api-core-grpc                 1.31.5  hd8ed1ab_0       conda-forge/noarch      4kB
  - google-auth                          2.22.0  pyh1a96a4e_0     conda-forge                
  + google-auth                          1.35.0  pyh6c4a22f_0     conda-forge/noarch     83kB
  - google-cloud-bigquery-storage        2.18.0  pyh1a96a4e_0     conda-forge                
  + google-cloud-bigquery-storage        2.11.0  pyh6c4a22f_0     conda-forge/noarch      8kB
  - google-cloud-bigquery-storage-core   2.18.0  pyh1a96a4e_0     conda-forge                
  + google-cloud-bigquery-storage-core   2.11.0  pyh6c4a22f_0     conda-forge/noarch     62kB
  - google-cloud-core                     2.3.3  pyhd8ed1ab_0     conda-forge                
  + google-cloud-core                     2.3.1  pyhd8ed1ab_0     conda-forge/noarch     28kB
  - pandas-gbq                           0.19.2  pyh1a96a4e_0     conda-forge                
  + pandas-gbq                           0.13.2  pyh9f0ad1d_0     conda-forge/noarch     23kB

  Summary:

  Install: 4 packages
  Change: 1 packages
  Downgrade: 8 packages

  Total download: 2MB

The pandas-gbq downgrade is particularly difficult to embrace.

dan-blanchard commented 10 months ago

You might want to try recreating your conda environment, because when I create one from scratch with pandas-gbq, cachetools, google-cloud-iam, coiled, and python~=3.10.0, I get a much newer version of pandas-gbq. Granted, this may be because I'm on an M1 Mac instead of an Intel one.
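
Roughly the shape of what I ran (a from-scratch environment on conda-forge only; I've pinned Python with a plain = here, and the environment name is arbitrary):

mamba create -n gbq-test -c conda-forge python=3.10 pandas-gbq cachetools google-cloud-iam coiled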

I end up with (many packages elided so this comment isn't enormous):

  + cachetools                                 5.3.1  pyhd8ed1ab_0          conda-forge/noarch          15kB
  + coiled                                    0.9.14  pyhd8ed1ab_0          conda-forge/noarch         137kB
  + google-api-core                           1.34.0  pyhd8ed1ab_0          conda-forge/noarch          77kB
  + google-api-core-grpc                      1.34.0  hd8ed1ab_0            conda-forge/noarch           6kB
  + google-auth                               2.22.0  pyh1a96a4e_0          conda-forge/noarch         102kB
  + google-auth-oauthlib                       1.0.0  pyhd8ed1ab_1          conda-forge/noarch          21kB
  + google-cloud-bigquery                      3.1.0  pyhd8ed1ab_0          conda-forge/noarch           8kB
  + google-cloud-bigquery-core                 3.1.0  pyhd8ed1ab_0          conda-forge/noarch         138kB
  + google-cloud-bigquery-storage             2.18.0  pyh1a96a4e_0          conda-forge/noarch          10kB
  + google-cloud-bigquery-storage-core        2.18.0  pyh1a96a4e_0          conda-forge/noarch          62kB
  + google-cloud-core                          2.3.3  pyhd8ed1ab_0          conda-forge/noarch          29kB
  + google-cloud-iam                          2.12.1  pyhd8ed1ab_0          conda-forge/noarch          44kB
  + google-crc32c                              1.1.2  py310he58995c_4       conda-forge/osx-arm64       24kB
  + google-resumable-media                     2.5.0  pyhd8ed1ab_0          conda-forge/noarch          44kB
  + googleapis-common-protos                  1.60.0  pyhd8ed1ab_0          conda-forge/noarch         121kB
  + pandas                                     2.0.3  py310h1cdf563_1       conda-forge/osx-arm64       12MB
  + pandas-gbq                                0.17.9  pyh1a96a4e_0          conda-forge/noarch          25kB

adonoho commented 10 months ago

You might want to try recreating your conda environment, because when I create one from scratch with pandas-gbq, cachetools, google-cloud-iam, coiled, and python~=3.10.0, I get a much newer version of pandas-gbq. Granted, this may be because I'm on an M1 Mac instead of an Intel one.

Mr. Blanchard,

Thank you for your suggestion.

The reason I have an environment.yml is to destroy the environment often … and I've done so. We'll see if I need the google-cloud-iam feature. As we all know, security is a maze of twisty little passages, all alike.

In other news, as I've started down the "include modules as pip packages" path, allow me to request an update to your documentation re: private repositories. Please specify exactly which GitHub Personal Access Token permissions Coiled needs to read/use a GitHub private repository. I'm starting with read-only access to Contents. (BTW, a single-page tutorial that integrates all of this stuff would also be helpful.)

Also, I tried to revert my changes to see if Coiled's fancy automatic package selection worked. It had problems. Happy to share info if Coiled wants. I returned to using my environment.yml file.

Anon, Andrew

P.S. I'm leaving this thread open until I successfully run our code on a Coiled managed cluster.

adonoho commented 10 months ago

Well, this has been fun and educational.

Let's discuss environment.yml issues.

Here's a known good and working environment.yml file for an embarrassingly parallel task. It has literally been used to calculate millions of values and we want to scale it up to use multiple servers. A perfect example of iterative development:

name: MatrixRecovery
channels:
  - conda-forge
  - defaults
dependencies:
  - blas[build=mkl]
  - numpy
  - python=3.10
  - pandas-gbq
  - cvxpy
  - dask
  - coiled
  - sqlalchemy
  - pg8000
  - cloud-sql-python-connector
  - pip
  - pip:
    - git+https://${GITHUB_USER}:${GITHUB_TOKEN}@github.com/adonoho/EMS.git
variables:
  MKL_NUM_THREADS: '1'
  OPENBLAS_NUM_THREADS: '1'
prefix: /Users/awd/opt/anaconda3/envs/MatrixRecovery

It has been updated to be a private GitHub repo implementing a PyPI package. Please note the environment variables. These are strongly suggested by Matt Rocklin when using Dask to manage parallelism, and I doubt automatic package scanning is going to pick them up. All this repackaging as a PyPI package was done to support including a simple Python module consisting of two files: __init__.py and manager.py. While not difficult, it was tedious, and it was not something I needed to do to use Dask on my laptop and local servers. But that wasn't enough: to use Coiled, I needed to refactor it for Coiled. (Yes, I'm grumpy.) First, my laptop/server code:

from dask.distributed import Client, LocalCluster
# test_experiment, do_on_cluster, block_bp_instance_df, and get_gbq_credentials
# come from my own project code (the EMS package and this script).

def do_local_experiment():
    exp = test_experiment()
    with LocalCluster(dashboard_address='localhost:8787') as cluster:
        with Client(cluster) as client:
            do_on_cluster(exp, block_bp_instance_df, client, credentials=get_gbq_credentials())

My Coiled code:

import coiled
from dask.distributed import Client

def do_coiled_experiment():
    exp = test_experiment()
    coiled.create_software_environment(
        name="adonoho/matrix_recovery",
        conda="environment-coiled.yml",
        pip=[
            "git+https://GIT_TOKEN@github.com/adonoho/EMS.git"
        ]
    )
    with coiled.Cluster(n_workers=16) as cluster:
        with Client(cluster) as client:
            do_on_cluster(exp, block_bp_instance_df, client, project_id='coiled-data@xxxx-xxxx-xxxx.iam.gserviceaccount.com')

And the pip code is now elided in the environment.yml file:

name: MatrixRecovery
channels:
  - conda-forge
  - defaults
dependencies:
  - blas[build=mkl]
  - numpy
  - python=3.10
  - pandas-gbq
  - cvxpy
  - dask
  - coiled
  - sqlalchemy
  - pg8000
  - cloud-sql-python-connector
variables:
  MKL_NUM_THREADS: '1'
  OPENBLAS_NUM_THREADS: '1'
prefix: /Users/awd/opt/anaconda3/envs/MatrixRecovery

From a software engineering perspective, this is really suboptimal. (Yes, a little Googling reveals that YAML supports including external files.) Regardless, you are making the activation energy to embrace Coiled pretty high. I will note that I have also used SaturnCloud.io. While they have their own issues, they were certainly easier to scale up.

Finally, the above builds the environment and installs all of the code, and then the 16 workers all fail to start:

INFO:/Users/awd/mambaforge/envs/MatrixRecovery/lib/python3.10/site-packages/coiled/software.py:Attempting to load environment file environment-coiled.yml
INFO:coiled:Creating software environment
INFO:coiled:Software environment already built
INFO:coiled:Software environment created
INFO:coiled:Resolving your local Python environment...
INFO:coiled:Creating Cluster (name: adonoho-a324f55e-c, https://cloud.coiled.io/clusters/259775?account=adonoho ). This usually takes 1-2 minutes...
ERROR:coiled:   | Worker Process         | adonoho-a324f55e-c-worker-a9f56794d6           | error      at 17:42:49 (CDT) | Software build failed -> Conda package install failed with the following errors:

package cairo-1.12.18-7 requires icu 56.*, but none of the providers can be installed

Consider creating a new environment.
By specifying your packages at once, you're more likely to get a consistent set of versions.

The irony is pretty rich that I'm being told to create a new environment. Sigh.

I feel that I am quite close to successfully invoking Coiled. I hope you can help.

Anon, Andrew

shughes-uk commented 10 months ago

Hi there, it looks like you've successfully started a cluster using package sync. Just so you know, your create_software_environment call is actually redundant as you don't pass the software environment name to the Cluster object, which means it defaults to package sync and does not use your software environment.
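
If you do want the pre-built environment, a one-line sketch of the change, assuming the software= keyword on coiled.Cluster is the right hook (the name matches your create_software_environment call):

with coiled.Cluster(n_workers=16, software="adonoho/matrix_recovery") as cluster:
    ...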

Cheers

adonoho commented 10 months ago

OK, I do not actually want it to use automatic package sync. I want it to use the provided environment-coiled.yml file. Regardless, the system is not resolving packages properly. It is specifically NOT using the specification I want it to use. This is a problem.