asascience-open / nextgen-dmac

Public repository describing the prototyping efforts and direction of the Next-Gen DMAC project, "Reaching for the Cloud: Architecting a Cloud-Native Service-Based Ecosystem for DMAC"
MIT License

Run Argo workflows on nextgen-dev cluster #30

Closed. jonmjoyce closed this issue 6 months ago.

jonmjoyce commented 1 year ago
rsignell-usgs commented 1 year ago

@jonmjoyce & @cheryldmorse , Nebari has some kind of built-in support for Argo workflows, but I haven't tested: https://github.com/nebari-dev/nebari/pull/1252

abkfenris commented 1 year ago

@rsignell-usgs when I said Argo, I meant Argo CD, not Argo Workflows.

rsignell-usgs commented 1 year ago

@abkfenris okay, but I thought Jonathan was talking about Argo Workflows in this issue?

abkfenris commented 1 year ago

Ah, I thought that this might have been from you mentioning Argo to him.

rsignell-usgs commented 1 year ago

@jonmjoyce , just curious, what's the use case?

jonmjoyce commented 1 year ago

@benjwadams has been experimenting with ways we can use workflows to perform QA/QC on data, as well as perform any other data harvesting that might need to be done.

rsignell-usgs commented 1 year ago

Would be nice to have an update here on the current strategy from @benjwadams.

benjwadams commented 1 year ago

Current strategy is to pull down the data from a netCDF file and treat this as an artifact.

The artifact is then used by the QC and compliance checking jobs, which then in turn generate their respective artifacts.

There will be either a Panel or a Jupyter Notebook interface for configuring the QC.

Once the Workflows are finished, there may be signals to other processes using Argo Events.

Argo Events may also be used to kick off workflows -- e.g. when a new file has been uploaded.
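
Roughly, the shape would be something like the following untested sketch, written in the Hera style used later in this thread (pre-5.x hera-workflows API). The function names (fetch_netcdf, run_qc, run_compliance), the image, and the S3 paths are hypothetical placeholders, and the Argo Events triggering described above is not shown.

# Sketch only: connection setup (GlobalConfig host/token/namespace) is omitted;
# see the Hera setup later in this thread.
from hera import Task, Workflow

def fetch_netcdf(src, dst):
    # Pull the source netCDF down and stage it where downstream jobs can read it
    # (the "artifact" described above), e.g. an S3 bucket via fsspec.
    import fsspec
    with fsspec.open(src, "rb") as fin, fsspec.open(dst, "wb") as fout:
        fout.write(fin.read())

def run_qc(src, dst):
    # Placeholder: run QARTOD-style QC (e.g. ioos_qc) on the staged file and
    # write the QC results out as a new artifact.
    ...

def run_compliance(src, dst):
    # Placeholder: run compliance checking on the staged file and write results.
    ...

staged = "s3://my-bucket/staging/obs.nc"  # hypothetical artifact location

with Workflow("qc-and-compliance") as w:
    fetch = Task("fetch", fetch_netcdf,
                 [{"src": "https://example.org/obs.nc", "dst": staged}],
                 image="pangeo/base-notebook")
    qc = Task("qc", run_qc,
              [{"src": staged, "dst": "s3://my-bucket/results/qc.json"}],
              image="pangeo/base-notebook")
    compliance = Task("compliance", run_compliance,
                      [{"src": staged, "dst": "s3://my-bucket/results/compliance.json"}],
                      image="pangeo/base-notebook")
    # QC and compliance checking both consume the staged artifact
    fetch >> qc
    fetch >> compliance

w.create()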

rsignell-usgs commented 1 year ago

Thanks @benjwadams ! Is this work being done on the Nebari deployment that @cheryldmorse set up?
(not sure where that endpoint is). Nebari has Argo workflow integration (though I haven't tried it), and it also has the ability to publish Panel dashboards that other users can see. It might be nice to try the app there, since other Nebari users could then follow that pattern.

I could also give you access to the ESIP Nebari deployment at https://nebari.esipfed.org -- just let me know!

mwengren commented 1 year ago

We have a Nebari instance running somewhere???

I'd recommend sharing more details with the project group and steering committee about any resources they can view, such as Nebari for example, so they can get a better idea of the work being done.

I realize a vanilla Nebari instance probably isn't of much interest, but if we had some demo notebooks or other relevant materials that participants could try for themselves, it might spur more engagement.

We can discuss this during the upcoming steering committee meeting, or share links somewhere outside of GitHub if these resources are protected and/or require user accounts/logins to be created.

rsignell-usgs commented 1 year ago

@benjwadams on the Nebari dev & community call, I asked about the status of Argo workflows on Nebari and they said they are working, but they also mentioned there is currently no great Python API option. @dharhas mentioned he gave a demo using https://github.com/argoproj-labs/hera-workflows by following an example, because the documentation is mostly missing.

What are you using?

dharhas commented 1 year ago

So if you want to use Argo within Nebari, there are a couple of steps that are not in the docs. I can put the details here later this afternoon.

dharhas commented 1 year ago

Please file bugs on the nebari github if you have issues. Documentation and example contributions are welcome as you learn how to use this feature.

Argo CLI

If you want to use the argo CLI, you will need to download and unzip the argo CLI binary. You can do this from the Jupyter terminal: place the binary somewhere in your home directory and add that location to your PATH via .bash_profile.

You do not need the argo CLI to use Argo from Python, but it has some useful features if you don't want to use the web GUI.

https://argoproj.github.io/argo-workflows/walk-through/argo-cli/

# Download the binary, unzip it, and put it somewhere on your PATH (e.g. ~/bin)
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/v3.4.5/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64 && mkdir -p ~/bin && mv argo-linux-amd64 ~/bin/argo

Setting up credentials

  1. Go to https://<nebari_url>/argo and log in with single sign-on
  2. Click on the 'user' icon in the left menu
  3. You will see a section that says 'Using your Login with the CLI'; in that section, click COPY TO CLIPBOARD and you will get the information below for your setup.
export ARGO_SERVER='<nebari_url>:443' 
export ARGO_HTTP1=true  
export ARGO_SECURE=true
export ARGO_BASE_HREF=argo/
export ARGO_TOKEN='Bearer v2:BLAHBLAHBLAH'
export ARGO_NAMESPACE=argo  # or whatever your namespace is
export KUBECONFIG=/dev/null  # recommended

# check it works: 
argo list

To use Argo from Python (Hera), one more environment variable needs to be set, as you can see from the script below; some of the other variables set above are not needed.

export ARGO_TOKEN_TOKEN='v2:BLAHBLAH'

This is the same as the ARGO_TOKEN above but without the string 'Bearer ' at the beginning 🤷🏽
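
If you'd rather not maintain two nearly identical variables, one (untested) alternative is to strip the prefix in Python before configuring Hera, assuming ARGO_TOKEN is exported as shown above:

import os

# Hera wants the bare token, i.e. ARGO_TOKEN without the leading 'Bearer ' prefix
# (assumes ARGO_TOKEN was exported as in the credentials section above).
bare_token = os.environ['ARGO_TOKEN'].removeprefix('Bearer ')
os.environ.setdefault('ARGO_TOKEN_TOKEN', bare_token)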

Python API options

https://github.com/argoproj/argo-workflows/blob/master/docs/client-libraries.md

Argo has two Python APIs: Couler and Hera.

Couler is better documented but doesn't seem to be actively maintained (last significant work was done 2 years ago) and the API is very verbose.

Hera has a much nicer API and seems to be actively maintained, but it only has API docs and a few examples. It seems to be primarily written by one engineer at a biotech firm.

Script I ran for the demo

# NOTE: imports reconstructed for readability; they assume the pre-5.x
# hera-workflows API used in this demo, plus rich, questionary, python-dotenv,
# psutil, and s3fs in the environment.
import datetime
import os

import questionary
import s3fs
from dotenv import load_dotenv
from hera import GlobalConfig, GPUToleration, Resources, Task, Workflow
from rich import print
from rich.live import Live
from rich.table import Table

load_dotenv()
GlobalConfig.token = os.environ['ARGO_TOKEN_TOKEN']
GlobalConfig.host = "https://<nebari_url>/argo"  # your Nebari domain here
GlobalConfig.namespace = "dev"
key = os.environ['AWS_ACCESS_KEY_ID']
secret = os.environ['AWS_SECRET_ACCESS_KEY']

available_instances = [
    'n1-standard-4',
    'n1-standard-8',
    'nvidia-tesla-k80-x1',
    'nvidia-tesla-k80-x2',
    'nvidia-tesla-k80-x4',
    'nvidia-tesla-k80-x8',
    'nvidia-tesla-t4-x1',
    'nvidia-tesla-t4-x2',
    'nvidia-tesla-t4-x4',
    'nvidia-a100-x1',
    'nvidia-a100-x2',
]


def machine_info(dst, key, secret):
    # Runs inside the workflow pod: report CPU/RAM/GPU details for the instance
    # and write the result to S3.
    import os
    import multiprocessing

    import psutil
    import s3fs

    os.environ['AWS_ACCESS_KEY_ID'] = key
    os.environ['AWS_SECRET_ACCESS_KEY'] = secret
    fs = s3fs.S3FileSystem()

    bucket = 'XXXXXXXXXX'
    msg = f"""
Instance Details
================
CPUs : {multiprocessing.cpu_count()}
RAM  : {psutil.virtual_memory().total//(1024.**3):.2f} GB

Available GPUs
--------------
{os.popen('nvidia-smi -L').read()}
"""
    fs.write_text(dst, msg)


def run_workflow(instances, s3_bucket='XXXXXXXX'):
    # Build one Task per selected instance type and submit the whole sweep
    # as a single Argo Workflow.
    dt = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    output_files = {}
    with Workflow(f"sweep-demo-{dt}") as w:
        for instance_type in instances:
            dst = f"{s3_bucket}/{dt}/perfsweep{instance_type}.txt"
            output_files[instance_type] = dst
            if 'nvidia' in instance_type:
                node, n_gpu = instance_type.split('-x')
                Task(
                    instance_type,
                    machine_info,
                    [{"dst": dst, "key": key, "secret": secret}],
                    image="pangeo/pytorch-notebook",
                    resources=Resources(gpus=int(n_gpu)),
                    tolerations=[GPUToleration],
                    node_selectors={"cloud.google.com/gke-accelerator": node},
                )
            else:
                Task(
                    instance_type,
                    machine_info,
                    [{"dst": dst, "key": key, "secret": secret}],
                    image="pangeo/pytorch-notebook",
                    node_selectors={"beta.kubernetes.io/instance-type": instance_type},
                )

    w.create()
    return output_files


if __name__ == "__main__":
    print('\nSelect [red]instance(s)[/red] to include in performance sweep.\n')
    instances = questionary.checkbox('', choices=available_instances).ask()
    print('\nSelect [yellow]Software Environment[/yellow] [italic](Currently not Implemented)[/italic]\n')
    envs = questionary.select('', choices=['PyTorch-1.13.1', 'PyTorch-2.0', 'TensorFlow-2.11.0']).ask()
    files = run_workflow(instances)
    print(f"\n Performance Sweep Started via Argo Workflows using the [yellow]{envs}[/yellow] environment \n\n Running on Instances: \n")

    for instance_type, loc in files.items():
        print(f"  - [red]{instance_type}")

    print("\n [bold blue]Results Table \n")

    table = Table(show_lines=True)
    table.add_column("Instance", vertical="middle")
    table.add_column("Results")
    table.add_column("File")

    fs = s3fs.S3FileSystem()

    # Poll S3 until every instance has written its results file, adding rows
    # to the live table as they arrive.
    with Live(table, refresh_per_second=4):
        while files:
            for instance_type, fname in list(files.items()):
                if fs.exists(fname):
                    result = fs.open(fname, 'r').read()
                    table.add_row(f"[red]{instance_type}", f"[blue]{result}", f"{fname}")
                    del files[instance_type]
                    continue

rsignell-usgs commented 1 year ago

@benjwadams if you have access to a Nebari deployment for IOOS, great!
If you don't, let me know and I could add you as an authorized user on the ESIP Nebari deployment (https://nebari.esipfed.org)

benjwadams commented 1 year ago

Hey, I haven't yet interacted with Nebari. I've mainly been tinkering with Argo Workflows. We should be able to call workflows via the Argo SDK or HTTP requests.
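
As a rough, untested sketch of the HTTP route: the Argo Server exposes a REST API, so with the credentials from the setup above you could list (or submit) workflows with plain requests calls. The base URL below is a placeholder and assumes Argo is served under the /argo base href as in the Nebari setup earlier in this thread.

import os

import requests

# Hypothetical base URL; with the Nebari setup above, Argo is served under /argo
base = "https://<nebari_url>/argo"
headers = {"Authorization": os.environ["ARGO_TOKEN"]}  # the full 'Bearer v2:...' value
namespace = os.environ.get("ARGO_NAMESPACE", "argo")

# List workflows in the namespace via the Argo Server REST API
resp = requests.get(f"{base}/api/v1/workflows/{namespace}", headers=headers)
resp.raise_for_status()
for wf in resp.json().get("items") or []:
    print(wf["metadata"]["name"], wf.get("status", {}).get("phase"))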

Please send me some details via email regarding the ESIP Nebari deployment.

dharhas commented 1 year ago

@benjwadams

The official Argo Python SDK has not been maintained for over 3 years, and folks are being told to use one of the two other packages I mentioned above. Using HTTP requests or the CLI will work.

Note: I just found a new SDK that looks a bit better maintained. I will have to test this out.

https://github.com/argoproj/argo-workflows/tree/master/sdks/python

EDIT: Yeah, this SDK looks terrible from a docs perspective and from an end Python-user perspective. Hera still wins my vote so far.