ESIPFed / esiphub-dev

Development JupyterHub on AWS targeting pangeo environment for National Water Model exploration
MIT License

Explore FUSE access to original NWM NetCDF files on S3 #8

Closed · rsignell-usgs closed this issue 6 years ago

rsignell-usgs commented 6 years ago

I just heard on a pangeo web meeting that the Met Office developed a FUSE toolbox that you can use to mount all your S3 content as a file system: https://github.com/informatics-lab/s3-fuse-flex-volume/blob/master/README.md

We should enable this so we can compare this baseline to other approaches like zarr and HSDS.

zflamig commented 6 years ago

https://github.com/IntelAI/experimental-kvc

This might be another good option to look at

rsignell-usgs commented 6 years ago

@zflamig have you experimented with this yet?
Are you getting enhanced performance relative to FUSE?

zflamig commented 6 years ago

@rsignell-usgs Not yet... got distracted with other things unfortunately.

rsignell-usgs commented 6 years ago

Users on http://pangeo.esipfed.org now have access to any public-read S3 bucket via /s3/<bucket>, following the Met Office approach:

  1. Install this cluster-level plugin: https://github.com/informatics-lab/s3-fuse-flex-volume
  2. Modify jupyter-config.yaml to include the storage section as directed (a sketch follows this list) and update the pangeo helm chart.
  3. Make sure the IAM role on the K8s nodes has the AmazonS3ReadOnlyAccess policy attached.
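
For step 2, the storage section ends up looking roughly like this. This is a sketch, assuming the zero-to-jupyterhub singleuser.storage.extraVolumes / extraVolumeMounts convention; the driver name and options mirror the worker template later in this thread:

singleuser:
  storage:
    extraVolumes:
      # S3 FUSE volume from the informatics-lab flex-volume plugin
      - name: s3
        flexVolume:
          driver: informaticslab/pysssix-flex-volume
          options:
            readonly: "true"
    extraVolumeMounts:
      - name: s3
        mountPath: /s3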

So now we can see the National Water Model data at:

ls /s3/noaa-nwm-pds/

rsignell-usgs commented 6 years ago

It turns out my notebook is seeing /s3, but the dask workers are not.
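
A quick way to confirm the mismatch (a sketch, assuming kubectl access to the cluster; the worker pod name is a placeholder) is to list the mount in each pod directly:

kubectl exec -n esip-dev jupyter-rsignell-2dusgs -- ls /s3
kubectl exec -n esip-dev <dask-worker-pod> -- ls /s3

The first command returns the bucket listing while the second fails, since only the notebook pod has the flexVolume configured.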

rsignell-usgs commented 6 years ago

So @minrk helped me fix this problem. Hurrah for the SciPy 2018 code sprint! Once he saw that the notebook user pod was working, he had me dump its parameters and steal the settings from there to populate custom-worker-template.yaml, which now looks like this:

# custom-worker-template.yaml: pod template for the dask workers, mirroring
# the FUSE settings from the notebook user pod
metadata:
spec:
  restartPolicy: Never
  volumes:
    # the S3 FUSE mount, provided by the informatics-lab flex-volume driver
    - name: s3
      flexVolume:
        driver: informaticslab/pysssix-flex-volume
        options:
          readonly: "true"
  containers:
    - name: dask-worker
      image: esip/pangeo-notebook:2018-07-04
      args:
        - dask-worker
        - --nthreads
        - '2'
        - --no-bokeh
        - --memory-limit
        - 6GB
        - --death-timeout
        - '60'
      # FUSE needs elevated privileges inside the container
      securityContext:
        capabilities:
          add: [SYS_ADMIN]
        privileged: true
      volumeMounts:
        - name: s3
          mountPath: /s3
      resources:
        limits:
          cpu: "1.75"
          memory: 6G
        requests:
          cpu: "1.75"
          memory: 6G

We found out what the notebook pod was using by first doing:

kubectl get pods -n esip-dev | grep jupyter

to find my user pod, and then ran this command to dump the full pod spec to YAML:

kubectl get pod -o yaml -n esip-dev jupyter-rsignell-2dusgs > foo.yaml
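
From that dump, the settings worth stealing are the flexVolume entry, its volumeMount, and the securityContext. One way to pull them out of the dumped file (just a grep sketch; adjust the context lines as needed):

grep -A4 -e flexVolume -e securityContext -e volumeMounts foo.yaml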