rsignell-usgs opened this issue 7 years ago
One option in the near future would be to use the HDF Group's HSDS (Highly Scalable Data Service) to store data (https://www.youtube.com/watch?v=EmnCz1Hg-VM), which uses S3 and would provide a single URL for each dataset that anyone could use.
Probably can't use this yet, I'm guessing.
@rsignell-usgs Here's an issue:
JupyterHub runs on a single node (a manager EC2 instance). For each user that logs in, JupyterHub spawns a Docker service (a single Docker container) for that user's notebook on a worker node (a worker EC2 instance). Currently, we have a single worker EC2 instance which all notebooks are created on as Docker containers. On that EC2 instance, it's easy enough to mount a host path to each Docker container and have the path shared between users.
So for example, on the EC2 instance we can have a directory `/share/notebooks/common_area`, and it can be mounted into each Docker container at `/srv/notebook/commons`.
Again, that's pretty simple when it comes to all notebook containers being on the single worker EC2 instance.
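For illustration, here's a minimal sketch of how that shared mount could be expressed in `jupyterhub_config.py`, assuming the plain DockerSpawner is in use (the Swarm/service-based spawner has an analogous mount option); the paths are the ones from the example above:

```python
# jupyterhub_config.py -- minimal sketch, assuming DockerSpawner
# (the `c` config object is provided by JupyterHub when it loads this file)
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Bind-mount the shared host directory into every user's notebook container
c.DockerSpawner.volumes = {
    "/share/notebooks/common_area": "/srv/notebook/commons",
}
```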
However, this is meant to be scaled. So now consider what happens when we expand to 2+ worker instances and a user's container can be placed on any one of them. Now we need a common area shared across all those instances. One solution is a networked filesystem between all the worker EC2 instances, which was already brought up in another issue (https://github.com/USGS-CMG/data-life-cycle-cloud-docker-jupyterhub/issues/11). However, I'm not sure what kind of performance we'd see from something like NFS if the model output is accessed with random reads/writes.
What format does the model output take? One large binary? Multi-part binary that can be indexed and read easily by modeler clients? Will the data be constantly updated? Will the data need to be duplicated in both the common area as well as the modeler's notebook in order to be worked on?
The output from a single simulation is usually a collection of netcdf3 files. These are binary, machine-independent files. For the testing with HSDS, we converted the output to netcdf4, which is really HDF5 under the hood. We then used `hsload` to load the HDF5 file onto S3 in a manner compatible with HSDS. This automatically chunks the data, and each chunk goes into a separate S3 object. See these notebooks for more details and examples: https://github.com/HDFGroup/hdfcloud_workshop cc @jreadey
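As a quick sanity check on what `hsload` produced, something like this should work from a notebook once h5pyd is installed (the domain path and variable name here are made up, and the endpoint/credentials are assumed to come from `~/.hscfg`):

```python
import h5pyd  # pip install git+https://github.com/HDFGroup/h5pyd.git

# Hypothetical domain and variable names; endpoint and credentials come from ~/.hscfg
f = h5pyd.File("/home/myuser/model_output.h5", "r")
dset = f["zeta"]
print(dset.shape, dset.dtype, dset.chunks)  # hsload picked the chunk layout;
                                            # each chunk is stored as its own S3 object
f.close()
```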
@rsignell-usgs So if two users are working with 100GB of data and it's in S3, does that mean that each user would need to duplicate 100GB of that data in their notebook to work with it, or is the process more of a "pull data chunk, process, delete, pull next chunk" type of flow?
I'm wondering if scaling users also means having to scale storage space at the same rate if using remotely accessed S3 data.
@isuftin , exactly right: it's a "pull data chunk, process, delete, pull next chunk" type of flow. And HSDS caches requests as well, so it doesn't always have to go back to S3. I'm not sure how the cache is controlled (e.g. how long and how much), though.
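On the client side the access pattern looks roughly like the sketch below (hypothetical domain and variable names): each slice is pulled over HTTP as a numpy array, processed, and then discarded, so nothing accumulates on local disk.

```python
import h5pyd

# Hypothetical domain and variable names; credentials come from ~/.hscfg
with h5pyd.File("/home/myuser/model_output.h5", "r") as f:
    zeta = f["zeta"]                      # e.g. dimensions (time, node)
    running_max = None
    for t in range(zeta.shape[0]):
        step = zeta[t, :]                 # pull one time step from HSDS
        m = float(step.max())             # process it...
        running_max = m if running_max is None else max(running_max, m)
print("max over all time steps:", running_max)
```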
@rsignell-usgs Good. So that means that the local Docker container volume sizes won't have to scale like crazy.
@rsignell-usgs Are you currently able to run through most of the installation for https://github.com/HDFGroup/hdfcloud_workshop in your notebook?
@isuftin, you mean on our CHS JupyterHub, right? I haven't tried that yet, but it's a good idea. I'll try it now!
@rsignell-usgs I assume you might get up to the point of plugging in IPs and user/passwords. And I assume for that, it requires a server serving the data? (THREDDS?). I'm trying to connect the pieces.
@isuftin , yes, this requires a username/password because we are actually accessing data from the HDF Group's AWS account when we use those notebooks. For now, we could experiment with data stored on their HSDS services, but eventually, yes, we would want to install HSDS on CHS, as we have a requirement to store data on USGS-controlled computer systems. And to see if/when that might be feasible, we need to hear from @jreadey.
@rsignell-usgs Gotcha. Then this means you could try this all the way through. Eventually, we will probably want to include this library you're installing as part of the base image if you expect users to be doing this all the time.
@isuftin , yes, we can test all the way through. And it also means that you don't need to do anything on this front just now! :smile_cat:
👍
Hmmm.... perhaps the service is currently down or has changed. I'll contact @jreadey.
I did:

```
conda env create -f hsds.yaml
```

where `hsds.yaml` is:
```yaml
# hsds environment file
name: hsds
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.6
  - h5py
  - jupyter
  - nb_conda_kernels
  - pytz
  - urllib3
  - requests
  - pip:
    - git+https://github.com/HDFGroup/h5pyd.git
```
But the `hsinfo` command failed:
```
(hsds) jovyan@80fe71604b40:~/github$ hsinfo -e http://52.25.101.15:5101
endpoint: http://52.25.101.15:5101
2017-08-25 18:02:32,976 connection error: HTTPConnectionPool(host='52.25.101.15', port=5101): Max retries exceeded with url: /about (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faaf9700da0>: Failed to establish a new connection: [Errno 110] Connection timed out',))
Error: Connection Error
```
I just spoke with John, who is on vacation in Greece. He says that server is probably down at the moment because NREL (the sponsor) moved it to their resources. He can fix it when he returns, but unfortunately that's after Labor Day. I asked about installing it ourselves, and it seems the HDF Group isn't releasing the source code yet. It's in a private GitHub repo, and to get access we would have to sign an NDA. I'm not sure we can even do that. :hankey:
💩 indeed. I wonder what the HDF Group's timeline is for releasing it.
On the bright side, Labor Day is pretty close and we've gotten this far.
Hey I'm back...
The NREL instance is back online, but the storage bucket has been changed. We're in the midst of moving the instance from the HDF Group AWS account to the NREL account, so things are a bit wonky.
@rsignell-usgs - you are an XSEDE Jetstream user, correct? I've just set up HSDS on Jetstream (using Ceph object storage). Would you be OK with using that as a test instance? It should be a bit more stable, and there are no billing issues to worry about.
In your `.hscfg` file, change the `hs_endpoint` line to this: `hs_endpoint = http://149.165.157.109:5101`
Your username and password will be the same.
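For reference, the whole `~/.hscfg` would then look something like this (with your own credentials in place of the placeholders):

```
# ~/.hscfg
hs_endpoint = http://149.165.157.109:5101
hs_username = <your_username>
hs_password = <your_password>
```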
@jreadey , welcome back! :smile_cat:
I tried out your Jetstream HSDS instance from Jetstream, and it works fine: https://gist.github.com/rsignell-usgs/cb4cc8781c1374173a54fe3b716ff291 I have to figure out some meaningful tests, because these are clearly too fast to gain any insight from, but at least it's interesting that OPeNDAP and HSDS access are in the same ballpark.
These timings are from the first time the data is requested, before it's been cached.
It would still be nice to have an AWS instance to try, though, since the USGS Cloud is on AWS, just for the sake of comparison with HSDS access from Jetstream.
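A rough sketch of what a more meaningful side-by-side test could look like, with made-up URLs, domain, and variable name (assumes `xarray` with an OPeNDAP-capable backend for the THREDDS side and `h5pyd` for the HSDS side):

```python
import time

import h5pyd            # HSDS access (endpoint/credentials read from ~/.hscfg)
import xarray as xr     # OPeNDAP access (needs the netCDF4 or pydap backend)

# Hypothetical URLs, domain, and variable name, just to show the shape of a test
OPENDAP_URL = "http://example.org/thredds/dodsC/model_output.nc"
HSDS_DOMAIN = "/home/myuser/model_output.h5"

t0 = time.perf_counter()
ds = xr.open_dataset(OPENDAP_URL)
_ = ds["zeta"][0, :].values              # pull one time step via OPeNDAP
print("OPeNDAP: %.2f s" % (time.perf_counter() - t0))

t0 = time.perf_counter()
with h5pyd.File(HSDS_DOMAIN, "r") as f:
    _ = f["zeta"][0, :]                  # pull the same slice via HSDS
print("HSDS:    %.2f s" % (time.perf_counter() - t0))
```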
We would like modelers to be able to upload large amounts of model output, and have that output be accessible by other JupyterHub users.
When I say large output, I mean the output of each model run is often 10-100 GB, so perhaps a reasonable starting point would be 5 TB of storage? That would hold on the order of 50-500 runs.