@pibion let's try to decide what is the best strategy to host the data.
We cannot use the standard Openstack volumes because they do not support multi-attach to multiple instances.
1) One option is to use Manila on Jetstream, which provides an NFS service managed by OpenStack, so we don't have to run it ourselves. This gives a standard read/write filesystem we can mount on all pods.
2) Or we deploy our own NFS server; we could probably reuse the NFS server we already use for CVMFS to also serve this 50 GB volume read/write.
3) Better, especially for distributed computing with Dask, would be to use object store (like Amazon S3), which is automatically accessible by all pods. To make the best use of it we should store the data in Zarr, see https://zonca.github.io/2018/03/zarr-on-jetstream.html (a rough read sketch follows below).
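As a rough illustration of option 3, here is a minimal sketch of reading a Zarr array straight from object store with s3fs; the endpoint URL, the EC2-style credentials, and the data_store/example.zarr path are placeholders, not an existing dataset (see the linked post for how to create the credentials on Jetstream).

import s3fs
import zarr

# Placeholder EC2-style credentials and endpoint (guessed from the Swift URL
# that appears later in this thread); create real ones as in the linked post.
fs = s3fs.S3FileSystem(
    key="EC2_ACCESS_KEY",
    secret="EC2_SECRET_KEY",
    client_kwargs={"endpoint_url": "https://tacc.jetstream-cloud.org:8080"},
)

# Map a hypothetical Zarr store inside the container to a dict-like object
# and open it lazily; any pod with the credentials can do the same.
store = s3fs.S3Map("data_store/example.zarr", s3=fs)
data = zarr.open_array(store, mode="r")
print(data.shape, data.dtype)

Dask can open the same store in parallel across workers, which is what makes this option attractive for distributed computing.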
A few questions:
If you agree, I would like to try object store first; it is the most natural data store for a cloud deployment and is also common for Pangeo.
Can you load some sample data into object store on Jetstream? Log in to Horizon: https://iu.jetstream-cloud.org/project/containers/
Create a container, public if possible, then upload some raw and processed files.
Then can you provide a snippet of Python that reads both kinds of data into arrays (assuming local storage; I'll adapt it to read from object store)? A notebook is best: upload it to gist.github.com and link it here.
That sounds good. Public data means I'll need to coordinate with the collaboration; I'll let you know if that will take longer than a week.
Okay, a small data set is uploaded to a public container.
The repository linked below contains code that reads in CDMS data sets and also has examples. The tutorial that uses the uploaded data is examples/LoadandPlot.ipynb.
Repository: http://titus.stanford.edu:8080/git/summary/?r=Analysis/pyCAP.git
@zonca I have a user who's interested in working on a data set that's approximately a TB. I think the current allocation is for 500 GB. For now he's going to work on some smaller data sets, but I wanted to ask if a TB data set might be possible.
Is there another resource I should request for larger data sets?
@jlf599 Is space on the object store on Jetstream metered? If so, how do we ask for an allocation of a couple of TB?
@zonca -- the object store has quotas like block store does, though they are set separately. If the allocation doesn't have a storage allocation at all, they'll need to request it (http://wiki.jetstream-cloud.org/Requesting+additional+SUs+or+storage+for+Jetstream+--+Supplemental+Allocations). If they already have a storage allocation, they'll need to open a ticket requesting object store access and specifying how much of their storage quota they want dedicated to object store.
thanks @jlf599!
@pibion, please contact the XSEDE helpdesk and ask for an extra 2 TB on object store, keeping the 500 GB on block store.
@zonca do you know if this would be a "Jetstream Storage" supplemental request?
@pibion -- if you do not have storage for your allocation, yes, it would be.
@pibion you already have storage, so you should go through the help desk, not the supplemental request
@zonca thanks, I've submitted a request.
@zonca https://iu.jetstream-cloud.org/project/containers/ uses TACC login credentials, correct?
I'm trying to set up access so another person in our collaboration can add data.
yes, I think you should also add them to your XSEDE allocation, see https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/31391748/After+API+access+has+been+granted
It would be useful to first test with a smaller dataset, like 10/20 GB.
The most important thing is to modify your software so it can read data directly from object store; we do not want to download data from object store to local disk.
See an example at https://zonca.dev/2019/01/zarr-on-jetstream.html. If you share a short sample notebook that accesses your data, I can take a look.
@zonca my collaborator @ziqinghong has access to the data store, and now we're wondering if there's a way to upload data through the terminal.
Our raw data typically consists of many small files, and there doesn't appear to be a way to upload several files at once through the web interface. Maybe this is something the openstack API can help with?
If we can get something like globusonline integrated that'll be fantastic. Though I'm feeling that I'm dreaming too much... Thank you @zonca !
Yes, sure, the more direct way is the openstack Python client; see the first part of https://zonca.dev/2019/06/kubernetes-jupyterhub-jetstream-magnum.html:

openstack object create data_store local_file.root
openstack object create data_store/newfolder local_file.root

No need to create the folder in advance.

Otherwise, you can use any tool built for S3 and create EC2-style credentials following https://zonca.dev/2018/03/zarr-on-jetstream.html (you still need the openstack client).
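For the S3 route, here is a minimal upload sketch with boto3, assuming you have created EC2-style credentials as in that post; the endpoint URL, key values, and file names are placeholders:

import boto3

# Placeholder endpoint (guessed from the Swift URL used elsewhere in this
# thread) and EC2-style credentials created with the openstack client.
s3 = boto3.client(
    "s3",
    endpoint_url="https://tacc.jetstream-cloud.org:8080",
    aws_access_key_id="EC2_ACCESS_KEY",
    aws_secret_access_key="EC2_SECRET_KEY",
)

# Upload a local file into the data_store container; the "folder" is just
# part of the object key, so it does not need to exist beforehand.
s3.upload_file("local_file.root", "data_store", "newfolder/local_file.root")

Any other S3-capable tool pointed at the same endpoint with those credentials should behave the same way.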
Do you have a globusonline pro license, the one that supports HTTP endpoints? If so, we could try to use that directly...
@zonca @bloer @ziqinghong I don't believe we have a globusonline pro license.
@zonca we'll try the openstack python client, thanks for the information!
Also, @pibion @ziqinghong, don't spend too much time uploading data, just the minimum necessary for a reasonable test; we might find out that object store is too difficult to use from your software.
SLAC seems to have a pro license? Our data is optimized so that 1 second of data is one file. :-D 10 GB of data is thousands of files...
It would be useful to have a notebook in a gist that runs a typical analysis on a smallish but still meaningful dataset, and to have that dataset on object store both as individual files and as a single tgz, so we can benchmark running directly off object store against downloading the tgz and running locally. A rough timing sketch follows below.
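As a starting point for that benchmark, here is a rough timing sketch assuming s3fs with EC2-style credentials; the endpoint and object names are placeholders, and a real test would loop over the whole dataset and repeat the measurements.

import time
import s3fs

# Placeholder endpoint and credentials, as in the earlier sketch.
fs = s3fs.S3FileSystem(
    key="EC2_ACCESS_KEY",
    secret="EC2_SECRET_KEY",
    client_kwargs={"endpoint_url": "https://tacc.jetstream-cloud.org:8080"},
)

# Time streaming one raw file directly from object store.
start = time.perf_counter()
with fs.open("data_store/raw/09190321_1522_F0001.mid.gz", "rb") as f:
    f.read()
print("object store read:", time.perf_counter() - start, "s")

# Time reading a local copy of the same file for comparison.
start = time.perf_counter()
with open("09190321_1522_F0001.mid.gz", "rb") as f:
    f.read()
print("local disk read:", time.perf_counter() - start, "s")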
@zonca I'd recommend http://titus.stanford.edu:8080/git/summary/?r=Analysis/scdmsPyTools.git, specifically demo/IO/demoIO.ipynb.
That notebook works with the files already uploaded.
A word of warning - it seems that the very first imports that pull in CDMS python packages are failing. This is probably because the CVMFS environment doesn't install those. @bloer is the authority on this, though.
@pibion moved discussion about the Python environment to #12. We need to solve that before we keep working on data access.
Just a quick note on the transfer protocols SLAC supports:
While waiting on #12, I will try to run an IO test using the supercdms/cdms-jupyterlab:1.8b image on Jetstream, but outside of JupyterHub.
Sorry, this might be a naive question... If I copy a file to data_store, where can I see it in Jupyter? Thanks! @zonca
@zonca we're having some trouble using the openstack client to add files to the data store. I've installed openstack 3.16 into a Python 3.7 conda environment and get the following error when I run openstack object create:
(openstack316) aroberts@rhel6-64a:data> openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz
Unable to establish connection to https://tacc.jetstream-cloud.org:8080/swift/v1/data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522//pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz: ('Connection aborted.', OSError("(32, 'EPIPE')"))
I do get a list of available images when I try openstack image list.
@pibion it looks like it is using the path twice, both the remote folder and the local path, so I think you should cd into the folder and do:
openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ 09190321_1522_F0001.mid.gz
see openstack object create --help
@zonca the above command gives the error
[Errno 21] Is a directory: 'pyTools-reference-data/SLAC/R51/Raw/09190321_1522/'
Based on
usage: openstack object create [-h] [-f {csv,json,table,value,yaml}]
[-c COLUMN]
[--quote {all,minimal,none,nonnumeric}]
[--noindent] [--max-width <integer>]
[--fit-width] [--print-empty]
[--sort-column SORT_COLUMN] [--name <name>]
<container> <filename> [<filename> ...]
Upload object to container
positional arguments:
<container> Container for new object
<filename> Local filename(s) to upload
I also tried
(openstack316) aroberts@rhel6-64m:data> openstack object create data_store pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz
But got the error
Unable to establish connection to https://tacc.jetstream-cloud.org:8080/swift/v1/data_store/pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz: ('Connection aborted.', OSError("(32, 'EPIPE')"))
Can you run this one again:
openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ 09190321_1522_F0001.mid.gz
I seem to get the same error (I've cd'd into the directory containing the file):
(openstack316) aroberts@rhel6-64m:09190321_1522> openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ 09190321_1522_F0001.mid.gz
Unable to establish connection to https://tacc.jetstream-cloud.org:8080/swift/v1/data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522//09190321_1522_F0001.mid.gz: ('Connection aborted.', OSError("(32, 'EPIPE')"))
how big is the file?
Let's try with a better tool. I wrote some notes on how to use s3cmd: https://zonca.dev/2020/04/jetstream-object-store.html
Again, don't spend too much time on object store; we might discover we cannot use it. I am trying to run a test; it will take some time.
@zonca the file is 116 MB. I'll look into s3cmd.
OK, I uploaded the file you sent me with s3cmd to:
s3cmd ls s3://data_store/raw/*
2020-04-02 00:06 115M s3://data_store/raw/09190321_1522_F0001.mid.gz
Okay, I've now successfully uploaded a file as well, with the command
s3cmd put 09190321_1522_F0001.mid.gz s3://data_store/pyTools-reference-data/SLAC/R51/Raw/09190321_1522/
So it seems the issue was indeed with my key.
OK, I have tested data access myself. It looks like the getRawEvents(filepath, series) function has strong assumptions about accessing a POSIX filesystem, and the code looks quite complicated, so I would stop testing the object store here and switch to one of the other solutions.
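To make the constraint concrete: since getRawEvents expects a POSIX path, the only way to keep using object store would be to stage each file to local disk first, which is exactly the download step we wanted to avoid. Here is a hedged sketch of that workaround; the endpoint, credentials, object names, and the getRawEvents call pattern are illustrative guesses, not tested against the actual package.

import tempfile
import boto3

# Placeholder endpoint and EC2-style credentials, as in the earlier sketches.
s3 = boto3.client(
    "s3",
    endpoint_url="https://tacc.jetstream-cloud.org:8080",
    aws_access_key_id="EC2_ACCESS_KEY",
    aws_secret_access_key="EC2_SECRET_KEY",
)

# Stage one object into a temporary directory so a POSIX path exists.
with tempfile.TemporaryDirectory() as tmpdir:
    local_path = f"{tmpdir}/09190321_1522_F0001.mid.gz"
    s3.download_file("data_store", "raw/09190321_1522_F0001.mid.gz", local_path)
    # getRawEvents(tmpdir, series), imported from the repository linked above,
    # would then read the staged file from local disk.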
@pibion if they approved your extra storage, can you ask them to move it to block store instead? we will use a standard volume.
Okay, it looks like I need to fill out a supplement request for the additional storage. I'll get that going.
Data is currently all stored at SLAC. There is a "data catalog" python library that allows users to query for data paths. If they don't exist locally they're downloaded to disk.
Is it possible to have a storage disk that is mounted to everyone's container? For initial testing 50 GB would be more than enough. If we want to try to support full CDMS analysis efforts that's more like 10 TB.
Originally posted by @pibion in https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/issues/7#issuecomment-595911213