@pibion let's try to decide what is the best strategy to host the data.
We cannot use the standard Openstack volumes because they do not support multi-attach to multiple instances.
1) One option is to use Manila on Jetstream, which provides an NFS service managed by OpenStack, so we don't have to run it ourselves. This gives a standard read/write filesystem we can mount on all pods.
2) Or we deploy our own NFS server; we could probably reuse the NFS server we already use for CVMFS to also serve this 50 GB volume read/write.
3) Better, especially for distributed computing with Dask, would be to use object store (like Amazon S3), which is automatically accessible by all pods. To make the best use of it we should store the data in Zarr, see https://zonca.github.io/2018/03/zarr-on-jetstream.html (a rough read sketch follows below).
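As a rough illustration of option 3, here is a minimal sketch of reading a Zarr array straight from object store with s3fs; the endpoint URL, the EC2-style credentials, and the data_store/example.zarr path are placeholders, not an existing dataset (see the linked post for how to create the credentials on Jetstream).

import s3fs
import zarr

# Placeholder EC2-style credentials and endpoint (guessed from the Swift URL
# that appears later in this thread); create real ones as in the linked post.
fs = s3fs.S3FileSystem(
    key="EC2_ACCESS_KEY",
    secret="EC2_SECRET_KEY",
    client_kwargs={"endpoint_url": "https://tacc.jetstream-cloud.org:8080"},
)

# Map a hypothetical Zarr store inside the container to a dict-like object
# and open it lazily; any pod with the credentials can do the same.
store = s3fs.S3Map("data_store/example.zarr", s3=fs)
data = zarr.open_array(store, mode="r")
print(data.shape, data.dtype)

Dask can open the same store in parallel across workers, which is what makes this option attractive for distributed computing.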
A few questions:
If you agree, I would like to try object store first; it is the most natural data store for a cloud deployment and is also common for Pangeo.
Can you load some sample data into object store on Jetstream? Log in to Horizon: https://iu.jetstream-cloud.org/project/containers/
Create a container, public if possible, then upload some raw and processed files.
Then can you provide a snippet of Python that reads both kinds of data into arrays (assuming local storage; I'll adapt it to read from object store)? A notebook is best: upload it to gist.github.com and link it here.
That sounds good. Public data means I'll need to coordinate with the collaboration; I'll let you know if that will take longer than a week.
Okay, a small data set is uploaded to a public container.
The repository linked below contains code that reads in CDMS data sets and also has examples. The tutorial that uses the uploaded data is examples/LoadandPlot.ipynb.
Repository: http://titus.stanford.edu:8080/git/summary/?r=Analysis/pyCAP.git
@zonca I have a user who's interested in working on a data set that's approximately a TB. I think the current allocation is for 500 GB. For now he's going to work on some smaller data sets, but I wanted to ask if a TB data set might be possible.
Is there another resource I should request for larger data sets?
@jlf599 Is space on the object store on Jetstream metered? If so, how do we ask for an allocation of a couple of TB?
@zonca -- the object store has quotas like block store does, though they are set separately. If the allocation doesn't have a storage allocation at all, they'll need to request it (http://wiki.jetstream-cloud.org/Requesting+additional+SUs+or+storage+for+Jetstream+--+Supplemental+Allocations). If they already have a storage allocation, they'll need to open a ticket requesting object store access and specifying how much of their storage quota they want dedicated to object store.
thanks @jlf599!
@pibion, please contact the XSEDE helpdesk and ask for an extra 2 TB on object store, keeping the 500 GB on block store.
@zonca do you know if this would be a "Jetstream Storage" supplemental request?
@pibion -- if you do not have storage for your allocation, yes, it would be.
@pibion you already have storage, so you should go through the help desk, not the supplemental request
@zonca thanks, I've submitted a request.
@zonca https://iu.jetstream-cloud.org/project/containers/ uses TACC login credentials, correct?
I'm trying to set up access so another person in our collaboration can add data.
yes, I think you should also add them to your XSEDE allocation, see https://iujetstream.atlassian.net/wiki/spaces/JWT/pages/31391748/After+API+access+has+been+granted
It would be useful to first test with a smaller dataset, like 10/20 GB.
The most important thing is to modify your software so it can read data directly from object store; we do not want to download data from object store to local disk.
See an example at https://zonca.dev/2019/01/zarr-on-jetstream.html. If you share a short sample notebook that accesses your data, I can take a look.
@zonca my collaborator @ziqinghong has access to the data store, and now we're wondering if there's a way to upload data through the terminal.
Our raw data typically consists of many small files, and there doesn't appear to be a way to upload several files at once through the web interface. Maybe this is something the openstack API can help with?
If we can get something like globusonline integrated that'll be fantastic. Though I'm feeling that I'm dreaming too much... Thank you @zonca !
Yes, sure, the more direct way is the openstack Python client; see the first part of https://zonca.dev/2019/06/kubernetes-jupyterhub-jetstream-magnum.html:

openstack object create data_store local_file.root
openstack object create data_store/newfolder local_file.root

No need to create the folder in advance.

Otherwise, you can use any tool built for S3 and create EC2-style credentials following https://zonca.dev/2018/03/zarr-on-jetstream.html (you still need the openstack client).
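For the S3 route, here is a minimal upload sketch with boto3, assuming you have created EC2-style credentials as in that post; the endpoint URL, key values, and file names are placeholders:

import boto3

# Placeholder endpoint (guessed from the Swift URL used elsewhere in this
# thread) and EC2-style credentials created with the openstack client.
s3 = boto3.client(
    "s3",
    endpoint_url="https://tacc.jetstream-cloud.org:8080",
    aws_access_key_id="EC2_ACCESS_KEY",
    aws_secret_access_key="EC2_SECRET_KEY",
)

# Upload a local file into the data_store container; the "folder" is just
# part of the object key, so it does not need to exist beforehand.
s3.upload_file("local_file.root", "data_store", "newfolder/local_file.root")

Any other S3-capable tool pointed at the same endpoint with those credentials should behave the same way.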
Do you have a globusonline pro license, the one that supports HTTP endpoints? If so, we could try to use that directly...
@zonca @bloer @ziqinghong I don't believe we have a globusonline pro license.
@zonca we'll try the openstack python client, thanks for the information!
Also, @pibion @ziqinghong, don't spend too much time uploading data, just the minimum necessary for a reasonable test; we might find out that object store is too difficult to use from your software.
SLAC seems to have a pro license? Our data is optimized so that 1 second of data is one file. :-D 10 GB of data is thousands of files...
It would be useful to have a notebook in a gist that runs a typical analysis on a smallish but still meaningful dataset, and to have that dataset on object store both as individual files and as a single tgz, so we can benchmark running directly off object store against downloading the tgz and running locally. A rough timing sketch follows below.
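As a starting point for that benchmark, here is a rough timing sketch assuming s3fs with EC2-style credentials; the endpoint and object names are placeholders, and a real test would loop over the whole dataset and repeat the measurements.

import time
import s3fs

# Placeholder endpoint and credentials, as in the earlier sketch.
fs = s3fs.S3FileSystem(
    key="EC2_ACCESS_KEY",
    secret="EC2_SECRET_KEY",
    client_kwargs={"endpoint_url": "https://tacc.jetstream-cloud.org:8080"},
)

# Time streaming one raw file directly from object store.
start = time.perf_counter()
with fs.open("data_store/raw/09190321_1522_F0001.mid.gz", "rb") as f:
    f.read()
print("object store read:", time.perf_counter() - start, "s")

# Time reading a local copy of the same file for comparison.
start = time.perf_counter()
with open("09190321_1522_F0001.mid.gz", "rb") as f:
    f.read()
print("local disk read:", time.perf_counter() - start, "s")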
@zonca I'd recommend http://titus.stanford.edu:8080/git/summary/?r=Analysis/scdmsPyTools.git, specifically demo/IO/demoIO.ipynb.
That notebook works with the files already uploaded.
A word of warning - it seems that the very first imports that pull in CDMS python packages are failing. This is probably because the CVMFS environment doesn't install those. @bloer is the authority on this, though.
@pibion moved discussion about the Python environment to #12. We need to solve that before we keep working on data access.
Just a quick note on the transfer protocols SLAC supports:
While waiting on #12, I will try to run an IO test using the supercdms/cdms-jupyterlab:1.8b image on Jetstream, but outside of JupyterHub.
Sorry, this might be a naive question... If I copy a file to data_store, where can I see it in Jupyter? Thanks! @zonca
@zonca we're having some trouble using the openstack client to add files to the data store. I've installed openstack 3.16 into a Python 3.7 conda environment and get the following error when I run openstack object create:
(openstack316) aroberts@rhel6-64a:data> openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz
Unable to establish connection to https://tacc.jetstream-cloud.org:8080/swift/v1/data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522//pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz: ('Connection aborted.', OSError("(32, 'EPIPE')"))
I do get a list of available images when I try openstack image list.
@pibion it looks like it is using the path twice, both the remote folder and the local path, so I think you should cd into the folder and do:
openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ 09190321_1522_F0001.mid.gz
see openstack object create --help
@zonca the above command gives the error
[Errno 21] Is a directory: 'pyTools-reference-data/SLAC/R51/Raw/09190321_1522/'
Based on
usage: openstack object create [-h] [-f {csv,json,table,value,yaml}]
[-c COLUMN]
[--quote {all,minimal,none,nonnumeric}]
[--noindent] [--max-width <integer>]
[--fit-width] [--print-empty]
[--sort-column SORT_COLUMN] [--name <name>]
<container> <filename> [<filename> ...]
Upload object to container
positional arguments:
<container> Container for new object
<filename> Local filename(s) to upload
I also tried
(openstack316) aroberts@rhel6-64m:data> openstack object create data_store pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz
But got the error
Unable to establish connection to https://tacc.jetstream-cloud.org:8080/swift/v1/data_store/pyTools-reference-data/SLAC/R51/Raw/09190321_1522/09190321_1522_F0001.mid.gz: ('Connection aborted.', OSError("(32, 'EPIPE')"))
Can you run this one again:
openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ 09190321_1522_F0001.mid.gz
I seem to get the same error (I've cd'd into the directory containing the file):
(openstack316) aroberts@rhel6-64m:09190321_1522> openstack object create data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522/ 09190321_1522_F0001.mid.gz
Unable to establish connection to https://tacc.jetstream-cloud.org:8080/swift/v1/data_store/pyTools_reference_data/SLAC/R51/Raw/09190321_1522//09190321_1522_F0001.mid.gz: ('Connection aborted.', OSError("(32, 'EPIPE')"))
how big is the file?
Let's try with a better tool. I wrote some notes on how to use s3cmd: https://zonca.dev/2020/04/jetstream-object-store.html
Again, don't spend too much time on object store; we might discover we cannot use it. I am trying to run a test; it will take some time.
@zonca the file is 116 MB. I'll look into s3cmd.
OK, I uploaded the file you sent me with s3cmd to:
s3cmd ls s3://data_store/raw/*
2020-04-02 00:06 115M s3://data_store/raw/09190321_1522_F0001.mid.gz
Okay, I've now successfully uploaded a file as well, with the command
s3cmd put 09190321_1522_F0001.mid.gz s3://data_store/pyTools-reference-data/SLAC/R51/Raw/09190321_1522/
So it seems the issue was indeed with my key.
OK, I have tested data access myself. It looks like the getRawEvents(filepath, series) function has strong assumptions about accessing a POSIX filesystem, and the code looks quite complicated, so I would stop testing the object store here and switch to one of the other solutions.
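To make the constraint concrete: since getRawEvents expects a POSIX path, the only way to keep using object store would be to stage each file to local disk first, which is exactly the download step we wanted to avoid. Here is a hedged sketch of that workaround; the endpoint, credentials, object names, and the getRawEvents call pattern are illustrative guesses, not tested against the actual package.

import tempfile
import boto3

# Placeholder endpoint and EC2-style credentials, as in the earlier sketches.
s3 = boto3.client(
    "s3",
    endpoint_url="https://tacc.jetstream-cloud.org:8080",
    aws_access_key_id="EC2_ACCESS_KEY",
    aws_secret_access_key="EC2_SECRET_KEY",
)

# Stage one object into a temporary directory so a POSIX path exists.
with tempfile.TemporaryDirectory() as tmpdir:
    local_path = f"{tmpdir}/09190321_1522_F0001.mid.gz"
    s3.download_file("data_store", "raw/09190321_1522_F0001.mid.gz", local_path)
    # getRawEvents(tmpdir, series), imported from the repository linked above,
    # would then read the staged file from local disk.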
@pibion if they approved your extra storage, can you ask them to move it to block store instead? we will use a standard volume.
Okay, it looks like I need to fill out a supplement request for the additional storage. I'll get that going.
Data is currently all stored at SLAC. There is a "data catalog" python library that allows users to query for data paths. If they don't exist locally they're downloaded to disk.
Is it possible to have a storage disk that is mounted to everyone's container? For initial testing 50 GB would be more than enough. If we want to try to support full CDMS analysis efforts that's more like 10 TB.
Originally posted by @pibion in https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/issues/7#issuecomment-595911213