det-lab / jupyterhub-deploy-kubernetes-jetstream

CDMS JupyterHub deployment on XSEDE Jetstream

host a database on XSEDE and access from jupyter environment? #61

Closed: pibion closed this issue 2 years ago

pibion commented 2 years ago

We have a large database that holds metadata for a big chunk of our data, and it's necessary for analysis. Does XSEDE host databases? And would it be possible to access one from within our CDMS environment?

I've also reached out to the OSN to see if they have a service that would help here!

pibion commented 2 years ago

Update: the OSN does not have a service that helps with this (although I have heard that they can run containers, so maybe with more effort it'd be possible).

zonca commented 2 years ago

It would be quite straightforward, and also flexible, to deploy a container with a PostgreSQL database inside Kubernetes. However, let me first ask whether Jetstream already has a hosted solution.

What are the requirements in terms of storage? Just the order of magnitude: a few GB, a few hundred GB, or a few TB?

Is there any other specific requirement on the database? Do you need redundancy? High availability? Backups?

zonca commented 2 years ago

Also, are you really running database operations like join queries? Or could you store that data in an object store?

pibion commented 2 years ago

The queries we run tend to look like FOR VALUE IN RANGE(LOW, HI) RETURN TEMP1. The database is structured as a set of simple tables, each with many fields, and we tend to query only one of those tables at a time. Does that sound like it would work in an object store?

The storage requirement is on the order of hundreds of GB.

We have backups elsewhere. High availability and support for many connections would be nice, but I'm not sure how many people will be accessing it - we might have 1 to 5 analyzers connected at any given time. Database knowledge in our collaboration is slim, though, and people may make queries that return a large dataset.
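
In SQL terms, that pattern is a simple range scan. A minimal runnable sketch, using an in-memory SQLite table as a stand-in (the table and column names are made up, not our real schema):

```python
import sqlite3

# Stand-in table with hypothetical names; the real schema is not shown here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fridge_data (value REAL, temp1 REAL)")
con.executemany(
    "INSERT INTO fridge_data VALUES (?, ?)",
    [(0.1, 4.2), (0.5, 4.3), (0.9, 4.1)],
)

# The FOR VALUE IN RANGE(LOW, HI) RETURN TEMP1 pattern as parameterized SQL.
low, high = 0.2, 1.0
rows = con.execute(
    "SELECT temp1 FROM fridge_data WHERE value BETWEEN ? AND ?",
    (low, high),
).fetchall()
print(rows)  # [(4.3,), (4.1,)]
```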

zonca commented 2 years ago

What about a columnar data format in the object store? One of the most optimized is Parquet, and you can access it directly with pandas: https://arrow.apache.org/docs/python/parquet.html

Or Zarr, https://zarr.readthedocs.io/en/stable/, which is also easily read with dask.

Maybe it is worth doing a test with a typical dataset and comparing performance with the existing database. I think the main concern is whether you would be accessing tiny pieces of huge files.
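
As a minimal sketch of the Parquet approach (the file name and column names are the same hypothetical stand-ins as above), pyarrow can push the range filter down so that only row groups overlapping the requested range are read from storage:

```python
import pyarrow.parquet as pq

# Hypothetical file and column names. The filter is pushed down, so row
# groups whose "value" statistics fall outside [0.2, 1.0] are skipped.
table = pq.read_table(
    "fridge_data.parquet",
    columns=["temp1"],
    filters=[("value", ">=", 0.2), ("value", "<=", 1.0)],
)
df = table.to_pandas()
```

A Zarr version with dask would look similar, with the range selection expressed on a dask array or dataframe.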

zonca commented 2 years ago

If it is easier, you can give me a dump of one of those tables plus a Python script with a typical data access pattern, along with the expected performance. I can then test the columnar data formats.

pibion commented 2 years ago

@zonca a columnar data format is an interesting idea! @thathayhaykid is working on uploading a dump of the entire database to the OSN - would that work for you? We could also do a dump of just the fridge data. We have two tables that we use primarily.

It's possible we're in the "tiny piece of a huge file" regime, but I suspect we could avoid that easily. I'll see if I can bring the student who will be using this data for analysis into this conversation.

zonca commented 2 years ago

Sure, the OSN works. A sample script is also going to be necessary. It just needs to access the data; it doesn't need to do the actual data processing.

pibion commented 2 years ago

@zonca a script is no problem - it may take a week or two, though. My group is just starting to look at this data.

The query is pretty simple, so I'm not worried this time estimate will slip too much.

pibion commented 2 years ago

@sukeerthiD this conversation might be of interest!

zonca commented 2 years ago

@pibion any update on this?

pibion commented 2 years ago

@zonca SLAC was able to get some of the tables accessible on their system. @thathayhaykid is working on getting the full database from FNAL (this takes a while because we need someone at FNAL to run the command).

If this can be accessed at SLAC we may not need it web-accessible - I think it makes sense to wait and see. Should I close this issue and re-open if it turns out we do need it?

zonca commented 2 years ago

Better to just wait and leave this open for tracking purposes.

pibion commented 2 years ago

@zonca we are again seeing a need for this information on Jetstream. It would be ideal to keep it as a database, because then we could use the scripts that fetch the information without modification.

But if you think that hosting it as a file would be significantly easier, we can do that - the operations aren't super complex, and I think it's better to have the data available and have to re-write scripts than to be totally stuck because we don't have the data at all.

What do you think about the difficulty of hosting a file vs. hosting a database?

zonca commented 2 years ago

What database software do you have at SLAC? How are the scripts written (language, framework if any)?

pibion commented 2 years ago

It's not too complex. The database is Oracle, and we use a Python SQL library, SQLAlchemy.

zonca commented 2 years ago

ok, good - if you are going through SQLAlchemy, changing the backend database should not be a problem.

So we could install PostgreSQL inside Kubernetes on Jetstream 2, with a volume attached.

Then you can dump the content of Oracle to a SQL file, either with https://ora2pg.darold.net/start.html, via SQLAlchemy, or any other way the Oracle admins recommend, and I can then import it into PostgreSQL.
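
From the script side, the swap is just a different connection URL, since SQLAlchemy abstracts the dialect. A sketch with placeholder hosts and credentials (table and column names are the same hypothetical stand-ins as earlier; the queries themselves stay the same):

```python
from sqlalchemy import create_engine, text

# Placeholder URLs and credentials. Only the connection string changes when
# moving between backends; the dialect prefix selects the driver.
# engine = create_engine("oracle+cx_oracle://user:password@oracle-host:1521/?service_name=cdms")
engine = create_engine("postgresql+psycopg2://user:password@postgres-host:5432/cdms")

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT temp1 FROM fridge_data WHERE value BETWEEN :low AND :high"),
        {"low": 0.2, "high": 1.0},
    ).fetchall()
```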

zonca commented 2 years ago

For testing, maybe you could get a dump of the first 100 rows of each table, or something similar.

pibion commented 2 years ago

We should be able to get a dumpfile of the first X entries in each table. This will take me a little while, though - I'm currently trying to get access to the system that hosts the database. :/

pibion commented 2 years ago

Success! See file at https://drive.google.com/file/d/1M1ngnhk-I-PtkfcXPF50IxmHiqBL9Cx9/view?usp=sharing.

zonca commented 2 years ago

It seems this is a dump from MariaDB and not Oracle. Should I use a MariaDB container, then?

pibion commented 2 years ago

@zonca probably, but maybe wait a bit for confirmation? I'm checking with our database-adjacent folks to make sure MariaDB is what's expected here.

pibion commented 2 years ago

@zonca update: the original database is MySQL, and we're not yet sure what was used to host it at SLAC (which is where I made the current dumpfile). How do you feel about trying a MariaDB container?

zonca commented 2 years ago

ok, it is deployed on Jetstream and I've imported the dump; the usual notes are here:

https://zonca.dev/2022/06/mariadb-jetstream2-kubernetes.html

I can connect via SQLAlchemy and I see the data. I put the notebook in the "secrets" repo because it contains the password in cleartext:

https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets/blob/master/mariadb/access_mariadb.ipynb

In any case, the database is not accessible from outside the cluster.
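
The connection from a notebook looks roughly like the following sketch; the service hostname, credentials, and database name are placeholders, and the real values live in the private notebook linked above:

```python
from sqlalchemy import create_engine, text

# Placeholder host and credentials; the real connection details are kept
# in the private "secrets" notebook.
engine = create_engine("mysql+pymysql://user:password@mariadb:3306/cdms")

with engine.connect() as conn:
    print(conn.execute(text("SHOW TABLES")).fetchall())
```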

pibion commented 2 years ago

@zonca fantastic, I'll be able to test it on Thursday morning. The full database is large (at least 80 GB?) and its dumpfile is already on the OSN.

zonca commented 2 years ago

Can you please copy it to the data folder on Jetstream2?

zonca commented 2 years ago

Also, how big should I make the volume? Is 100 GB enough for the data, or is it going to grow?

pibion commented 2 years ago

@zonca yes, I'll copy it to the data folder on Jetstream2.

The data won't grow - it's from an experiment that has been decommissioned (though the data is still being analyzed). But I'm not sure how large the unpacked file will be, so maybe 200 GB?

pibion commented 2 years ago

@zonca it would be easier to copy the file from the OSN if we had rclone available. I can't seem to find the appropriate Dockerfile (?) to edit - could you point me to it?

zonca commented 2 years ago

See https://github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/blob/master/DEPLOY.md#jupyter-notebook-single-user-image

Note that there is no longer any automatic building of the image, so I will have to build it manually on my machine, push it, and then update the deployment.

Anyway, that's the same thing I just did to install the MariaDB drivers.

pibion commented 2 years ago

@zonca I think the dumpfile is too large for the remaining space on our data volume! I tried

rclone copy OSN:supercdms-data/backups/SoudanTableDump backups/SoudanTableDump

but got the error

2022/06/16 23:07:31 ERROR : Attempt 1/3 failed with 1 errors and: multipart copy: write failed: write /cvmfs/data/backups/SoudanTableDump/SoudanTableDump: no space left on device

zonca commented 2 years ago

How big is it?

pibion commented 2 years ago

@zonca just over 115 GB.

zonca commented 2 years ago

ok, it is a bit convoluted: I created a 150 GB volume in Kubernetes and mounted it in a Python container where I installed the AWS client, and I am now downloading the dump. Next, I'll mount that volume in a MariaDB client container and try loading the dump into the DB.
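
For the download step, one way to do it from Python against the OSN's S3-compatible API is a sketch like this; the endpoint URL, credentials, and local mount path are placeholders, while the bucket and key come from the rclone command earlier in the thread:

```python
import boto3

# Placeholder endpoint and credentials for the OSN's S3-compatible API.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<osn-endpoint>",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

# Bucket and key match the rclone path used above; the destination is
# wherever the 150 GB volume is mounted in the container.
s3.download_file(
    "supercdms-data",
    "backups/SoudanTableDump/SoudanTableDump",
    "/data/SoudanTableDump",
)
```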

zonca commented 2 years ago

@pibion ok, it worked fine - I see 142 million rows in one of the tables. See my test, with details on how to connect:

https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets/blob/master/mariadb/access_mariadb.ipynb

I'll keep the volume with the dump around in case we need to restore.

zonca commented 2 years ago

@pibion can you access the db?

zkromerUCD commented 2 years ago

@zonca I was able to run the Jupyter notebook and access the database; the results from my running it look the same as yours.

zonca commented 2 years ago

Very good. I will close this issue. If you have any problems or need some other feature, please open a separate issue.