charlesbluca opened this issue 5 years ago
I'm currently working on the CMIP6 data browser hosted here; I would like to be able to point to the master version of CMIP6's associated JSON file and any others that are currently out there, and there are a few options for where these files could be stored.
We currently have a "catalogs" directory in our project template: https://github.com/cmip6hack/project-template, but this doesn't seem like the optimal place, since the collections have broader appeal. It also doesn't have to be where the CSV files are stored, considering they may be updated more often.
@jhamman, @andersy005: should the "catalogs" directory be a subtree?
Do we want to have an online browser for NCAR's data stored on GLADE? This wouldn't allow users to interact with the data but could still give a good overview of what data is available on disk vs. cloud.
@charlesbluca, this would be awesome. One thing we can do is to expose the CSV file from a public FTP server. Would this work with the browser you are working on?
Currently this file is more than 100 MB, which makes it impossible to store in a GitHub repository.
abanihi at casper02 in /glade/collections/cmip/catalog
$ ls -ltrh
total 121M
-rw-r--r-- 1 abanihi cmipdata 121M Oct 8 15:21 glade-cmip6.csv
I am not sure what's the right approach to adopt for these massive catalog files.
> Currently this file is more than 100 MB, which makes it impossible to store in a GitHub repository.
😱
Curious how much it compresses if you gzip it? We can easily open a gzipped CSV from pandas; not sure about JavaScript. A more efficient and fully compatible storage format would be Parquet.
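For what it's worth, both paths are short in pandas; a minimal sketch, assuming the file names used elsewhere in this thread:

import pandas as pd

# pandas infers gzip compression from the .gz suffix, so no extra options are needed
df = pd.read_csv("glade-cmip6.csv.gz")

# Parquet is column-oriented and compressed; requires pyarrow or fastparquet
df.to_parquet("glade-cmip6.parquet")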
It's important to distinguish between the JSON file and the CSV file. The JSON file is tiny and can live anywhere. The CSV file is potentially huge and needs to be updated frequently.
I would vote for putting the JSON files in https://github.com/pangeo-data/pangeo-datastore. They can point to CSV files elsewhere on the web.
To expand a bit, what I would really like is to be able to point the hackathon participants to a single location on the web where they can view all of the data and decide whether to use cloud or cheyenne.
The CSV files tend to compress very well; factors of 20 or more reduction.
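As a concrete example of the JSON/CSV split, here is roughly how intake-esm consumes the pair; the JSON URL below is hypothetical, but intake.open_esm_datastore is the real entry point:

import intake  # intake-esm registers the open_esm_datastore driver

# The tiny JSON spec can live in any git repo; its catalog entry points at the
# large CSV hosted elsewhere (cloud bucket, FTP, ...). Hypothetical URL:
col = intake.open_esm_datastore("https://example.org/catalogs/glade-cmip6.json")
print(col.df.head())  # the referenced CSV, loaded as a pandas DataFrame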
For provenance purposes, it seems like it would be ideal to have the JSON and CSV in a project repo, perhaps as a subtree. pangeo-datastore has a lot of other stuff, however, and is explicitly cloud-focused.
Would it make sense to have a CMIP-datastore repo?
> Curious how much it compresses if you gzip it?
It went from 120MB to 4MB 😀
abanihi at casper03 in /glade/collections/cmip/catalog
$ ls -ltrh
total 4.2M
-rw-r--r-- 1 abanihi cmipdata 4.2M Oct 8 15:21 glade-cmip6.csv.gz
I think the CSV files are too big to put in git / github. We are talking about millions of rows.
@charlesbluca - do you know if your JavaScript stuff can handle opening a .csv.gz file?
Not sure! I’m sure I can find a way to decompress the gzipped file before parsing if PapaParse doesn’t do this natively.
Anderson, I should be able to parse a CSV if it is made available via FTP; if you drop a path where it can be accessed here I can make a page to represent the data available on GLADE (with a view of the metadata based on the example JSON file in this repo).
@charlesbluca, for the time being, I've placed the most recent catalogs (CSVs) for the CMIP6 data on GLADE here: https://github.com/NCAR/intake-esm-datastore/tree/master/catalogs. You will notice that there are two CSV files, namely glade-cmip6-dcpp.csv.gz and glade-cmip6.csv.gz. The main reason for this is that the decadal prediction (dcpp) experiments catalog has an additional column, start_year, which is not present in the rest. We may need to split Pangeo's CSV catalog into two catalogs in order to accommodate the dcpp experiments (at least, this will be necessary for intake-esm to properly load the dcpp data into xarray). @naomi-henderson, what do you think?
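A rough sketch of the kind of split being proposed, assuming a combined catalog in which only the dcpp rows have start_year populated (the output file names are hypothetical):

import pandas as pd

df = pd.read_csv("pangeo-cmip6-zarr-consolidated-stores.csv")

# dcpp rows carry a start_year; all other experiments leave it empty
dcpp = df[df["start_year"].notna()]
other = df[df["start_year"].isna()].drop(columns=["start_year"])

dcpp.to_csv("pangeo-cmip6-dcpp.csv", index=False)
other.to_csv("pangeo-cmip6.csv", index=False)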
I personally think that GitHub may be a better alternative to an FTP server, since we can version-control the catalog and retrieve old versions in case something goes wrong.
For now, I am planning on keeping these csv files up to date by making sure that they are in sync with the copies stored on Glade.
Notebook used to build the catalog: https://nbviewer.jupyter.org/github/NCAR/intake-esm-datastore/blob/master/builders/cmip6_catalog_builder.ipynb
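The notebook has the authoritative logic; for intuition only, here is a stripped-down sketch of the same idea, assuming the files on GLADE follow the standard CMIP6 DRS directory layout (the root path and column list are assumptions):

import pathlib
import pandas as pd

root = pathlib.Path("/glade/collections/cmip/CMIP6")  # assumed root on GLADE
columns = ["activity_id", "institution_id", "source_id", "experiment_id",
           "member_id", "table_id", "variable_id", "grid_label", "version"]

rows = []
for path in root.rglob("*.nc"):
    parts = path.relative_to(root).parts
    if len(parts) == len(columns) + 1:  # nine directory levels plus the filename
        rows.append(dict(zip(columns, parts), path=str(path)))

pd.DataFrame(rows).to_csv("glade-cmip6.csv.gz", index=False, compression="gzip")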
@andersy005 it turns out that uncompressing gzip in pure JavaScript is harder than it seems; is there any server the unzipped CSV could be provided from? We can work out a way to handle gzipped files in the future, but for now it would be faster to work with plain CSV.
We could post these on ftp://ftp.cgd.ucar.edu/archive/aletheia-data
That works! Anywhere it is convenient to host the uncompressed CSV files.
@charlesbluca, the uncompressed catalog resides here:
ftp://ftp.cgd.ucar.edu/archive/aletheia-data/intake-esm-datastore/catalogs/glade-cmip6.csv
I am excited about the browser you are working on. Let me know if you have any questions or have issues accessing the CSV.
Thank you! Feel free to leave up the gzipped version, hopefully with more time I can work on a way to process through gzipped CSV.
You are welcome!
The gzipped version is still available from the same directory:
(base) -bash-4.2$ ls -ltrh
total 147M
-rw-r--r-- 1 abanihi cgdaletheia 6.1M Oct 14 11:42 glade-cmip6.csv.gz
-rw-r--r-- 1 abanihi cgdaletheia 2.3K Oct 14 11:42 glade-cmip6.json
-rw-r--r-- 1 abanihi cgdaletheia 2.1K Oct 14 11:42 pangeo-cmip6.json
-rw-r--r-- 1 abanihi cgdaletheia 141M Oct 14 11:43 glade-cmip6.csv
Looks like I neglected CORS; to access the catalog files through JavaScript, they will need to be hosted via HTTP/HTTPS - the buckets may be the best place to host the catalogs for now.
@rabernat, do you have any more information on how to give files hosted in the bucket the proper CORS headers? I uploaded the GLADE catalog to the pangeo-cmip6 bucket, but it hasn't seemed to inherit the Access-Control-Allow-Origin header that allows us to use it for the web browser.
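In case it helps, the bucket-wide CORS policy can be set with the google-cloud-storage Python client; a sketch, with the policy values below being guesses rather than a known-good configuration:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("pangeo-cmip6")  # bucket name from this thread
bucket.cors = [{
    "origin": ["*"],  # or restrict to the browser's domain
    "method": ["GET", "HEAD", "OPTIONS"],
    "responseHeader": ["Content-Type", "Content-Encoding"],
    "maxAgeSeconds": 3600,
}]
bucket.patch()  # push the updated CORS policy to GCS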
After some trial and error, I now know that our CSV parser is capable of handling gzipped files, but it needs the proper header (Content-Encoding) provided alongside the file so it knows to decompress it.
This is relatively simple with Google Cloud; you can edit the metadata for individual files to adjust and add different headers, and adding "Content-Encoding: gzip" to the GLADE catalog allowed it to automatically be decompressed and parsed in a fraction of the time it would've taken to load in the uncompressed catalog.
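That metadata edit can also be scripted instead of done in the console; a sketch with the Python client (the object name is an assumption):

from google.cloud import storage

client = storage.Client()
blob = client.bucket("pangeo-cmip6").blob("glade-cmip6.csv.gz")  # assumed object name
blob.content_encoding = "gzip"  # lets clients decompress transparently
blob.content_type = "text/csv"  # the decoded payload is plain CSV
blob.patch()  # update the object's metadata in place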
Is there a way to control the headers of files being hosted via GitHub? This seems to be the primary difference between hosting on GitHub versus Google Cloud (other than limitations on file size), and if we could find a way to do this, it might be a "best of both worlds" solution to our problem of where to host the catalogs.
How about we store the GLADE catalog alongside the Pangeo catalog in Google Cloud and worry about the GitHub/headers issue after the hackathon?
I see that the Google Cloud bucket now has two items:
@charlesbluca - is this working? Do you need me to set any CORS or mime-type properties on the google cloud bucket?
@rabernat and @charlesbluca - Uh oh, I hope we are not working at cross-purposes here. I have been keeping this original catalog (in gs://pangeo-cmip6) in sync with the current catalog (in gs://cmip6) so that Ryan's old notebooks will continue to point to a valid catalog. If you need to make changes to https://storage.googleapis.com/pangeo-cmip6/pangeo-cmip6-zarr-consolidated-stores.csv (which lives in gs://pangeo-cmip6), let me know and I will stop updating it.
@naomi-henderson - I think you're fine. Thanks for keeping everything up to date!
Made a PR to the datastore to add the GLADE browser - while the site is building, these changes can be viewed here.
The GLADE catalog is now live here:
Awesome work @charlesbluca! It's fantastic to see this milestone.
A few comments:
@charlesbluca, the catalog browser looks pretty cool!
@rabernat That was one of the first issues I noticed. In the long term, the CSV parser has a way to process the CSV file in chunks, so we could display the spreadsheet view before actually loading in all of the rows; my only reason for not using it now is that it seems to involve more complex CORS options than we currently have set on the GCS bucket.
@charlesbluca - that sounds like a great idea.
> it seems to involve more complex CORS options than we currently have set on the GCS bucket.
Can you give more details? What do we need to tweak to enable this capability?
It seems like when CSV chunking is enabled, instead of sending a GET request to Google Cloud, the parser sends an OPTIONS request; I tried adding OPTIONS to the list of allowed methods for our pangeo-cmip6 bucket, but still got an error stating that the Access-Control-Allow-Origin header was missing.
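For debugging, the preflight can be reproduced outside the browser; a sketch with requests (the origin and requested headers are illustrative):

import requests

url = ("https://storage.googleapis.com/pangeo-cmip6/"
       "pangeo-cmip6-zarr-consolidated-stores.csv")
resp = requests.options(url, headers={
    "Origin": "https://example.org",  # illustrative origin
    "Access-Control-Request-Method": "GET",
    "Access-Control-Request-Headers": "range",  # chunked reads use Range requests
})
# A correctly configured bucket echoes the allowed origin; otherwise the header is absent
print(resp.status_code, resp.headers.get("Access-Control-Allow-Origin"))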
Very nearly done with the basic aspects of the viewer! One suggestion is that the JSON collection specifications use the direct link to the catalogs stored on Google Cloud rather than their "path"; i.e. gs://cmip6/cmip6-zarr-consolidated-stores.csv becomes https://storage.googleapis.com/pangeo-cmip6/pangeo-cmip6-zarr-consolidated-stores.csv. This will allow us to get the view of the CSV entirely from attributes of the JSON file, rather than having to hardcode it.
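The translation is mechanical for public objects, so the browser could also derive it on the fly; a tiny helper illustrating the generic same-bucket case (note that the example above also switches buckets, which this does not cover):

def gs_to_https(uri: str) -> str:
    # gs://<bucket>/<object> -> https://storage.googleapis.com/<bucket>/<object>
    assert uri.startswith("gs://"), "expected a gs:// URI"
    return "https://storage.googleapis.com/" + uri[len("gs://"):]

print(gs_to_https("gs://cmip6/cmip6-zarr-consolidated-stores.csv"))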
Excited to make a PR and get this moving!
This looks great Charles! Please go ahead with another PR to pangeo-datastore whenever you're ready!
@charlesbluca, the latest version of the catalog is at https://github.com/NCAR/intake-esm-datastore/blob/master/catalogs/glade-cmip6.csv.gz. You may need to update the browser to use this new version of the catalog.
Sure! I will need to move this gzipped catalog into pangeo-cmip6 so it can be served with the proper Content-Encoding, is that okay?