charlesbluca opened this issue 5 years ago
I'm currently working on the CMIP6 data browser hosted here; I would like to be able to point to the master version of CMIP6's associated JSON file and any others that are currently out there, and there are a few options for where these files could be stored.
We currently have a "catalogs" directory in our project template: https://github.com/cmip6hack/project-template, but this doesn't seem like the optimal place, since the collections have broader appeal. It also doesn't have to be where the CSV files are stored, considering they may be updated more often.
@jhamman, @andersy005: should the "catalogs" directory be a subtree?
Do we want to have an online browser for NCAR's data stored on GLADE? This wouldn't allow users to interact with the data but could still give a good overview of what data is available on disk vs. cloud.
@charlesbluca, this would be awesome. One thing we can do is to expose the CSV file from a public FTP server. Would this work with the browser you are working on?
Currently this file is more than 100 MB, which makes it impossible to store in a GitHub repository.
abanihi at casper02 in /glade/collections/cmip/catalog
$ ls -ltrh
total 121M
-rw-r--r-- 1 abanihi cmipdata 121M Oct 8 15:21 glade-cmip6.csv
I am not sure what's the right approach to adopt for these massive catalog files.
> Currently this file is more than 100 MB, which makes it impossible to store in a GitHub repository.
😱
Curious how much it compresses if you gzip it? We can easily open a gzipped CSV from pandas; not sure about JavaScript. A more efficient and fully compatible storage format would be Parquet.
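For what it's worth, both paths are short in pandas; a minimal sketch, assuming the file names used elsewhere in this thread:

import pandas as pd

# pandas infers gzip compression from the .gz suffix, so no extra options are needed
df = pd.read_csv("glade-cmip6.csv.gz")

# Parquet is column-oriented and compressed; requires pyarrow or fastparquet
df.to_parquet("glade-cmip6.parquet")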
It's important to distinguish between the JSON file and the CSV file. The JSON file is tiny and can live anywhere. The CSV file is potentially huge and needs to be updated frequently.
I would vote for putting the JSON files in https://github.com/pangeo-data/pangeo-datastore. They can point to CSV files elsewhere on the web.
To expand a bit, what I would really like is to be able to point the hackathon participants to a single location on the web where they can view all of the data and decide whether to use cloud or cheyenne.
The CSV files tend to compress very well; factors of 20 or more reduction.
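As a concrete example of the JSON/CSV split, here is roughly how intake-esm consumes the pair; the JSON URL below is hypothetical, but intake.open_esm_datastore is the real entry point:

import intake  # intake-esm registers the open_esm_datastore driver

# The tiny JSON spec can live in any git repo; its catalog entry points at the
# large CSV hosted elsewhere (cloud bucket, FTP, ...). Hypothetical URL:
col = intake.open_esm_datastore("https://example.org/catalogs/glade-cmip6.json")
print(col.df.head())  # the referenced CSV, loaded as a pandas DataFrame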
For provenance purposes, it seems like it would be ideal to have the JSON and CSV in a project repo, perhaps as a subtree. pangeo-datastore has a lot of other stuff, however, and is explicitly cloud-focused.
Would it make sense to have a CMIP-datastore repo?
> Curious how much it compresses if you gzip it?
It went from 120MB to 4MB 😀
abanihi at casper03 in /glade/collections/cmip/catalog
$ ls -ltrh
total 4.2M
-rw-r--r-- 1 abanihi cmipdata 4.2M Oct 8 15:21 glade-cmip6.csv.gz
I think the CSV files are too big to put in git / github. We are talking about millions of rows.
@charlesbluca - do you know if your JavaScript stuff can handle opening a .csv.gz file?
Not sure! I’m sure I can find a way to decompress the gzipped file before parsing if PapaParse doesn’t do this natively.
Anderson, I should be able to parse a CSV if it is made available via FTP; if you drop a path where it can be accessed here I can make a page to represent the data available on GLADE (with a view of the metadata based on the example JSON file in this repo).
@charlesbluca, for the time being, I've placed the most recent catalogs (CSVs) for the CMIP6 data on GLADE here: https://github.com/NCAR/intake-esm-datastore/tree/master/catalogs. You will notice that there are two CSV files, namely glade-cmip6-dcpp.csv.gz and glade-cmip6.csv.gz. The main reason for this is that the decadal prediction (dcpp) experiments catalog has an additional column, start_year, which is not present in the rest. We may need to split Pangeo's CSV catalog into two catalogs in order to accommodate the dcpp experiments (at least, this will be necessary for intake-esm to properly load the dcpp data into xarray). @naomi-henderson, what do you think?
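A rough sketch of the kind of split being proposed, assuming a combined catalog in which only the dcpp rows have start_year populated (the output file names are hypothetical):

import pandas as pd

df = pd.read_csv("pangeo-cmip6-zarr-consolidated-stores.csv")

# dcpp rows carry a start_year; all other experiments leave it empty
dcpp = df[df["start_year"].notna()]
other = df[df["start_year"].isna()].drop(columns=["start_year"])

dcpp.to_csv("pangeo-cmip6-dcpp.csv", index=False)
other.to_csv("pangeo-cmip6.csv", index=False)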
I personally think that GitHub may be a better alternative to an FTP server, since we can version-control the catalog and retrieve old versions in case something goes wrong.
For now, I am planning on keeping these csv files up to date by making sure that they are in sync with the copies stored on Glade.
Notebook used to build the catalog: https://nbviewer.jupyter.org/github/NCAR/intake-esm-datastore/blob/master/builders/cmip6_catalog_builder.ipynb
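The notebook has the authoritative logic; for intuition only, here is a stripped-down sketch of the same idea, assuming the files on GLADE follow the standard CMIP6 DRS directory layout (the root path and column list are assumptions):

import pathlib
import pandas as pd

root = pathlib.Path("/glade/collections/cmip/CMIP6")  # assumed root on GLADE
columns = ["activity_id", "institution_id", "source_id", "experiment_id",
           "member_id", "table_id", "variable_id", "grid_label", "version"]

rows = []
for path in root.rglob("*.nc"):
    parts = path.relative_to(root).parts
    if len(parts) == len(columns) + 1:  # nine directory levels plus the filename
        rows.append(dict(zip(columns, parts), path=str(path)))

pd.DataFrame(rows).to_csv("glade-cmip6.csv.gz", index=False, compression="gzip")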
@andersy005 it turns out that uncompressing gzip in pure JavaScript is harder than it seems; is there any server the unzipped CSV could be provided from? We can work out a way to handle gzipped files in the future, but for now it would be faster to work with plain CSV.
We could post these on ftp://ftp.cgd.ucar.edu/archive/aletheia-data
That works! Anywhere it is convenient to host the uncompressed CSV files.
@charlesbluca, the uncompressed catalog resides here:
ftp://ftp.cgd.ucar.edu/archive/aletheia-data/intake-esm-datastore/catalogs/glade-cmip6.csv
I am excited about the browser you are working on. Let me know if you have any questions or have issues accessing the CSV.
Thank you! Feel free to leave up the gzipped version, hopefully with more time I can work on a way to process through gzipped CSV.
You are welcome!
The gzipped version is still available from the same directory:
(base) -bash-4.2$ ls -ltrh
total 147M
-rw-r--r-- 1 abanihi cgdaletheia 6.1M Oct 14 11:42 glade-cmip6.csv.gz
-rw-r--r-- 1 abanihi cgdaletheia 2.3K Oct 14 11:42 glade-cmip6.json
-rw-r--r-- 1 abanihi cgdaletheia 2.1K Oct 14 11:42 pangeo-cmip6.json
-rw-r--r-- 1 abanihi cgdaletheia 141M Oct 14 11:43 glade-cmip6.csv
Looks like I neglected CORS; to access the catalog files through JavaScript, they will need to be hosted via HTTP/HTTPS - the buckets may be the best place to host the catalogs for now.
@rabernat, do you have any more information on how to give files hosted in the bucket the proper CORS headers? I uploaded the GLADE catalog to the pangeo-cmip6 bucket, but it hasn't seemed to inherit the Access-Control-Allow-Origin header that allows us to use it for the web browser.
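In case it helps, the bucket-wide CORS policy can be set with the google-cloud-storage Python client; a sketch, with the policy values below being guesses rather than a known-good configuration:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("pangeo-cmip6")  # bucket name from this thread
bucket.cors = [{
    "origin": ["*"],  # or restrict to the browser's domain
    "method": ["GET", "HEAD", "OPTIONS"],
    "responseHeader": ["Content-Type", "Content-Encoding"],
    "maxAgeSeconds": 3600,
}]
bucket.patch()  # push the updated CORS policy to GCS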
After some trial and error, I now know that our CSV parser is capable of handling gzipped files, but it needs the proper header (Content-Encoding) provided alongside the file so it knows to decompress it.
This is relatively simple with Google Cloud; you can edit the metadata for individual files to adjust and add different headers, and adding "Content-Encoding: gzip" to the GLADE catalog allowed it to automatically be decompressed and parsed in a fraction of the time it would've taken to load in the uncompressed catalog.
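That metadata edit can also be scripted instead of done in the console; a sketch with the Python client (the object name is an assumption):

from google.cloud import storage

client = storage.Client()
blob = client.bucket("pangeo-cmip6").blob("glade-cmip6.csv.gz")  # assumed object name
blob.content_encoding = "gzip"  # lets clients decompress transparently
blob.content_type = "text/csv"  # the decoded payload is plain CSV
blob.patch()  # update the object's metadata in place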
Is there a way to control the headers of files being hosted via GitHub? This seems to be the primary difference between hosting on GitHub versus Google Cloud (other than limitations on file size), and if we could find a way to do this, it might be a "best of both worlds" solution to our problem of where to host the catalogs.
How about we store the GLADE catalog alongside the Pangeo catalog in Google Cloud and worry about the GitHub/headers issue after the hackathon?
I see that the Google Cloud bucket now has two items:
@charlesbluca - is this working? Do you need me to set any CORS or mime-type properties on the google cloud bucket?
@rabernat and @charlesbluca - Uh oh, I hope we are not working at cross-purposes here. I have been keeping this original catalog (in gs://pangeo-cmip6) in sync with the current catalog (in gs://cmip6) so that Ryan's old notebooks will continue to point to a valid catalog. If you need to make changes to https://storage.googleapis.com/pangeo-cmip6/pangeo-cmip6-zarr-consolidated-stores.csv (which lives in gs://pangeo-cmip6), let me know and I will stop updating it.
@naomi-henderson - I think you're fine. Thanks for keeping everything up to date!
Made a PR to the datastore to add the GLADE browser - while the site is building, these changes can be viewed here.
The GLADE catalog is now live here:
Awesome work @charlesbluca! It's fantastic to see this milestone.
A few comments:
@charlesbluca, the catalog browser looks pretty cool!
@rabernat That was one of the first issues I noticed. In the long term, the CSV parser has a way to process the CSV file in chunks, so we could display the spreadsheet view before actually loading in all of the rows; my only reason for not using it now is that it seems to involve more complex CORS options than we currently have set on the GCS bucket.
@charlesbluca - that sounds like a great idea.
> it seems to involve more complex CORS options than we currently have set on the GCS bucket.
Can you give more details? What do we need to tweak to enable this capability?
It seems like when CSV chunking is enabled, instead of sending a GET request to Google Cloud, the parser sends an OPTIONS request; I tried adding OPTIONS to the list of allowed methods for our pangeo-cmip6 bucket, but still got an error stating that the Access-Control-Allow-Origin header was missing.
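For debugging, the preflight can be reproduced outside the browser; a sketch with requests (the origin and requested headers are illustrative):

import requests

url = ("https://storage.googleapis.com/pangeo-cmip6/"
       "pangeo-cmip6-zarr-consolidated-stores.csv")
resp = requests.options(url, headers={
    "Origin": "https://example.org",  # illustrative origin
    "Access-Control-Request-Method": "GET",
    "Access-Control-Request-Headers": "range",  # chunked reads use Range requests
})
# A correctly configured bucket echoes the allowed origin; otherwise the header is absent
print(resp.status_code, resp.headers.get("Access-Control-Allow-Origin"))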
Very nearly done with the basic aspects of the viewer! One suggestion is that the JSON collection specifications use the direct link to the catalogs stored on Google Cloud rather than their "path"; i.e. gs://cmip6/cmip6-zarr-consolidated-stores.csv becomes https://storage.googleapis.com/pangeo-cmip6/pangeo-cmip6-zarr-consolidated-stores.csv. This will allow us to get the view of the CSV entirely from attributes of the JSON file, rather than having to hardcode it.
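The translation is mechanical for public objects, so the browser could also derive it on the fly; a tiny helper illustrating the generic same-bucket case (note that the example above also switches buckets, which this does not cover):

def gs_to_https(uri: str) -> str:
    # gs://<bucket>/<object> -> https://storage.googleapis.com/<bucket>/<object>
    assert uri.startswith("gs://"), "expected a gs:// URI"
    return "https://storage.googleapis.com/" + uri[len("gs://"):]

print(gs_to_https("gs://cmip6/cmip6-zarr-consolidated-stores.csv"))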
Excited to make a PR and get this moving!
This looks great Charles! Please go ahead with another PR to pangeo-datastore whenever you're ready!
@charlesbluca, the latest version of the catalog is at https://github.com/NCAR/intake-esm-datastore/blob/master/catalogs/glade-cmip6.csv.gz. You may need to update the browser to use this new version of the catalog.
Sure! I will need to move this gzipped catalog into pangeo-cmip6 so it can be served with the proper Content-Encoding, is that okay?