agstephens opened this issue 3 years ago
We now have an updated list:
https://github.com/cedadev/cmip6-object-store/blob/master/catalogs/cmip6-datasets_2020-10-27.csv
Let's use that, @agstephens.
I previously prepared a 200TB dataset list, but it was never used. Did you want that one? I have added it via PR.
Thanks @RuthPetrie
@alaniwi: an update has been made to the CSV file that lists the input datasets that we should use when creating the Zarr files.
The total volume is now ~198TB so there should be plenty of conversion work to be done.
Please add this in as the source of the batches:
https://github.com/cedadev/cmip6-object-store/blob/master/catalogs/cmip6-datasets_2020-10-27.csv
You will need to:
[ ] remove the old batch files under: `./data/1.1/batch_*`
[ ] re-generate the batch files from this new CSV file, after pointing the `config.ini` setting for `CONFIG["datasets"]["datasets_file"]` at it, using: `python cmip6_object_store/cmip6_zarr/cli.py create-batches` (see the config sketch after this list)
[ ] test that the pickle file is working by checking that it already thinks ~60,000 datasets have been converted: `python cmip6_object_store/cmip6_zarr/cli.py list -p cmip6 --count-only`
[ ] test running the first batch locally (not on LOTUS) to confirm that it DOES NOT write anything to Caringo, because it has already converted all the files for that batch.
[ ] test running the last batch locally (on `sci3`) to confirm that it DOES process the new datasets to Caringo, because it has NOT YET converted the files for that batch.
[ ] Having confirmed/done all the above, get the new batch running on LOTUS
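For the second item, a sketch of the config change; the section and key names follow from `CONFIG["datasets"]["datasets_file"]`, but the surrounding layout of `config.ini` is an assumption:

```ini
; config.ini -- hypothetical layout; only this key is named in the issue
[datasets]
; point batch generation at the updated catalog
datasets_file = catalogs/cmip6-datasets_2020-10-27.csv
```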
Issues:

`sci3` is not talking to `/badc/cmip6` at all; `ls /badc/cmip6` just hangs indefinitely. What is written below was done on `sci6`.
I do not have pickle files that indicate thousands of datasets have already been done: the ones I have do not contain that many entries.
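A minimal sketch of that check, assuming the progress pickle unpickles to a sized collection of dataset IDs; the file name here is a placeholder, not the repo's actual path:

```python
import pickle

# Hypothetical check: count entries in a progress pickle. The file name is a
# placeholder, and the unpickled object is assumed to support len().
with open("cmip6_converted_datasets.pickle", "rb") as f:
    done = pickle.load(f)

print(f"{len(done)} datasets recorded as converted")  # expected ~60,000
```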
If I try to run the first batch, it does indeed attempt to convert the files. It claims that some of these succeed, for example:
Completed write for: CMIP6.DCPP.IPSL.IPSL-CM6A-LR/dcppC-ipv-pos.r1i1p1f1.Amon.huss.gr.v20190110.zarr
and that some of them fail, for example (paths redacted here):
ERROR:/path/to/cmip6-object-store/cmip6_object_store/cmip6_zarr/zarr_writer.py:FAILED TO COMPLETE FOR: CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.esm-piControl-spinup.r1i1p1f2.Amon.va.gr.v20181018
Failed to get Xarray dataset: CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.esm-piControl-spinup.r1i1p1f2.Amon.va.gr.v20181018:
Traceback (most recent call last):
  File "/path/to/cmip6-object-store/cmip6_object_store/cmip6_zarr/zarr_writer.py", line 57, in convert
    ds = self._get_ds(dataset_id)
  File "/path/to/cmip6-object-store/cmip6_object_store/cmip6_zarr/zarr_writer.py", line 92, in _get_ds
    ds = xr.open_mfdataset(file_pattern, use_cftime=True, combine="by_coords")
  File "/path/to/cmip6-object-store/venv/lib/python3.7/site-packages/xarray/backends/api.py", line 915, in open_mfdataset
    raise OSError("no files to open")
OSError: no files to open
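The `OSError` means the glob pattern handed to `open_mfdataset` matched no files. A hedged diagnostic sketch (the pattern below is illustrative, reconstructed from the dataset ID; the real one is built inside `_get_ds`):

```python
import glob

import xarray as xr

# Check the glob before calling xarray: open_mfdataset raises
# OSError("no files to open") when the pattern matches nothing.
file_pattern = (
    "/badc/cmip6/data/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1/"
    "esm-piControl-spinup/r1i1p1f2/Amon/va/gr/v20181018/*.nc"
)
paths = sorted(glob.glob(file_pattern))
if not paths:
    print(f"no files match: {file_pattern}")
else:
    ds = xr.open_mfdataset(paths, use_cftime=True, combine="by_coords")
    print(ds)
```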
Testing a claimed successful one to see what was written: a search for container `CMIP6.DCPP.IPSL.IPSL-CM6A-LR` finds one owned by Ag's username. Looking inside this, filtering the objects by `dcppC-ipv-pos.r1i1p1f1.Amon.huss.gr.v20190110.zarr` finds no results. Testing the filter by instead using `dcppC-amv-ExTrop-neg.r10i1p1f1.Amon.ps.gr.v20190110.zarr` (a known existing object seen in the file list) finds that object, so the filter itself is working. It is unclear where the objects that claim to have been written are going.
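A sketch of the kind of listing check described above, assuming the Caringo store is queried through its S3-compatible interface with `s3fs`; the endpoint URL and credentials are placeholders:

```python
import s3fs

# Hypothetical: list objects under the container/prefix to see whether the
# "successful" zarr write actually landed there.
fs = s3fs.S3FileSystem(
    key="ACCESS_KEY",
    secret="SECRET_KEY",
    client_kwargs={"endpoint_url": "https://caringo.example.ac.uk"},
)

prefix = (
    "CMIP6.DCPP.IPSL.IPSL-CM6A-LR/"
    "dcppC-ipv-pos.r1i1p1f1.Amon.huss.gr.v20190110.zarr"
)
print(fs.ls(prefix))  # an empty listing (or an error) means nothing is there
```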
Generic download URL for objects:
Before launching batches on LOTUS, I will need to check why the total number of datasets in the batch files (under `data/1.1`) is a bit less than the number of lines in the CSV file (a rough comparison sketch follows).
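The paths and the one-dataset-ID-per-line batch file format below are assumptions:

```python
from pathlib import Path

# Hypothetical comparison: dataset IDs across the batch files vs rows in the
# source CSV (ignoring any header line the CSV may have).
batch_total = sum(
    len(p.read_text().splitlines()) for p in Path("data/1.1").glob("batch_*")
)
csv_total = len(
    Path("catalogs/cmip6-datasets_2020-10-27.csv").read_text().splitlines()
)

print(f"batched: {batch_total}, CSV lines: {csv_total}, "
      f"difference: {csv_total - batch_total}")
```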
@RuthPetrie, I'd like to add some more data to the object-store Zarr holdings for CMIP6.
MattMiz recommended using the top-20 (non-ocean) variables from:
http://esgf-ui.cmcc.it/esgf-dashboard-ui/cmip6.html
If we do that, we can download a CSV file and then plug those variables into your script for querying CREPP. Does that sound reasonable to you?