agstephens opened this issue 3 years ago
We now have an updated list:
https://github.com/cedadev/cmip6-object-store/blob/master/catalogs/cmip6-datasets_2020-10-27.csv
Let's use that, @agstephens.
I previously prepared a 200TB dataset list, but it was never used. Did you want that one? I have added it via PR.
Thanks @RuthPetrie
@alaniwi: an update has been made to the CSV file that lists the input datasets that we should use when creating the Zarr files.
The total volume is now ~198TB so there should be plenty of conversion work to be done.
Please add this in as the source of the batches:
https://github.com/cedadev/cmip6-object-store/blob/master/catalogs/cmip6-datasets_2020-10-27.csv
You will need to:
[ ] remove the old batch files under: `./data/1.1/batch_*`
[ ] re-generate the batch files from this new CSV file, after pointing the `config.ini` setting for `CONFIG["datasets"]["datasets_file"]` at it, using: `python cmip6_object_store/cmip6_zarr/cli.py create-batches` (see the config sketch after this list)
[ ] test that the pickle file is working by checking that it already thinks ~60,000 datasets have been converted: `python cmip6_object_store/cmip6_zarr/cli.py list -p cmip6 --count-only`
[ ] test running the first batch locally (not on LOTUS) to confirm that it DOES NOT write anything to Caringo, because it has already converted all the files for that batch.
[ ] test running the last batch locally (on `sci3`) to confirm that it DOES process the new datasets to Caringo, because it has NOT YET converted the files for that batch.
[ ] Having confirmed/done all the above, get the new batch running on LOTUS
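For the second item, a sketch of the config change; the section and key names follow from `CONFIG["datasets"]["datasets_file"]`, but the surrounding layout of `config.ini` is an assumption:

```ini
; config.ini -- hypothetical layout; only this key is named in the issue
[datasets]
; point batch generation at the updated catalog
datasets_file = catalogs/cmip6-datasets_2020-10-27.csv
```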
Issues:

`sci3` is not talking to `/badc/cmip6` at all; `ls /badc/cmip6` just hangs indefinitely. What is written below was done on `sci6`.
I do not have pickle files that indicate thousands of datasets have already been done: the ones I have do not contain that many entries.
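A minimal sketch of that check, assuming the progress pickle unpickles to a sized collection of dataset IDs; the file name here is a placeholder, not the repo's actual path:

```python
import pickle

# Hypothetical check: count entries in a progress pickle. The file name is a
# placeholder, and the unpickled object is assumed to support len().
with open("cmip6_converted_datasets.pickle", "rb") as f:
    done = pickle.load(f)

print(f"{len(done)} datasets recorded as converted")  # expected ~60,000
```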
If I try to run the first batch, it does indeed attempt to convert the files. It claims that some of these succeed, for example:
Completed write for: CMIP6.DCPP.IPSL.IPSL-CM6A-LR/dcppC-ipv-pos.r1i1p1f1.Amon.huss.gr.v20190110.zarr
and that some of them fail, for example (paths redacted here):
ERROR:/path/to/cmip6-object-store/cmip6_object_store/cmip6_zarr/zarr_writer.py:FAILED TO COMPLETE FOR: CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.esm-piControl-spinup.r1i1p1f2.Amon.va.gr.v20181018
Failed to get Xarray dataset: CMIP6.CMIP.CNRM-CERFACS.CNRM-ESM2-1.esm-piControl-spinup.r1i1p1f2.Amon.va.gr.v20181018:
Traceback (most recent call last):
  File "/path/to/cmip6-object-store/cmip6_object_store/cmip6_zarr/zarr_writer.py", line 57, in convert
    ds = self._get_ds(dataset_id)
  File "/path/to/cmip6-object-store/cmip6_object_store/cmip6_zarr/zarr_writer.py", line 92, in _get_ds
    ds = xr.open_mfdataset(file_pattern, use_cftime=True, combine="by_coords")
  File "/path/to/cmip6-object-store/venv/lib/python3.7/site-packages/xarray/backends/api.py", line 915, in open_mfdataset
    raise OSError("no files to open")
OSError: no files to open
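The `OSError` means the glob pattern handed to `open_mfdataset` matched no files. A hedged diagnostic sketch (the pattern below is illustrative, reconstructed from the dataset ID; the real one is built inside `_get_ds`):

```python
import glob

import xarray as xr

# Check the glob before calling xarray: open_mfdataset raises
# OSError("no files to open") when the pattern matches nothing.
file_pattern = (
    "/badc/cmip6/data/CMIP6/CMIP/CNRM-CERFACS/CNRM-ESM2-1/"
    "esm-piControl-spinup/r1i1p1f2/Amon/va/gr/v20181018/*.nc"
)
paths = sorted(glob.glob(file_pattern))
if not paths:
    print(f"no files match: {file_pattern}")
else:
    ds = xr.open_mfdataset(paths, use_cftime=True, combine="by_coords")
    print(ds)
```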
Testing a claimed successful one to see what was written: a search for container `CMIP6.DCPP.IPSL.IPSL-CM6A-LR` finds one owned by Ag's username. Looking inside this, filtering the objects by `dcppC-ipv-pos.r1i1p1f1.Amon.huss.gr.v20190110.zarr` finds no results. Testing the filter by instead using `dcppC-amv-ExTrop-neg.r10i1p1f1.Amon.ps.gr.v20190110.zarr` (a known existing object seen in the file list) finds that object, so the filter itself is working. It is unclear where the objects that claim to have been written are going.
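A sketch of the kind of listing check described above, assuming the Caringo store is queried through its S3-compatible interface with `s3fs`; the endpoint URL and credentials are placeholders:

```python
import s3fs

# Hypothetical: list objects under the container/prefix to see whether the
# "successful" zarr write actually landed there.
fs = s3fs.S3FileSystem(
    key="ACCESS_KEY",
    secret="SECRET_KEY",
    client_kwargs={"endpoint_url": "https://caringo.example.ac.uk"},
)

prefix = (
    "CMIP6.DCPP.IPSL.IPSL-CM6A-LR/"
    "dcppC-ipv-pos.r1i1p1f1.Amon.huss.gr.v20190110.zarr"
)
print(fs.ls(prefix))  # an empty listing (or an error) means nothing is there
```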
Generic download URL for objects:
Before launching batches on LOTUS, I will need to check why the total number of datasets in the batch files (under `data/1.1`) is a bit less than the number of lines in the CSV file (a rough comparison sketch follows).
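The paths and the one-dataset-ID-per-line batch file format below are assumptions:

```python
from pathlib import Path

# Hypothetical comparison: dataset IDs across the batch files vs rows in the
# source CSV (ignoring any header line the CSV may have).
batch_total = sum(
    len(p.read_text().splitlines()) for p in Path("data/1.1").glob("batch_*")
)
csv_total = len(
    Path("catalogs/cmip6-datasets_2020-10-27.csv").read_text().splitlines()
)

print(f"batched: {batch_total}, CSV lines: {csv_total}, "
      f"difference: {csv_total - batch_total}")
```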
@RuthPetrie, I'd like to add some more data to the object-store Zarr holdings for CMIP6.
MattMiz recommended using the top-20 (non-ocean) variables from:
http://esgf-ui.cmcc.it/esgf-dashboard-ui/cmip6.html
If we do that, we can download a CSV file and then plug those variables into your script for querying CREPP. Does that sound reasonable to you?