PCMDI / input4MIPs_CVs

Controlled Vocabularies (CVs) for use in input4MIPs
https://input4mips-controlled-vocabularies-cvs.readthedocs.io/en/stable/
Creative Commons Attribution 4.0 International

Land-use data upload #123

Closed: znichollscr closed this issue 1 month ago

znichollscr commented 2 months ago

Issue for tracking the progress and any issues related to the land-use data.

cc @lchini @durack1 @vnaik60

lchini commented 2 months ago

I am attempting to upload the files to ftp.llnl.gov. The instructions indicate to upload files to the "incoming" folder, but when I connect to the FTP server (using FileZilla on my Mac) there is just an empty root directory and no folder named "incoming". Should I just upload to that location or am I not connected correctly?

durack1 commented 2 months ago

Good question @lchini, we've had a similar query from @mjevanmarle this morning.

How about trying to connect explicitly using the IP address 198.128.250.1? See below; this seems to work for me currently.

[Screenshot 2024-09-18 at 7:50 AM: successful FTP connection using the IP address]

Once in, you should be able to navigate to the incoming subdirectory, create a new directory for yourself (e.g., UofMD-landState-3-0_240918; we might have to try a couple of times, hence the datestamp on the end), upload, and voila.

znichollscr commented 2 months ago

Option b: here's a python script that should work. You'll need to install input4mips-validation first.

Python script

```python
import ftplib
import os
import traceback
from pathlib import Path

from input4mips_validation.upload_ftp import cd_v, login_to_ftp, mkdir_v, upload_file

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-2"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")
        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")
```
durack1 commented 2 months ago

@lchini @mjevanmarle it seems like there is a DNS/network issue that is causing problems for me when I am not connected to the LLNL institutional network. Weirdly, this isn't an issue for @znichollscr, so might be something that will just work itself out, or will need a nudge within the LLNL network.

This is what I see, which looks similar to @mjevanmarle's issue, and @lchini probably your issue too:

[Screenshot 2024-09-18 at 7:58 AM: failed FTP connection attempt]

I'll raise a ticket with the LLNL network folks to see if someone can check.
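
As a quick check while we wait, something like the below (a minimal sketch, Python standard library only) shows whether the hostname resolves from your network:

import socket

print(socket.gethostbyname("ftp.llnl.gov"))  # compare against the 198.128.250.1 address above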

lchini commented 2 months ago

Yes that is the same issue that I'm experiencing. I've been trying to configure settings on my end but it sounds like I might need to wait for the LLNL server update.

znichollscr commented 2 months ago

@lchini can you try the python script and post the output here if it fails please?

znichollscr commented 2 months ago

Python script is here

import ftplib
import os
import traceback
from pathlib import Path

from input4mips_validation.upload_ftp import cd_v, login_to_ftp, mkdir_v, upload_file

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-2"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")

lchini commented 2 months ago

I tried to install input4mips-validation so that I could run the python script. I used pip since I didn't have mamba installed. Although pip did not return an error, I don't think the installation worked correctly because when I tried to run the python script I received the error: No module named 'input4mips_validation'
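
One way to check whether the interpreter you're running can actually see the package (a minimal sketch, standard library only; None means this Python environment is not the one pip installed into):

import importlib.util

print(importlib.util.find_spec("input4mips_validation"))  # None => not visible to this interpreter

Running `python -m pip install input4mips-validation` guarantees that pip and the interpreter match.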

durack1 commented 2 months ago

I've tried to engage with LLNL comp support folks and had a tepid response, so if this doesn't begin to work next time we try, let's seek alternative paths to getting these data in the publication queues

znichollscr commented 2 months ago

I tried to install input4mips-validation so that I could run the python script. I used pip since I didn't have mamba installed. Although pip did not return an error, I don't think the installation worked correctly because when I tried to run the python script I received the error: No module named 'input4mips_validation'

Hmmm, that's unfortunate. Here's a version of the script without any dependencies outside the standard library, so it should just work with any Python >= 3.9. Can you try that please?

import ftplib
import os
import traceback
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import Optional

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-4"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

@contextmanager
def login_to_ftp(
    ftp_server: str, username: str, password: str, dry_run: bool
) -> Iterator[Optional[ftplib.FTP]]:
    """
    Create a connection to an FTP server.

    When the context block is exited, the connection is closed.

    If we are doing a dry run, `None` is returned instead
    to signal that no connection was actually made.
    We do, however, log messages to indicate what would have happened.

    Parameters
    ----------
    ftp_server
        FTP server to login to

    username
        Username

    password
        Password

    dry_run
        Is this a dry run?

        If `True`, we won't actually login to the FTP server.

    Yields
    ------
    :
        Connection to the FTP server.

        If it is a dry run, we simply return `None`.
    """
    if dry_run:
        print(f"Dry run. Would log in to {ftp_server} using {username=}")
        ftp = None

    else:
        ftp = ftplib.FTP(ftp_server, passwd=password, user=username)  # noqa: S321
        print(f"Logged into {ftp_server} using {username=}")

    yield ftp

    if ftp is None:
        if not dry_run:  # pragma: no cover
            raise AssertionError
        print(f"Dry run. Would close connection to {ftp_server}")

    else:
        ftp.quit()
        print(f"Closed connection to {ftp_server}")

def cd_v(dir_to_move_to: str, ftp: ftplib.FTP) -> ftplib.FTP:
    """
    Change directory verbosely

    Parameters
    ----------
    dir_to_move_to
        Directory to move to on the server

    ftp
        FTP connection

    Returns
    -------
    :
        The FTP connection
    """
    ftp.cwd(dir_to_move_to)
    print(f"Now in {ftp.pwd()} on FTP server")

    return ftp

def mkdir_v(dir_to_make: str, ftp: ftplib.FTP) -> None:
    """
    Make directory verbosely

    Also, don't fail if the directory already exists

    Parameters
    ----------
    dir_to_make
        Directory to make

    ftp
        FTP connection
    """
    try:
        print(f"Attempting to make {dir_to_make} on {ftp.host=}")
        ftp.mkd(dir_to_make)
        print(f"Made {dir_to_make} on {ftp.host=}")
    except ftplib.error_perm:
        print(f"{dir_to_make} already exists on {ftp.host=}")

def upload_file(
    file: Path,
    strip_pre_upload: Path,
    ftp_dir_upload_in: str,
    ftp: Optional[ftplib.FTP],
) -> Optional[ftplib.FTP]:
    """
    Upload a file to an FTP server

    Parameters
    ----------
    file
        File to upload.

        The full path of the file relative to `strip_pre_upload` will be uploaded.
        In other words, any directories in `file` will be made on the
        FTP server before uploading.

    strip_pre_upload
        The parts of the path that should be stripped before the file is uploaded.

        For example, if `file` is `/path/to/a/file/somewhere/file.nc`
        and `strip_pre_upload` is `/path/to/a`,
        then we will upload the file to `file/somewhere/file.nc` on the FTP server
        (relative to whatever directory the FTP server is in
        when we enter this function).

    ftp_dir_upload_in
        Directory on the FTP server in which to upload `file`
        (after removing `strip_pre_upload`).

    ftp
        FTP connection to use for the upload.

        If this is `None`, we assume this is a dry run.

    Returns
    -------
    :
        The FTP connection.

        If it is a dry run, this can simply be `None`.
    """
    print(f"Uploading {file}")
    if ftp is None:
        print(f"Dry run. Would cd on the FTP server to {ftp_dir_upload_in}")

    else:
        cd_v(ftp_dir_upload_in, ftp=ftp)

    filepath_upload = file.relative_to(strip_pre_upload)
    print(f"Relative to {ftp_dir_upload_in} on the FTP server, will upload {file} to {filepath_upload}")

    for parent in list(filepath_upload.parents)[::-1]:
        if parent == Path("."):
            continue

        to_make = parent.parts[-1]

        if ftp is None:
            print("Dry run. " "Would ensure existence of " f"and cd on the FTP server to {to_make}")

        else:
            mkdir_v(to_make, ftp=ftp)
            cd_v(to_make, ftp=ftp)

    if ftp is None:
        print(f"Dry run. Would upload {file}")

        return ftp

    with open(file, "rb") as fh:
        upload_command = f"STOR {file.name}"
        print(f"Upload command: {upload_command}")

        try:
            print(f"Initiating upload of {file}")
            ftp.storbinary(upload_command, fh)

            print(f"Successfully uploaded {file}")
        except ftplib.error_perm:
            print(
                f"{file.name} already exists on the server in {ftp.pwd()}. "
                "Use a different directory on the receiving server "
                "if you really wish to upload again."
            )
            raise

    return ftp

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")

lchini commented 2 months ago

Thanks for the new script Zeb. I'm running it now and it appears to be working, although it's hard to gauge progress on the other end. The first time I ran the script it appeared to be uploading ALL files within the given directory, so I canceled that and moved to a different folder. So there might be some half-uploaded files from that first run.

lchini commented 2 months ago

The script completed. Can someone else confirm that it was successful? I just uploaded a single file because I wanted to make sure everything looks OK with that one before sending the others. Let me know if there is anything I need to change with the format or metadata in the file. Also, I assume the filename will be changed from the name of the uploaded file?

znichollscr commented 2 months ago

The first time I ran the script it appeared to be uploading ALL files within the given directory, so I canceled that and moved to a different folder. So there might be some half-uploaded files from that first run.

Ah yes, it uploads every .nc file it can find; I probably should have warned you about that :)
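
If you only want to send a single file, a minimal tweak is to narrow the pattern in the loop (filename illustrative):

# instead of rglob("*.nc"), match just the file you want
for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("states_new_vars2.nc"):
    ...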

Can someone else confirm that it was successful?

Hopefully @durack1 can take a look. Can you tell us which directory you uploaded in (i.e. the value of FTP_DIR_REL_TO_ROOT in the script)?
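
Anyone can check the listing anonymously with something like the below (a sketch; the directory name is whatever you used for FTP_DIR_REL_TO_ROOT):

import ftplib

with ftplib.FTP("ftp.llnl.gov", user="anonymous", passwd="your_email") as ftp:
    ftp.cwd("/incoming/UofMD-landState-3-0_240918_1")
    ftp.retrlines("LIST")  # prints one line per file on the server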

I just uploaded a single file because I wanted to make sure everything looks OK with that one before sending the others. Let me know if there is anything I need to change with the format or metadata in the file. Also, I assume the filename will be changed from the name of the uploaded file?

Sounds good. We'll take a look and get back to you asap

durack1 commented 2 months ago

@lchini great! We're off. I could see the below, so if that looks right to you, mint a new upload dir and give us the lot.

[Screenshot 2024-09-19 at 6:51 AM: FTP directory listing showing the uploaded file]

If you can also indicate what we're to expect (number of files), then I can double-check these and drop them into the publication queue, where we can run @znichollscr's validator to double-check.

znichollscr commented 2 months ago

Alrighty looks like Paul found it so don't worry, we don't need any more info for now. I'll take a look and get back to you asap.

Also, I assume the filename will be changed from the name of the uploaded file?

Yep we'll re-write that as part of putting the file in the DRS

durack1 commented 2 months ago

@znichollscr the 2 files are in the normal place - ../LouiseChini-landUseChange/20240919

znichollscr commented 2 months ago

Alrighty:

I'm assuming that management4.nc is a half-uploaded file, because I couldn't even read it with ncdump...

For states_new_vars2.nc:

Other than that, looks good I think

durack1 commented 2 months ago

@znichollscr yep, looks like you're right: we have a bigger version of that file now, AND another transitions file. @lchini so I'll wait until it's all up; what should we be expecting, i.e., how many files and what filenames/sizes?

[Screenshot 2024-09-19 at 10:36 AM: FTP directory listing]

I might wait until I've heard back from you, and until the complete set is in, before I pull these across.

lchini commented 2 months ago

The management4.nc file was uploaded in error when I didn't realize that the python script would upload all files in the given directory. So please delete that one. There are 4 files that I'll be uploading: states, transitions, and management, plus a staticData file. The issues that Zeb pointed out with the states file will be present in the transitions and management files too. I've already uploaded the transitions file, so I will have to fix and re-upload that one as well as the states file, and I'll try to update the management file before I upload it.

lchini commented 2 months ago

For the time units, the product is annual. We originally created time units that give the actual year, e.g. 850, 851, 852, etc. I post-processed that to give years since 850, e.g. 0, 1, 2, 3, .... Should I revert to the original plan, or switch to days since 850 as you suggested? We have 1175 years of data, so a simple multiplication by 365 will end up missing quite a few days due to leap years.

durack1 commented 2 months ago

So please delete that one

Unfortunately I can't do anything about deleting/moving/etc. on this system; it's simply a dropbox. So, good to know: I'll purge it from our copy(ies) once I pull the complete file list down.

When you have the new data generated, upload it to a new directory, e.g. UofMD-landState-3-0_240919_1; that way we won't have problems with attempts to overwrite files etc., which likely won't work.

Also we have a standard template for the filenames (and directory structure, which I can impose once the files are down and their metadata matches what we expect), so this should be something like transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2023.nc

durack1 commented 2 months ago

For the time units, the product is annual. We originally created time units that give the actual year, e.g. 850, 851, 852, etc. I post-processed that to give years since 850, e.g.: 0,1,2,3 .... Should I revert to the original plan or switch to days since 850 as you suggested. We have 1175 years of data so a simple multiplication by 365 will end up missing quite a few days due to leap years.

UDUNITS defines a year to be exactly 365.242198781 days (the interval between 2 successive passages of the sun through the vernal equinox; yes, pedantic). So if we are mapping into days since, then we'd have to be careful about @znichollscr's suggested multiplication, as this will lead to problems toward the end of the record. In addition, as you span the Gregorian hop (1582-10-04 is followed by 1582-10-15), this is going to get a little weird. @lchini how are you writing these files, with what software? The Python datetime library and cftime could help here.
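
For example, cftime handles the noleap calendar cleanly, including across the 1582 hop (a minimal check; values illustrative):

import cftime

# in a noleap calendar every year is exactly 365 days, even 1582
d1 = cftime.DatetimeNoLeap(1582, 1, 1)
d2 = cftime.DatetimeNoLeap(1583, 1, 1)
print((d2 - d1).days)  # 365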

lchini commented 2 months ago

Our model that generates the data and writes the files is written in C++. I am doing some post-processing on the files in MATLAB (just to add in the new variables that don't have computed data yet), and then doing more post-processing (modifying the time dimension, writing global attributes, etc) using NCO command-line tools.

I guess my question is: since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

durack1 commented 2 months ago

I guess my question is: since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

To be honest, your file looks pretty good to me (ncdump -ct $file.nc):

variables:
        double time(time) ;
                time:axis = "T" ;
                time:calendar = "noleap" ;
                time:long_name = "time" ;
                time:realtopology = "linear" ;
                time:standard_name = "time" ;
                time:units = "years since 850-01-01 0:0:0" ;

...

data:

 time = "0850-01-01", "0851-01-01", "0852-01-01", "0853-01-01", "0854-01-01", 
    "0855-01-01", "0856-01-01", "0857-01-01", "0858-01-01", "0859-01-01", 
...
    "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01", "2019-01-01", 
    "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01" ;

The xarray warning is

>>> fh = xcdat.open_dataset("transitions_new_vars2.nc")
../lib/python3.11/site-packages/xarray/coding/times.py:167: SerializationWarning: Ambiguous reference
date string: 850-01-01 0:0:0. The first value is assumed to be the year hence will be padded with zeros
to remove the ambiguity (the padded reference date string is: 0850-01-01 0:0:0). To remove this
message, remove the ambiguity by padding your reference date strings with zeros.
  warnings.warn(warning_msg, SerializationWarning)
>>> fh.time.data
array([cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)

A quick tweak of 850-1-1 0:0:0 to 0850-01-01 0:0:0 might get you most of the way there. Adding a time_bnds variable would be ideal too, so that we satisfy CF requirements: this would mean that the first year is bounded by 0850-01-01 0:0:0 and 0851-01-01 0:0:0. This would also be cleaner if the time axis value were the central value within the annual time period, i.e. 0850-07-02.

durack1 commented 2 months ago

There are also a couple of inconsistencies in the file metadata vs what we are expecting:

// global attributes:
                :host = "UMD College Park" ;
                :creation_date = "2024-07-18T14:51:50Z" ;
                :Conventions = "CF-1.6" ;
                :data_structure = "grid" ;
                :dataset_category = "landState" ;
                :variable_id = "multiple" ;
                :grid_label = "gn" ;

                :mip_era = "CMIP6" ;  ## CMIP6Plus

                :license = "Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution \\\"Share Alike\\\" 4.0 International License (http://creativecommons.org/licenses
/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the s
upply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
                :further_info_url = "http://luh.umd.edu" ;
                :frequency = "yr" ;
                :institution_id = "UofMD" ;
                :institution = "University of Maryland (UofMD), College Park, MD 20742, USA" ;
                :realm = "land" ;
                :source = "LUH3 V0: Land-Use Harmonization Data Set for CMIP7" ;
                :comment = "LUH3 V0" ;
                :title = "UofMD LUH3 V0 dataset prepared for CMIP7" ;

                :activity_id = "CMIP7" ;  ### input4MIPs

                :dataset_version_number = "LUH3 V0" ;

                :source_id = "UofMD-landState-LUH3" ;  ## UofMD-landState-3-0

                :target_mip = "CMIP7" ;  ### CMIP

                :references = "Hurtt et al. 2020, Chini et al. 2021" ; ## Want to expand these with DOIs?

                :contact = "lchini@umd.edu, gchurtt@umd.edu" ;

znichollscr commented 2 months ago

Should I revert to the original plan or switch to days since 850 as you suggested. We have 1175 years of data so a simple multiplication by 365 will end up missing quite a few days due to leap years.

(This is completely non-obvious unless you love the CF-conventions.) Because you're using a 'noleap' calendar, every year in your calendar has exactly 365 days. Hence, you can do the multiplication by 365 without an issue (just don't change the calendar attribute of your time variable!).

UDUNITS defines a year to be exactly 365.242198781 days (the interval between 2 successive passages of the sun through vernal equinox, yes pedantic). So if we are mapping into days since, then we'd have to be careful about @znichollscr suggested multiplication, as this will lead to problems toward the end of the record. In addition as you span the Gregorian (1582-10-04 to 1582-10-15 the next day) hop, this is going to get a little weird..

See above. Because of the calendar attribute, UDUNITS doesn't come into it and just multiplying by 365 is fine (again, this statement only applies because of the "noleap" calendar).

since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

As above, because of the calendar, converting to days is trivial. The reason I would (strongly) recommend doing this is that the data doesn't load properly with xarray if the time units are "years since" rather than "days since". This is a bug in xarray, but given it is such a widely used tool, I would recommend making this tweak (particularly given how trivial it is).
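
As a quick sanity check with cftime (a sketch; values illustrative):

import cftime

units = "days since 0850-01-01 00:00:00"
# years 850, 851, 852 after multiplying the old values (0, 1, 2) by 365
print(cftime.num2date([0, 365, 730], units, calendar="noleap"))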

The xarray warning is

Note here @durack1 that you've loaded with xcdat, not xarray (I assume xcdat is better up to speed with CF-conventions than xarray/cftime, which is what raises the original error). If you try to load with xarray you get:

click me to see the full xarray error

```python
>>> import xarray as xr
>>> xr.open_dataset("states_new_vars2.nc", use_cftime=True)
Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 218, in _decode_cf_datetime_dtype
    result = decode_cf_datetime(example_value, units, calendar, use_cftime)
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 349, in decode_cf_datetime
    dates = _decode_datetime_with_cftime(flat_num_dates, units, calendar)
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 242, in _decode_datetime_with_cftime
    cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
  File "src/cftime/_cftime.pyx", line 587, in cftime._cftime.num2date
  File "src/cftime/_cftime.pyx", line 105, in cftime._cftime._dateparse
ValueError: In general, units must be one of 'microseconds', 'milliseconds', 'seconds', 'minutes', 'hours', or 'days' (or select abbreviated versions of these). For the '360_day' calendar, 'months' can also be used, or for the 'noleap' calendar 'common_years' can also be used. Got 'years' instead, which are not recognized.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 450, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 291, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 992, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 228, in _decode_cf_datetime_dtype
    raise ValueError(msg)
ValueError: unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/api.py", line 588, in open_dataset
    backend_ds = backend.open_dataset(
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 659, in open_dataset
    ds = store_entrypoint.open_dataset(
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/store.py", line 46, in open_dataset
    vars, attrs, coord_names = conventions.decode_cf_variables(
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 461, in decode_cf_variables
    raise type(e)(f"Failed to decode variable {k!r}: {e}") from e
ValueError: Failed to decode variable 'time': unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.
```

so this should be something like transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2023.nc

I don't think this matters for us though, does it Paul? We'll just re-write with the correct name and save @lchini the headache. If you do want to write it yourself, the current DRS suggests the filename should start with "multiple-*" (e.g. multiple-transitions, multiple-states) because there are multiple variables in the file.

lchini commented 2 months ago

Thanks for this info. I think most of these changes will be easy to implement and I will get started on it right away. The issue of the time variable and time bounds should also be OK, but I just wanted to make sure I get this right. As I understand it, the plan is the following:

1. Change the time variable to be days since 850-1-1 0:0:0 and multiply the current values by 365 to convert the existing values (this is OK since my calendar does not include leap years, and I assume it is not impacted by the 1582 calendar jump). So we will have time values of 0, 365, 730, ...
2. Add a time bounds variable that is 2-dimensional: one dimension will be the same size as the time variable and the other dimension will have a length of 2. Since the time variable will be days since 850, i.e. 0, 365, 730, 1095, ..., the time bounds variable will be [[0,365],[365,730],[730,1095],...]

Does this sound correct?

Questions:

1. Do I also need to change 850 to 0850?
2. Regarding the idea of making the time variable the central value for the year, i.e. 0850-07-02: we actually consider the land-use states in each year to be the states on Jan 1, so I would prefer not to make that change. The transitions in the year 850 are actually the transitions that occur during that year (from Jan 1 850 to Jan 1 851), so they are not really tied to a specific date but span that time period. So, do I need to change anything here to reflect that, or leave it as it is?

znichollscr commented 2 months ago

1. change the time ...

All correct. (The 1582 calendar change also doesn't matter as all you're really saying with your data is, "this is the start of year" state, which is what the approach you're taking will do.)

2. add a time bounds variable...

Spot on. I think the variable is meant to be called time_bnds according to the CF-conventions. When you do this, it's also recommended (perhaps required) to add a "bounds" attribute to the "time" variable that has the value "time_bnds". (In Python, that would be something like ds["time"].setncattr("bounds", "time_bnds"). I know you don't use Python, but that might help make things clearer.)
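
A slightly fuller netCDF4-python sketch of the same step (assumptions: the existing time dimension is called "time" and the filename is illustrative):

import netCDF4

with netCDF4.Dataset("states.nc", "a") as ds:
    ds.createDimension("nbnd", 2)
    time_bnds = ds.createVariable("time_bnds", "f8", ("time", "nbnd"))
    time_bnds[:, 0] = ds["time"][:]        # start of each annual bound
    time_bnds[:, 1] = ds["time"][:] + 365  # end; valid because the calendar is noleap
    ds["time"].setncattr("bounds", "time_bnds")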

Do I also need to change 850 to 0850?

I don't think it matters, but I don't think it will hurt either, and it will make things easier for tools that expect 4 digits in their years, so I would do this if it were me (I'm assuming it is a very easy change).

2. So, do I need to change anything here to reflect that or leave it as it is?

Given the info you have provided, I would leave as is.

znichollscr commented 2 months ago

I think most of these changes will be easy to implement and I will get started on it right away

Exciting!

durack1 commented 2 months ago

Everything that @znichollscr said is spot on.

Do I also need to change 850 to 0850?

I would. Several software packages are finicky about these things, and vanilla xarray (or, more correctly, pandas) is one of them. Also, as noted in the CF conventions documentation, years since is not a recommended time unit, for exactly the calendar reasons discussed above. Best to avoid such tripwires if we can. As an aside, having "days since 0000-01-01 0:0:0" also leads many software packages to blow up, as a zero year doesn't exist in any calendar; the creative ways software deals with this are interesting.

2. Regarding the idea of making the time variable the central value for the year, i.e. 0850-07-02, we actually consider the land use states in each year to be the states on jan 1, so I would prefer not to make that change.

Also fine with me. The bounds capture the time range that these states are valid for, and you have good reason to specify the start day of the year, so also fine.

I think you've got the intel you need from us @lchini, so please chime in if there are remaining questions. Once you have a file ready to validate, just upload it (the smallest of these) and we can quickly check things out, get any feedback to you, and then get these finalized and published.

Margreet's biomass burning emissions data was just dropped into the ESGF publication queue this afternoon, so we are getting close to the target of nearly all datasets in place... Woo hoo!

lchini commented 2 months ago

I've made the requested changes to the land-use states file and uploaded it to the FTP server. I will work on making the same changes to the other files now as well. If you see anything in the new file that is not quite right, let me know and I will fix them.

durack1 commented 2 months ago

@lchini this looks good to me, the issues highlighted above (https://github.com/PCMDI/input4MIPs_CVs/issues/123#issuecomment-2361918646) are fixed.

It seems you've hardcoded :creation_date = "2024-07-18T14:51:36Z"; you might want to generate this automatically, so we don't have old info lurking around in files.

For a python example (which may or may not be useful), see below

$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
2024-09-23T21:33:04Z

@znichollscr the single file is on nimbus ../LouiseChini-landUseChange/20240923

Also, a question for you: did you want to rename these files so that what you are producing is consistent with what will be downloaded from ESGF? This is optional, but we will confuse folks if we have inconsistent filenames from differing sources, even if their content is identical. @znichollscr highlighted the renaming above (https://github.com/PCMDI/input4MIPs_CVs/issues/123#issuecomment-2362056200)

durack1 commented 2 months ago

And just adding another note, looks like the time axis fix has solved the xarray read problems, at least for me

$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> fh = xr.open_dataset("../LouiseChini-landUseChange/20240923/states_new_vars3.nc")
>>> fh
<xarray.Dataset>
Dimensions:    (time: 1175, lat: 720, lon: 1440, nbnd: 2)
Coordinates:
  * time       (time) object 0850-01-01 00:00:00 ... 2024-01-01 00:00:00
  * lat        (lat) float64 89.88 89.62 89.38 89.12 ... -89.38 -89.62 -89.88
  * lon        (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
Dimensions without coordinates: nbnd
Data variables: (12/16)
    primf      (time, lat, lon) float32 ...
    primn      (time, lat, lon) float32 ...
    secdf      (time, lat, lon) float32 ...
    secdn      (time, lat, lon) float32 ...
    urban      (time, lat, lon) float32 ...
    c3ann      (time, lat, lon) float32 ...
    ...         ...
    pastr      (time, lat, lon) float32 ...
    range      (time, lat, lon) float32 ...
    secmb      (time, lat, lon) float32 ...
    secma      (time, lat, lon) float32 ...
    pltns      (time, lat, lon) float32 ...
    time_bnds  (nbnd, time) object ...
Attributes: (12/25)
    host:                    UMD College Park
    creation_date:           2024-07-18T14:51:36Z
    Conventions:             CF-1.6
    data_structure:          grid
    dataset_category:        landState
    variable_id:             multiple
    ...                      ...
    source_id:               UofMD-landState-3-0
    target_mip:              CMIP
    mip_era:                 CMIP6Plus
    references:              Hurtt et al. 2020 (https://doi.org/10.5194/gmd-1...
    history:                 Mon Sep 23 13:31:22 2024: ncrename -a ._Fillvalu...
    NCO:                     netCDF Operators version 5.0.0 (Homepage = HTTP:...
>>>

znichollscr commented 2 months ago

Hi @lchini looking good. Tweaks from this round below:

Thanks!

lchini commented 2 months ago

Thanks for these additional modifications Zeb. I've updated the states file and uploaded it to the FTP server. I'm working on the transitions and management files now to implement the same changes. The management file has a couple of variables where the standard_name attribute is listed as 'biomass_fraction', which I now realize is not a standard name. So, I'm assuming I should just remove standard_name for those variables, like I did with secma in the states file?

Also, the creation_date attribute is generated automatically when I create the data. The files that I'm uploading are based on the data that I created on July 18, 2024. Since then I have just been modifying the files with these metadata corrections etc, and I did also add in some placeholder variables that we will fill with actual data in the next release. I did not change the creation_date attribute when I made those changes. But moving forward the creation_date will update based on the date when the new data gets generated.

znichollscr commented 2 months ago

The management file has a couple of variables where the standard_name attribute is listed as 'biomass_fraction', which I now realize is not a standard name. So, I'm assuming I should just remove standard_name for those variables, like I did with secma in the states file?

Yep, for these cases: a) remove "standard_name" and b) make sure that there is at least a value for "long_name".

Also, the creation_date attribute is generated automatically when I create the data....

Ah ok. We normally use that for when the file is created, rather than the data, so we can tell the difference between files more easily (even if they have the same name, the creation date helps us differentiate). It's probably not essential to change though (although @durack1 can correct me).

Speaking of identifying files, the other attribute we're missing is "tracking_id". This should be file-specific and generated following the UUID4 protocol (re-generated every time you write a new file). In Python, it can be generated with code like the below:

import uuid
tracking_id = "hdl:21.14100/" + str(uuid.uuid4())

In MATLAB, it's a bit less clear to me, but that's also because I'm worse at reading MATLAB docs I think.

znichollscr commented 2 months ago

(Although, to be honest, I would be ok with skipping tracking_id for this first set of files and just picking it up next time we go round...)

durack1 commented 2 months ago

Hi folks, I'm sorry, but the tracking_id is an ESGF dependency that needs to be unique per file. Apologies for omitting that check. Below is a MATLAB example of generating a compatible UUID4.

>> disp(join(["hdl:21.14100",char(java.util.UUID.randomUUID)],"/")) # Matlab R2023a
hdl:21.14100/df3a5513-ee63-4969-aff4-5efc4e71f4bc
Which matches the format of the python UUID4
:tracking_id = "hdl:21.14100/c0045041-73e0-4e75-b36d-38a962fb813c" ; 

Above example is from the PCMDI-AMIP-1-1-6 example here

And a MATLAB example of creating a creation_date which aligns with the ESGF expectation:

>> disp(join([replace(char(datetime('now','Format','yyyy-MM-dd_HH:mm:ss','Timezone','Z')),'_','T'),'Z']))
2024-09-25T17:42:47Z

This matches the Python:

>>> import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
2024-09-25T17:44:04Z

@znichollscr the latest transitions file is now in ../LouiseChini-landUseChange/20240925

znichollscr commented 2 months ago

Hi folks, I'm sorry, but the tracking_id is an ESGF dependency that needs to be unique per file

That settles that then :)

@znichollscr the latest transitions file is now in ../LouiseChini-landUseChange/20240925

Having looked at it now, it looks like most of your variables don't have a true standard name. Standard names never contain whitespace, so anytime there is whitespace in a standard name, that information should either be moved into "long_name" or, if there's already a "long_name", you can just delete the "standard_name" attribute entirely.

Looking closer at the values, I would be surprised if any of your variables had standard names (the full list is here: https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html). The thing suggesting that to me is that lots of your variables have the same standard name, and I don't think two different variables can have the same standard name (so it seems like the standard names are wrong to me). I could be mistaken of course.
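
A quick way to screen for the whitespace cases (a sketch with xarray; filename illustrative):

import xarray as xr

ds = xr.open_dataset("multiple-management.nc", decode_times=False)
for name, var in ds.variables.items():
    sn = var.attrs.get("standard_name")
    if sn is not None and " " in sn:
        print(f"{name}: {sn!r} is not a valid standard_name")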

durack1 commented 2 months ago

most of your variables don't have a true standard name

I 100% agree with @znichollscr: if there is not a very definitive mapping to a CF standard name that has been approved and listed in v86 of the CF Standard Name table, then let's remove "standard_name" and instead go with the descriptive "long_name" attribute alone. If we want to jump through hoops to get a standard name assigned, we can do that on the second go-around.

lchini commented 2 months ago

OK, sounds good. I'll remove "standard_name" for all variables. I assume it's OK/preferable to keep the existing standard_name for time, lat, and lon? I can also add the tracking_id. Do I need to do anything about creation_date at this stage, or leave it as is for now? BTW, I've now uploaded a full set of files: states, transitions, management, and staticData. I know that most of your comments so far have been for the states file, so if you want to take a look at the other ones as well and let me know whether they are compliant, that would be helpful.

durack1 commented 2 months ago

I assume it's OK/preferable to keep the existing standard_name for time, lat, and lon?

Yep, these standard_names and all other attributes are registered standards that you are using correctly

Do I need to do anything about creation_date at this stage or leave it as is for now?

The creation_date is meant to indicate the date that a file was generated, and this file (and others) was not generated on 18 July 2024, so I would prefer we update this, and preferably update it automatically as files are written, just so we don't create this inconsistency again. As @znichollscr notes, the creation_date is one of the attributes that, if used correctly, provides a breadcrumb trail of when a file was generated; in most cases the latest-dated file is presumably the preferred one.

In the CMIP6 example file (here), this lists the following attributes as "absolutely essential": Conventions, activity_id, contact, creation_date, dataset_category, frequency, further_info_url, grid_label, institution, institution_id, mip_era, nominal_resolution, realm, source, source_id, source_version, target_mip, title, tracking_id, variable_id.

Looking at the below, we're now all good, aside from the bolded entries above: add nominal_resolution = "25 km", rename "dataset_version_number" -> source_version = "3.0", add a tracking_id (MATLAB code as above, https://github.com/PCMDI/input4MIPs_CVs/issues/123#issuecomment-2374766167), and the update to creation_date, also above.

ncdump -ct multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc
...
// global attributes:
                :host = "UMD College Park" ;
                :creation_date = "2024-07-18T14:51:50Z" ;
                :Conventions = "CF-1.6" ;
                :data_structure = "grid" ;
                :dataset_category = "landState" ;
                :grid_label = "gn" ;
                :license = "Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution \\\"Share Alike\\\" 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
                :further_info_url = "http://luh.umd.edu" ;
                :frequency = "yr" ;
                :institution_id = "UofMD" ;
                :institution = "University of Maryland (UofMD), College Park, MD 20742, USA" ;
                :realm = "land" ;
                :source = "LUH3 V0: Land-Use Harmonization Data Set for CMIP7" ;
                :comment = "LUH3 V0" ;
                :title = "UofMD LUH3 V0 dataset prepared for CMIP7" ;
                :dataset_version_number = "LUH3 V0" ;
                :contact = "lchini@umd.edu, gchurtt@umd.edu" ;
                :activity_id = "input4MIPs" ;
                :source_id = "UofMD-landState-3-0" ;
                :target_mip = "CMIP" ;
                :mip_era = "CMIP6Plus" ;
                :references = "Hurtt et al. 2020 (https://doi.org/10.5194/gmd-13-5425-2020), Chini et al. 2021 (https://doi.org/10.5194/essd-13-4175-2021)" ;
                :history = "Wed Sep 25 09:16:37 2024: ncrename -a ._Fillvalue,_FillValue transitions_new_vars3.nc" ;
                :NCO = "netCDF Operators version 5.0.0 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)" ;
                :variable_id = "multiple-transitions" ;
...
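
For reference, the fixes listed above would look something like the below in netCDF4-python (a sketch; filename illustrative, attribute values as discussed in this thread):

import datetime
import uuid

import netCDF4

with netCDF4.Dataset("multiple-transitions.nc", "a") as ds:
    ds.nominal_resolution = "25 km"
    ds.source_version = "3.0"
    ds.delncattr("dataset_version_number")  # renamed to source_version above
    ds.tracking_id = "hdl:21.14100/" + str(uuid.uuid4())
    ds.creation_date = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")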

znichollscr commented 2 months ago

Hi @lchini, thanks again for your patience with this. I found one more thing. I realise that @durack1 and I have thrown quite a lot at you by now, so I've tried to summarise below too.

The extra thing

The time bounds values are still not coming through as expected. For example, if I look at the time bounds, the values are

>>> tmp["time_bnds"].values[:3, :]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(854, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(855, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

What this is basically saying is that the first time step goes from 850-01-01 to 851-01-01; that's all good. However, it then says that the second timestep goes from 852-01-01 to 853-01-01, i.e. one year too far forward. For the third timestep, the bounds are 854-01-01 to 855-01-01, now two years too far forward.

This looks like some sort of stacking issue. If I look in the middle of the bounds, I see that the bounds effectively restart:

>>> tmp["time_bnds"].values[585:589, :]
array([[cftime.DatetimeNoLeap(2020, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(854, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

I think this should be an easy fix. In pseudo-code, what you want is

time_bounds = [
    time,  # start of each bound is start of the timestep
    time + 365,  # end of each bound is 365 days after the start of the timestep
].T  # then transpose it all so that time is the first axis and the bound is the second

The first few values should then look like

>>> tmp["time_bnds"].values[:3, :]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

or, in raw values

>>> tmp["time_bnds"].values[:3, :]
array([[0, 365],
    [365, 730],
    [730, 1095],],
      dtype=object)
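
In actual code, that construction could be as simple as this NumPy sketch (array names illustrative; 1175 is the number of years here):

import numpy as np

time = np.arange(1175) * 365  # days since 0850-01-01 in a noleap calendar
time_bnds = np.column_stack([time, time + 365])  # shape (1175, 2): [[0, 365], [365, 730], ...]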

Summary of things to fix (as I see them):

Then I think we're golden (or, at least, very close)

lchini commented 2 months ago

OK, I think I've taken care of that list now (it took me a while to figure out why the time bounds weren't working as expected!). I've uploaded a new set of files to the FTP server. Let me know how they look.

durack1 commented 2 months ago

@lchini this is great; I can confirm all files are valid.

A query about the time: these now look great, spanning the 850-2023 period (or to 2024-01-01 as a bound), but the filename suggests we have coverage from 850 to 2024. We need to rename the file I think, as our last time entry is 2023. @lchini can you confirm? See below.

(xcd061nctax) bash-4.2$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> fh = xr.open_dataset("../LouiseChini-landUseChange/20240927/multiple-management_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc")
>>> fh
<xarray.Dataset>
Dimensions:      (time: 1175, lat: 720, lon: 1440, nbnd: 2)
Coordinates:
  * time         (time) object 0850-01-01 00:00:00 ... 2024-01-01 00:00:00
  * lat          (lat) float64 89.88 89.62 89.38 89.12 ... -89.38 -89.62 -89.88
  * lon          (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
Dimensions without coordinates: nbnd
Data variables: (12/36)
    fertl_c3ann  (time, lat, lon) float32 ...
    irrig_c3ann  (time, lat, lon) float32 ...
    cpbf1_c3ann  (time, lat, lon) float32 ...
    fertl_c4ann  (time, lat, lon) float32 ...
    irrig_c4ann  (time, lat, lon) float32 ...
    cpbf1_c4ann  (time, lat, lon) float32 ...
    ...           ...
    prtct_primn  (time, lat, lon) float32 ...
    prtct_secdf  (time, lat, lon) float32 ...
    prtct_secdn  (time, lat, lon) float32 ...
    prtct_pltns  (time, lat, lon) float32 ...
    addtc        (time, lat, lon) float32 ...
    time_bnds    (time, nbnd) object ...
Attributes: (12/27)
    host:                UMD College Park
    creation_date:       2024-09-27T17:30:27Z
    Conventions:         CF-1.6
    data_structure:      grid
    dataset_category:    landState
    grid_label:          gn
    ...                  ...
    references:          Hurtt et al. 2020 (https://doi.org/10.5194/gmd-13-54...
    history:             Fri Sep 27 13:31:20 2024: ncrename -a ._Fillvalue,_F...
    variable_id:         multiple-management
    nominal_resolution:  25 km
    source_version:      3.0
    tracking_id:         hdl:21.14100/d444819c-035b-4663-999d-eff2ce8170ac

>>> fh["time_bnds"].values[:-1,:]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       ...,
       [cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2024, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

>>> fh["time"].values[:-1]
array([cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)

@znichollscr files in the path above, all 4.

znichollscr commented 2 months ago

Thanks Paul. Almost there @lchini! Now that I've seen all four files, there are a few more questions.

Overall questions

File by file

multiple-states_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc

multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc

multiple-management_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc

staticData_quarterdeg.nc

durack1 commented 2 months ago

@lchini I'd also note that I'd have to rename staticData_quarterdeg.nc, so if you're making any changes, a filename update to multiple-fixed_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn.nc would help, along with the variable_id = "multiple" noted above.

@znichollscr has already noted some other tweaks above https://github.com/PCMDI/input4MIPs_CVs/issues/123#issuecomment-2380860248 - these are very, very close!

lchini commented 2 months ago

Regarding the years that these datasets represent:

So, I would prefer to keep the filenames (and the years of data) as they are now. This is the way we have provided this data for many years now. Does this seem like a reasonable plan?

For the fertilizer units I think we can remove the "crop season" part. In theory we are providing the amount of fertilizer applied to the land per ha per year and per crop season, but since we don't actually represent double cropping in the dataset, and I don't think we have full consistency between the historical data and future scenarios on this point, I think we can remove the crop season part from the units. If we did end up feeling like that was a necessary part of the fertilizer units, is there another way that we should represent that in these files?

znichollscr commented 2 months ago

• the transitions file provides the land-use transition that occurs during the year that begins on January 1. So this file includes data for the years 850 to 2023, and does not have a time point for 2024

Makes sense.

So, I would prefer to keep the filenames (and the years of data) as they are now. This is the way we have provided this data for many years now. Does this seem like a reasonable plan?

If you mean that the filenames would be:

Yes

If we did end up feeling like that was a necessary part of the fertilizer units, is there another way that we should represent that in these files?

3 options I see:

As a note: if it's per year, shouldn't the number of crop seasons already be included (e.g. if there were 2 crop seasons in 1876, then the total application in the year would be twice as much as in a year with only 1 crop season)? Or are models meant to multiply this by the number of crop seasons in their model to get total application in a year?

durack1 commented 2 months ago

At this point, I wonder whether we're good enough for the v0 land use change dataset. @vnaik60, are these files usable for the NOAA-GFDL team?

I note that for any variable you can always add a per-variable comment (any attribute could be added), which provides some context for folks using these data.
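
e.g., a one-line attribute edit in netCDF4-python (a sketch; variable name taken from the listing above, wording illustrative):

import netCDF4

with netCDF4.Dataset("multiple-management.nc", "a") as ds:
    ds["fertl_c3ann"].comment = "Fertilizer applied per year; double cropping is not represented."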

So we'll need to rename the multiple-transitions* file to 2023, otherwise we're good to go, no?

@lchini would you prefer to make a couple more tweaks to address @znichollscr's questions, or are you good for publication to begin? As an FYI, publication likely wouldn't start until Thursday this week anyway, as @sashakames is travelling.