Land-use data upload - GitHub issues

znichollscr commented 6 days ago

Issue for tracking the progress and any issues related to the land-use data.

cc @lchini @durack1 @vnaik60

lchini commented 6 days ago

I am attempting to upload the files to ftp.llnl.gov. The instructions indicate to upload files to the "incoming" folder, but when I connect to the FTP server (using FileZilla on my Mac) there is just an empty root directory and no folder named "incoming". Should I just upload to that location or am I not connected correctly?

durack1 commented 6 days ago

Good question @lchini, we've had a similar query from @mjevanmarle this morning.

How about trying to connect explicitly using the IP address 198.128.250.1? That currently seems to work for me, see the screenshot. [Screenshot 2024-09-18 at 7 50 51 AM]

Once in, you should be able to navigate to the incoming subdirectory, create a new directory for yourself (e.g., UofMD-landState-3-0_240918; we might have to try a couple of times, hence the datestamp on the end), upload, and voilà.
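
As a minimal sketch of the naming suggestion above (the helper name `upload_dir_name` and the `attempt` counter are illustrative assumptions, not part of any tool mentioned in this thread), a datestamped directory name could be built like this:

```python
from datetime import date


def upload_dir_name(source_id: str, day: date, attempt: int = 0) -> str:
    """Build a datestamped directory name such as 'UofMD-landState-3-0_240918'."""
    name = f"{source_id}_{day:%y%m%d}"
    # Append an attempt counter only for retries, e.g. '..._240918_1'
    return f"{name}_{attempt}" if attempt > 0 else name


print(upload_dir_name("UofMD-landState-3-0", date(2024, 9, 18)))
print(upload_dir_name("UofMD-landState-3-0", date(2024, 9, 18), attempt=1))
```

Using a fresh name per attempt avoids trying to overwrite files from an earlier upload.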

znichollscr commented 6 days ago

Option b: here's a python script that should work. You'll need to install input4mips-validation first.

Python script

```python
import ftplib
import os
import traceback
from pathlib import Path

from input4mips_validation.upload_ftp import cd_v, login_to_ftp, mkdir_v, upload_file

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-2"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")
```

durack1 commented 6 days ago

@lchini @mjevanmarle it seems like there is a DNS/network issue that is causing problems for me when I am not connected to the LLNL institutional network. Weirdly, this isn't an issue for @znichollscr, so might be something that will just work itself out, or will need a nudge within the LLNL network.

This is what I see, which looks similar to @mjevanmarle's issue, and probably yours too, @lchini. [Screenshot 2024-09-18 at 7 58 34 AM]

I'll raise a ticket with the LLNL network folks to see if someone can check.

lchini commented 6 days ago

Yes that is the same issue that I'm experiencing. I've been trying to configure settings on my end but it sounds like I might need to wait for the LLNL server update.

znichollscr commented 6 days ago

@lchini can you try the python script and post the output here if it fails please?

znichollscr commented 6 days ago

Python script is here

import ftplib
import os
import traceback
from pathlib import Path

from input4mips_validation.upload_ftp import cd_v, login_to_ftp, mkdir_v, upload_file

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-2"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")

lchini commented 6 days ago

I tried to install input4mips-validation so that I could run the python script. I used pip since I didn't have mamba installed. Although pip did not return an error, I don't think the installation worked correctly because when I tried to run the python script I received the error: No module named 'input4mips_validation'
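
One possible cause (an assumption, not something confirmed in this thread) is that `pip` installed the package into a different Python than the one running the script. A minimal standard-library sketch to check what the running interpreter can see; the helper name `is_importable` is illustrative:

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the running interpreter can find the top-level `module_name`."""
    return importlib.util.find_spec(module_name) is not None


print(f"Interpreter in use: {sys.executable}")
print(f"input4mips_validation importable: {is_importable('input4mips_validation')}")
```

If this prints `False`, installing with `python -m pip install input4mips-validation`, using the same `python` that runs the script, ties the install to that interpreter.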

durack1 commented 6 days ago

I've tried to engage with LLNL comp support folks and had a tepid response, so if this doesn't begin to work next time we try, let's seek alternative paths to getting these data in the publication queues

znichollscr commented 5 days ago

> I tried to install input4mips-validation so that I could run the python script. I used pip since I didn't have mamba installed. Although pip did not return an error, I don't think the installation worked correctly because when I tried to run the python script I received the error: No module named 'input4mips_validation'

Hmmm that's unfortunate. Here's a version of the script without any dependencies that aren't in the standard library so it should just work with any Python >= 3.9. Can you try that please?

import ftplib
import os
import traceback
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import Optional

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-4"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

@contextmanager
def login_to_ftp(
    ftp_server: str, username: str, password: str, dry_run: bool
) -> Iterator[Optional[ftplib.FTP]]:
    """
    Create a connection to an FTP server.

    When the context block is exited, the connection is closed.

    If we are doing a dry run, `None` is returned instead
    to signal that no connection was actually made.
    We do, however, log messages to indicate what would have happened.

    Parameters
    ----------
    ftp_server
        FTP server to login to

    username
        Username

    password
        Password

    dry_run
        Is this a dry run?

        If `True`, we won't actually login to the FTP server.

    Yields
    ------
    :
        Connection to the FTP server.

        If it is a dry run, we simply return `None`.
    """
    if dry_run:
        print(f"Dry run. Would log in to {ftp_server} using {username=}")
        ftp = None

    else:
        ftp = ftplib.FTP(ftp_server, passwd=password, user=username)  # noqa: S321
        print(f"Logged into {ftp_server} using {username=}")

    yield ftp

    if ftp is None:
        if not dry_run:  # pragma: no cover
            raise AssertionError
        print(f"Dry run. Would close connection to {ftp_server}")

    else:
        ftp.quit()
        print(f"Closed connection to {ftp_server}")

def cd_v(dir_to_move_to: str, ftp: ftplib.FTP) -> ftplib.FTP:
    """
    Change directory verbosely

    Parameters
    ----------
    dir_to_move_to
        Directory to move to on the server

    ftp
        FTP connection

    Returns
    -------
    :
        The FTP connection
    """
    ftp.cwd(dir_to_move_to)
    print(f"Now in {ftp.pwd()} on FTP server")

    return ftp

def mkdir_v(dir_to_make: str, ftp: ftplib.FTP) -> None:
    """
    Make directory verbosely

    Also, don't fail if the directory already exists

    Parameters
    ----------
    dir_to_make
        Directory to make

    ftp
        FTP connection
    """
    try:
        print(f"Attempting to make {dir_to_make} on {ftp.host=}")
        ftp.mkd(dir_to_make)
        print(f"Made {dir_to_make} on {ftp.host=}")
    except ftplib.error_perm:
        print(f"{dir_to_make} already exists on {ftp.host=}")

def upload_file(
    file: Path,
    strip_pre_upload: Path,
    ftp_dir_upload_in: str,
    ftp: Optional[ftplib.FTP],
) -> Optional[ftplib.FTP]:
    """
    Upload a file to an FTP server

    Parameters
    ----------
    file
        File to upload.

        The full path of the file relative to `strip_pre_upload` will be uploaded.
        In other words, any directories in `file` will be made on the
        FTP server before uploading.

    strip_pre_upload
        The parts of the path that should be stripped before the file is uploaded.

        For example, if `file` is `/path/to/a/file/somewhere/file.nc`
        and `strip_pre_upload` is `/path/to/a`,
        then we will upload the file to `file/somewhere/file.nc` on the FTP server
        (relative to whatever directory the FTP server is in
        when we enter this function).

    ftp_dir_upload_in
        Directory on the FTP server in which to upload `file`
        (after removing `strip_pre_upload`).

    ftp
        FTP connection to use for the upload.

        If this is `None`, we assume this is a dry run.

    Returns
    -------
    :
        The FTP connection.

        If it is a dry run, this can simply be `None`.
    """
    print(f"Uploading {file}")
    if ftp is None:
        print(f"Dry run. Would cd on the FTP server to {ftp_dir_upload_in}")

    else:
        cd_v(ftp_dir_upload_in, ftp=ftp)

    filepath_upload = file.relative_to(strip_pre_upload)
    print(
        f"Relative to {ftp_dir_upload_in} on the FTP server, " f"will upload {file} to {filepath_upload}",
    )

    for parent in list(filepath_upload.parents)[::-1]:
        if parent == Path("."):
            continue

        to_make = parent.parts[-1]

        if ftp is None:
            print("Dry run. " "Would ensure existence of " f"and cd on the FTP server to {to_make}")

        else:
            mkdir_v(to_make, ftp=ftp)
            cd_v(to_make, ftp=ftp)

    if ftp is None:
        print(f"Dry run. Would upload {file}")

        return ftp

    with open(file, "rb") as fh:
        upload_command = f"STOR {file.name}"
        print(f"Upload command: {upload_command}")

        try:
            print(f"Initiating upload of {file}")
            ftp.storbinary(upload_command, fh)

            print(f"Successfully uploaded {file}")
        except ftplib.error_perm:
            print(
                f"{file.name} already exists on the server in {ftp.pwd()}. "
                "Use a different directory on the receiving server "
                "if you really wish to upload again."
            )
            raise

    return ftp

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")

lchini commented 5 days ago

Thanks for the new script Zeb. I'm running it now and it appears to be working, although it's hard to gauge progress on the other end. The first time I ran the script it appeared to be uploading ALL files within the given directory, so I canceled that and moved to a different folder. So there might be some half-uploaded files from that first run.

lchini commented 5 days ago

The script completed. Can someone else confirm that it was successful? I just uploaded a single file because I wanted to make sure everything looks OK with that one before sending the others. Let me know if there is anything I need to change with the format or metadata in the file. Also, I assume the filename will be changed from the name of the uploaded file?

znichollscr commented 5 days ago

> The first time I ran the script it appeared to be uploading ALL files within the given directory, so I canceled that and moved to a different folder. So there might be some half-uploaded files from that first run.

Ah yes it uploads every .nc file it can find, should have warned you about that probably :)
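
To avoid picking up unintended files, one option is to filter the recursive glob by an explicit set of names and preview the result before uploading. This is a minimal sketch, not part of the script above; `select_files` and the example file name are illustrative assumptions:

```python
from pathlib import Path


def select_files(directory: str, wanted_names: set[str]) -> list[Path]:
    """Return only the .nc files under `directory` whose names are in `wanted_names`."""
    return sorted(
        path for path in Path(directory).rglob("*.nc") if path.name in wanted_names
    )


# Preview before uploading anything; the file name here is only an example
for path in select_files(".", {"states.nc"}):
    print(path)
```

The upload loop could then iterate over this list instead of over every `*.nc` match.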

> Can someone else confirm that it was successful?

Hopefully @durack1 can take a look. Can you tell us which directory you uploaded in (i.e. the value of FTP_DIR_REL_TO_ROOT in the script)?

> I just uploaded a single file because I wanted to make sure everything looks OK with that one before sending the others. Let me know if there is anything I need to change with the format or metadata in the file. Also, I assume the filename will be changed from the name of the uploaded file?

Sounds good. We'll take a look and get back to you asap

durack1 commented 5 days ago

@lchini great! We're off. I could see the below, so if that looks right to you, mint a new upload dir and give us the lot. [Screenshot 2024-09-19 at 6 51 34 AM]

If you can also indicate what we're to expect (the number of files), then I can double check these and drop them into the publication queue, where we can run @znichollscr's validator to double check.

znichollscr commented 5 days ago

Alrighty looks like Paul found it so don't worry, we don't need any more info for now. I'll take a look and get back to you asap.

> Also, I assume the filename will be changed from the name of the uploaded file?

Yep we'll re-write that as part of putting the file in the DRS

durack1 commented 5 days ago

@znichollscr the 2 files are in the normal place - ../LouiseChini-landUseChange/20240919

znichollscr commented 5 days ago

Alrighty:

I'm assuming that management4.nc is a half uploaded file because I couldn't even read it with ncdump...

For states_new_vars2.nc:

- the source_id attribute in the file should be "UofMD-landState-3-0"
- the time units, "years since 850-01-01 0:0:0", cause xarray to explode, which isn't ideal. Could these be updated to "days since 850-01-01" and you update your time axis accordingly (just multiply all the values by 365)?
- there are no time bounds. For example, we have a "time_bnds" variable in our datasets where we specify the bounds of each timestep. The variable should be 2D, the first dimension being time and the second being the bounds (first value for each timestep is the start of the timestep, second value is the end of the timestep). So, if you have a time axis like [0, 1, 2] then time bounds would be something like [[0, 1], [1, 2], [2, 3]].
- on all variables, "_Fillvalue" should be renamed to "_FillValue"
- for the secma variable, either put a value in the "standard_name" or just remove it I think
- the missing value of pltns should be a float I think, at the moment it is a string
- wherever you have cell methods, there is a space missing after the colon (it should be "time: mean" not "time:mean"). (This bug causes iris to explode, which is not ideal)

Other than that, looks good I think
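
For the cell methods point, a minimal sketch of the string fix (the helper name `fix_cell_methods` is illustrative; how the corrected value gets written back to the file depends on the tooling in use):

```python
import re


def fix_cell_methods(cell_methods: str) -> str:
    """Insert a space after any colon that is immediately followed by a non-space."""
    return re.sub(r":(?=\S)", ": ", cell_methods)


print(fix_cell_methods("time:mean"))
```

Values that already contain the space are left unchanged, so the fix is safe to apply to every variable.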

durack1 commented 5 days ago

@znichollscr yep, looks like you're right: we have a bigger version of that file now, AND another transitions file. @lchini, so that I can wait until it's all up, what should we be expecting: how many files, and what are their filenames/sizes? [Screenshot 2024-09-19 at 10 36 32 AM]

I might wait until I've heard back from you, and wait until the complete set is down before I pull these across
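
A minimal sketch of how such a list of expected files could be produced on the uploading side (the helper name `build_manifest` is illustrative, and comparing sizes is only a rough check for half-uploaded files, not a checksum):

```python
from pathlib import Path


def build_manifest(directory: str, pattern: str = "*.nc") -> list[tuple[str, int]]:
    """Return (filename, size in bytes) for each matching file, sorted by name."""
    return sorted(
        (path.name, path.stat().st_size) for path in Path(directory).rglob(pattern)
    )


manifest = build_manifest(".")
print(f"{len(manifest)} file(s)")
for name, size in manifest:
    print(f"{name}\t{size} bytes")
```

Posting this output alongside the upload lets the receiving side confirm the file count and spot truncated files.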

lchini commented 5 days ago

The management4.nc file was uploaded in error when I didn't realize that the python script would upload all files in the given directory. So please delete that one. There are 4 files that I'll be uploading for states, transitions, and management, as well as a staticData file. The issues that Zeb pointed out with the states file will be issues in the transitions and management files too. I've already uploaded the transitions so will have to fix and re-upload that one as well as the states file, and I'll try to update the management file before I upload it.

lchini commented 5 days ago

For the time units, the product is annual. We originally created time units that give the actual year, e.g. 850, 851, 852, etc. I post-processed that to give years since 850, e.g.: 0,1,2,3 .... Should I revert to the original plan or switch to days since 850 as you suggested? We have 1175 years of data so a simple multiplication by 365 will end up missing quite a few days due to leap years.

durack1 commented 5 days ago

> So please delete that one

Unfortunately I can't do anything about deleting/moving/etc on this system, it's simply a dropbox. So, good to know: I'll purge it in our cop(ies) once I pull the complete file list down.

When you have the new data generated, upload this to a new directory e.g., UofMD-landState-3-0_240919_1 and that way we won't have problems with attempts to overwrite files etc, which likely won't work.

Also we have a standard template for the filenames (and directory structure, which I can impose once the files are down and their metadata matches what we expect), so this should be something like transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2023.nc
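
A minimal sketch of assembling a name in that shape. The field order is inferred only from the single example above, and the function and parameter names are illustrative assumptions; the actual DRS template may differ:

```python
def drs_style_filename(
    variable: str,
    activity_id: str,
    dataset_category: str,
    target_mip: str,
    source_id: str,
    grid_label: str,
    start_year: int,
    end_year: int,
) -> str:
    """Join the fields with underscores and append a zero-padded year range."""
    fields = [variable, activity_id, dataset_category, target_mip, source_id, grid_label]
    return "_".join(fields) + f"_{start_year:04d}-{end_year:04d}.nc"


print(
    drs_style_filename(
        "transitions", "input4MIPs", "landState", "CMIP",
        "UofMD-landState-3-0", "gn", 850, 2023,
    )
)
```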

durack1 commented 5 days ago

> For the time units, the product is annual. We originally created time units that give the actual year, e.g. 850, 851, 852, etc. I post-processed that to give years since 850, e.g.: 0,1,2,3 .... Should I revert to the original plan or switch to days since 850 as you suggested? We have 1175 years of data so a simple multiplication by 365 will end up missing quite a few days due to leap years.

UDUNITS defines a year to be exactly 365.242198781 days (the interval between 2 successive passages of the sun through vernal equinox, yes pedantic). So if we are mapping into days since, then we'd have to be careful about @znichollscr's suggested multiplication, as this will lead to problems toward the end of the record. In addition, as you span the Gregorian hop (1582-10-04 is followed by 1582-10-15 the next day), this is going to get a little weird. @lchini how are you writing these files, and with what software? The Python datetime library and cftime could help here.

lchini commented 5 days ago

Our model that generates the data and writes the files is written in C++. I am doing some post-processing on the files in MATLAB (just to add in the new variables that don't have computed data yet), and then doing more post-processing (modifying the time dimension, writing global attributes, etc) using NCO command-line tools.

I guess my question is: since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

durack1 commented 5 days ago

> I guess my question is: since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

To be honest, your file looks pretty good to me (ncdump -ct $file.nc):

variables:
        double time(time) ;
                time:axis = "T" ;
                time:calendar = "noleap" ;
                time:long_name = "time" ;
                time:realtopology = "linear" ;
                time:standard_name = "time" ;
                time:units = "years since 850-01-01 0:0:0" ;

...

data:

 time = "0850-01-01", "0851-01-01", "0852-01-01", "0853-01-01", "0854-01-01", 
    "0855-01-01", "0856-01-01", "0857-01-01", "0858-01-01", "0859-01-01", 
...
    "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01", "2019-01-01", 
    "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01" ;

The xarray warning is

>>> fh = xcdat.open_dataset("transitions_new_vars2.nc")
../lib/python3.11/site-packages/xarray/coding/times.py:167: SerializationWarning: Ambiguous reference
date string: 850-01-01 0:0:0. The first value is assumed to be the year hence will be padded with zeros
to remove the ambiguity (the padded reference date string is: 0850-01-01 0:0:0). To remove this
message, remove the ambiguity by padding your reference date strings with zeros.
  warnings.warn(warning_msg, SerializationWarning)
>>> fh.time.data
array([cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)

A quick tweak from 850-01-01 0:0:0 to 0850-01-01 0:0:0 might get you most of the way there. Adding a time_bnds variable would be ideal too, so that we satisfy CF requirements: the first year would be bounded by 0850-01-01 0:0:0 and 0851-01-01 0:0:0. This would also be cleaner if the time axis value were the central value within the annual time period, so 0850-07-02.
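
A minimal sketch of the padding tweak applied to a units string (the helper name `pad_reference_year` is illustrative; writing the attribute back to the file is left to whichever tool is in use, e.g. NCO):

```python
import re


def pad_reference_year(units: str) -> str:
    """Zero-pad a 1-3 digit reference year in a '<unit> since Y-M-D ...' string."""
    return re.sub(
        r"(since\s+)(\d{1,3})(?=-)",
        lambda m: f"{m.group(1)}{int(m.group(2)):04d}",
        units,
    )


print(pad_reference_year("years since 850-01-01 0:0:0"))
```

Strings whose year already has four digits are returned unchanged.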

durack1 commented 5 days ago

There's also a couple of inconsistencies in the file metadata vs what we are expecting,

// global attributes:
                :host = "UMD College Park" ;
                :creation_date = "2024-07-18T14:51:50Z" ;
                :Conventions = "CF-1.6" ;
                :data_structure = "grid" ;
                :dataset_category = "landState" ;
                :variable_id = "multiple" ;
                :grid_label = "gn" ;

                :mip_era = "CMIP6" ;  ## CMIP6Plus

                :license = "Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution \\\"Share Alike\\\" 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
                :further_info_url = "http://luh.umd.edu" ;
                :frequency = "yr" ;
                :institution_id = "UofMD" ;
                :institution = "University of Maryland (UofMD), College Park, MD 20742, USA" ;
                :realm = "land" ;
                :source = "LUH3 V0: Land-Use Harmonization Data Set for CMIP7" ;
                :comment = "LUH3 V0" ;
                :title = "UofMD LUH3 V0 dataset prepared for CMIP7" ;

                :activity_id = "CMIP7" ;  ### input4MIPs

                :dataset_version_number = "LUH3 V0" ;

                :source_id = "UofMD-landState-LUH3" ;  ## UofMD-landState-3-0

                :target_mip = "CMIP7" ;  ### CMIP

                :references = "Hurtt et al. 2020, Chini et al. 2021" ; ## Want to expand these with DOIs?

                :contact = "lchini@umd.edu, gchurtt@umd.edu" ;
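
A minimal sketch of checking global attributes against the values noted in the comments above. The expected values are taken from those `##` annotations; the plain dictionaries stand in for attributes read from the file, and the helper name `find_mismatches` is illustrative:

```python
# Expected values, taken from the "##" annotations in the dump above
EXPECTED = {
    "mip_era": "CMIP6Plus",
    "activity_id": "input4MIPs",
    "source_id": "UofMD-landState-3-0",
    "target_mip": "CMIP",
}


def find_mismatches(attrs: dict, expected: dict) -> dict:
    """Map each mismatching attribute name to (value found, value expected)."""
    return {
        key: (attrs.get(key), want)
        for key, want in expected.items()
        if attrs.get(key) != want
    }


# Values currently in the file, copied from the dump above
file_attrs = {
    "mip_era": "CMIP6",
    "activity_id": "CMIP7",
    "source_id": "UofMD-landState-LUH3",
    "target_mip": "CMIP7",
    "grid_label": "gn",
}

for key, (found, want) in find_mismatches(file_attrs, EXPECTED).items():
    print(f"{key}: found {found!r}, expected {want!r}")
```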

znichollscr commented 5 days ago

> Should I revert to the original plan or switch to days since 850 as you suggested? We have 1175 years of data so a simple multiplication by 365 will end up missing quite a few days due to leap years.

This is completely non-obvious unless you love the CF-conventions, but because you're using a 'noleap' calendar, every year in your calendar has exactly 365 days. Hence, you can do the multiplication by 365 without an issue (just don't change the calendar attribute of your time variable!).

> UDUNITS defines a year to be exactly 365.242198781 days (the interval between 2 successive passages of the sun through vernal equinox, yes pedantic). So if we are mapping into days since, then we'd have to be careful about @znichollscr's suggested multiplication, as this will lead to problems toward the end of the record. In addition, as you span the Gregorian hop (1582-10-04 is followed by 1582-10-15 the next day), this is going to get a little weird.

See above. Because of the calendar attribute, UDUNITS doesn't come into it and just multiplying by 365 is fine (again, this statement only applies because of the "noleap" calendar).
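
To make the difference concrete, here is a small arithmetic sketch comparing the two year lengths discussed in this thread for the last time step (2023, counted from 850). The constant names are illustrative:

```python
UDUNITS_YEAR_DAYS = 365.242198781  # year length quoted earlier in this thread
NOLEAP_YEAR_DAYS = 365             # every year in a "noleap" calendar

last_offset_years = 2023 - 850  # offset of the final time step from the reference year

days_noleap = last_offset_years * NOLEAP_YEAR_DAYS
days_udunits = last_offset_years * UDUNITS_YEAR_DAYS
drift = days_udunits - days_noleap

print(f"noleap:  {days_noleap} days")
print(f"UDUNITS: {days_udunits:.3f} days")
print(f"difference at the end of the record: {drift:.1f} days")
```

With the "noleap" calendar attribute in place, the integer multiples of 365 are the values that decode back to 1 January of each year.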

> since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

As above, because of the calendar, converting to days is trivial. The reason I would (strongly) recommend doing this is that the data doesn't load properly with xarray if the time units are "years since" rather than "days since". This is a bug in xarray, but given it is such a widely used tool, I would recommend making this tweak (particularly given how trivial it is).

> The xarray warning is

Note here @durack1 that you've loaded with xcdat, not xarray (I assume xcdat is better up to speed with CF-conventions than xarray/cftime, which is what raises the original error). If you try to load with xarray you get:

Full xarray error:

```python
>>> import xarray as xr
>>> xr.open_dataset("states_new_vars2.nc", use_cftime=True)
Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 218, in _decode_cf_datetime_dtype
    result = decode_cf_datetime(example_value, units, calendar, use_cftime)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 349, in decode_cf_datetime
    dates = _decode_datetime_with_cftime(flat_num_dates, units, calendar)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 242, in _decode_datetime_with_cftime
    cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
  File "src/cftime/_cftime.pyx", line 587, in cftime._cftime.num2date
  File "src/cftime/_cftime.pyx", line 105, in cftime._cftime._dateparse
ValueError: In general, units must be one of 'microseconds', 'milliseconds', 'seconds', 'minutes', 'hours', or 'days' (or select abbreviated versions of these). For the '360_day' calendar, 'months' can also be used, or for the 'noleap' calendar 'common_years' can also be used. Got 'years' instead, which are not recognized.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 450, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
                  ^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 291, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 992, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 228, in _decode_cf_datetime_dtype
    raise ValueError(msg)
ValueError: unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 1, in
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/api.py", line 588, in open_dataset
    backend_ds = backend.open_dataset(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 659, in open_dataset
    ds = store_entrypoint.open_dataset(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/store.py", line 46, in open_dataset
    vars, attrs, coord_names = conventions.decode_cf_variables(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 461, in decode_cf_variables
    raise type(e)(f"Failed to decode variable {k!r}: {e}") from e
ValueError: Failed to decode variable 'time': unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.
```

> so this should be something like transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2023.nc

I don't think this matters for us though does it Paul? We'll just re-write with the correct name and save @lchini the headache? If you do want to write it yourself, the current DRS suggests the filename should start with "multiple-*" (e.g. multiple-transitions, multiple-states) because there are multiple variables in the file.

lchini commented 4 days ago

Thanks for this info. I think most of these changes will be easy to implement and I will get started on it right away. The time variable and time bounds should also be OK, but I just wanted to make sure I get this right. As I understand it, the plan is the following:

1) Change the time variable to be days since 850-1-1 0:0:0 and multiply the current values by 365 to convert them (this is OK since my calendar does not include leap years, and I assume it is not impacted by the 1582 calendar jump). So we will have time values of 0, 365, 730, ...
2) Add a time bounds variable that is 2-dimensional: one dimension will be the same size as the time variable and the other will have a length of 2. Since the time variable will be days since 850, i.e. 0, 365, 730, 1095, ..., the time bounds variable will be [[0,365],[365,730],[730,1095],...]

Does this sound correct?
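
The plan above can be sketched in a few lines of plain Python. This only illustrates the arithmetic; it assumes the "noleap" calendar is kept, and the function name `years_to_days` is illustrative:

```python
DAYS_PER_NOLEAP_YEAR = 365


def years_to_days(years_since_start: list[int]) -> tuple[list[int], list[list[int]]]:
    """Convert 'years since' offsets to 'days since' values plus [start, end] bounds."""
    time = [year * DAYS_PER_NOLEAP_YEAR for year in years_since_start]
    # Each annual step spans from its own start to the start of the next year
    time_bnds = [[t, t + DAYS_PER_NOLEAP_YEAR] for t in time]
    return time, time_bnds


time, time_bnds = years_to_days([0, 1, 2])
print(time)       # [0, 365, 730]
print(time_bnds)  # [[0, 365], [365, 730], [730, 1095]]
```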

Questions:

1) Do I also need to change 850 to 0850?
2) Regarding the idea of making the time variable the central value for the year, i.e. 0850-07-02, we actually consider the land use states in each year to be the states on Jan 1, so I would prefer not to make that change. The transitions in the year 850 are actually the transitions that occur during that year (from Jan 1 850 to Jan 1 851) so they are not really tied to a specific date but span that time period. So, do I need to change anything here to reflect that or leave it as it is?

znichollscr commented 4 days ago

change the time ...

All correct. (The 1582 calendar change also doesn't matter as all you're really saying with your data is, "this is the start of year" state, which is what the approach you're taking will do.)

2. add a time bounds variable...

Spot on. I think the variable is meant to be called time_bnds according to the CF conventions. When you do this, it's also recommended (perhaps required) to add a "bounds" attribute to the "time" variable that has the value "time_bnds". (In Python that would be something like ds["time"].setncattr("bounds", "time_bnds"). I know you don't use Python, but that might help make things clearer.)

Do I also need to change 850 to 0850?

I don't think it matters, but I don't think it will hurt either, and it will make things easier for tools that expect 4 digits in their year, so I would do this if it were me (I'm assuming it is a very easy change).
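To make the padding concrete, here is a small sketch using only the Python standard library. `normalise_time_units` is a hypothetical helper, and the regular expression assumes the units string has the exact "<unit> since Y-M-D h:m:s" shape seen in this thread. The last few lines show one example of a strict parser (`date.fromisoformat`) that accepts the padded date but rejects the unpadded one.

```python
import re
from datetime import date


def normalise_time_units(units: str) -> str:
    """Zero-pad the reference date and time in a '<unit> since ...' string."""
    match = re.fullmatch(
        r"(\w+) since (\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+)", units
    )
    if match is None:
        raise ValueError(f"unrecognised time units: {units!r}")
    unit = match.group(1)
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[1:])
    return (
        f"{unit} since {year:04d}-{month:02d}-{day:02d} "
        f"{hour:02d}:{minute:02d}:{second:02d}"
    )


print(normalise_time_units("days since 850-1-1 0:0:0"))
# days since 0850-01-01 00:00:00

# A strict ISO parser accepts the padded date but rejects the unpadded one.
print(date.fromisoformat("0850-01-01"))
try:
    date.fromisoformat("850-1-1")
except ValueError as exc:
    print(f"rejected: {exc}")
```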

2. So, do I need to change anything here to reflect that or leave it as it is?

Given the info you have provided, I would leave as is.

znichollscr commented 4 days ago

I think most of these changes will be easy to implement and I will get started on it right away

Exciting!

durack1 commented 4 days ago

Everything that @znichollscr said is spot on.

Do I also need to change 850 to 0850?

I would. Several software packages are finicky about these things, and vanilla xarray (or, more correctly, pandas) is one of them. Also, as implied in the CF conventions documentation, "years since" is not a recommended time unit, for exactly the calendar reasons discussed above. Best to avoid such tripwires if we can. As an aside, having "days since 0000-01-01 0:0:0" also leads many software packages to blow up, as a zero year doesn't exist in any calendar; the creative ways software deals with this are interesting.

2. Regarding the idea of making the time variable the central value for the year, i.e. 0850-07-02, we actually consider the land use states in each year to be the states on jan 1, so I would prefer not to make that change.

Also fine with me. The bounds capture the time range that these states are valid for, and you have good reason to specify the start day of the year, so also fine.

I think you've got the intel you need from us, @lchini, so please chime in if there are remaining questions. Once you have a file ready to validate, just upload it (the smallest of these) and we can quickly check things out, get any feedback to you and then get these finalized and published.

Margreet's biomass burning emissions data was just dropped into the ESGF publication queue this afternoon, so we are getting close to the target of nearly all datasets in place... Woo hoo!

lchini commented 1 day ago

I've made the requested changes to the land-use states file and uploaded it to the FTP server. I will work on making the same changes to the other files now as well. If you see anything in the new file that is not quite right, let me know and I will fix it.

durack1 commented 1 day ago

@lchini this looks good to me, the issues highlighted above (https://github.com/PCMDI/input4MIPs_CVs/issues/123#issuecomment-2361918646) are fixed.

It seems you've hardcoded :creation_date = "2024-07-18T14:51:36Z"; you might want to generate this automatically, so we don't have old info lurking around in files.

For a python example (which may or may not be useful), see below

$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
2024-09-23T21:33:04Z
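As a side note, `datetime.datetime.utcnow()` is deprecated as of Python 3.12. A timezone-aware variant that produces the same string format might look like the sketch below; `creation_date_now` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def creation_date_now() -> str:
    """Return the current UTC time formatted like the creation_date attribute."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(creation_date_now())
```

Calling this at write time, rather than storing a fixed string, keeps the attribute in step with when the file was actually produced.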

@znichollscr the single file is on nimbus ../LouiseChini-landUseChange/20240923

Also, a question for you: do you want to rename these files so that what you are producing is consistent with what will be downloaded from ESGF? This is optional, but we will confuse folks if we have inconsistent filenames from differing sources, even if their content is identical. @znichollscr highlighted the renaming above (https://github.com/PCMDI/input4MIPs_CVs/issues/123#issuecomment-2362056200)

durack1 commented 1 day ago

And just adding another note, looks like the time axis fix has solved the xarray read problems, at least for me

$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> fh = xr.open_dataset("../LouiseChini-landUseChange/20240923/states_new_vars3.nc")
>>> fh
<xarray.Dataset>
Dimensions:    (time: 1175, lat: 720, lon: 1440, nbnd: 2)
Coordinates:
  * time       (time) object 0850-01-01 00:00:00 ... 2024-01-01 00:00:00
  * lat        (lat) float64 89.88 89.62 89.38 89.12 ... -89.38 -89.62 -89.88
  * lon        (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
Dimensions without coordinates: nbnd
Data variables: (12/16)
    primf      (time, lat, lon) float32 ...
    primn      (time, lat, lon) float32 ...
    secdf      (time, lat, lon) float32 ...
    secdn      (time, lat, lon) float32 ...
    urban      (time, lat, lon) float32 ...
    c3ann      (time, lat, lon) float32 ...
    ...         ...
    pastr      (time, lat, lon) float32 ...
    range      (time, lat, lon) float32 ...
    secmb      (time, lat, lon) float32 ...
    secma      (time, lat, lon) float32 ...
    pltns      (time, lat, lon) float32 ...
    time_bnds  (nbnd, time) object ...
Attributes: (12/25)
    host:                    UMD College Park
    creation_date:           2024-07-18T14:51:36Z
    Conventions:             CF-1.6
    data_structure:          grid
    dataset_category:        landState
    variable_id:             multiple
    ...                      ...
    source_id:               UofMD-landState-3-0
    target_mip:              CMIP
    mip_era:                 CMIP6Plus
    references:              Hurtt et al. 2020 (https://doi.org/10.5194/gmd-1...
    history:                 Mon Sep 23 13:31:22 2024: ncrename -a ._Fillvalu...
    NCO:                     netCDF Operators version 5.0.0 (Homepage = HTTP:...
>>>

znichollscr commented 11 hours ago

Hi @lchini looking good. Tweaks from this round below:

- The variable time_bnds shouldn't have any attributes (the convention, as I understand it, is that its attributes are all assumed to be the same as time).
- There shouldn't be any "cell_methods" attribute for time, lat or lon (these variables aren't the time mean of something else).
- I don't think there is a standard name for secma (at least, searching here didn't show anything obvious: https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html), so I would delete the "standard_name" attribute from secma; the long_name alone is enough.
- It turns out the conventions with time bounds are trickier than I realised. The time dimension has to come first, so its dimensions should be (time, bnds), not (bnds, time) as you currently have. With that tweak, I think it should be pretty easy to write. Something like time_bnds[:, 0] = time, time_bnds[:, 1] = time + 365.
- As Paul mentioned, to be safe I would change the time units attribute from "days since 0850-01-01 0:0:0" to "days since 0850-01-01 00:00:00", i.e. write "00:00:00" instead of "0:0:0" in the time units.
- As Paul says, if you want to rename your file, the filename for this example would be multiple-states_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc. The transition file would be multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc. Others would be of the form multiple-<other-id>_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc.
  - The "variable_id" attribute should then match the "multiple-*" prefix of the filename, so for your states file, variable ID should be "multiple-states", for the transitions file it should be "multiple-transitions", and for others it would be "multiple-<other-id>".
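As a rough sketch of the time bounds layout described in the list above, the plain-Python snippet below flips bounds stored as (bnds, time) into (time, bnds) and checks each row against time and time + 365. Both function names are hypothetical, and the 365-day step assumes the noleap calendar used in this dataset.

```python
def transpose_bounds(bounds_bnds_time):
    """Turn bounds stored as (bnds, time) into (time, bnds) rows."""
    return [list(pair) for pair in zip(*bounds_bnds_time)]


def check_time_bounds(time, time_bnds, days_per_year=365):
    """Return a list of problems found; an empty list means the layout looks right."""
    problems = []
    if len(time_bnds) != len(time):
        problems.append("leading dimension of time_bnds must match time")
        return problems
    for i, (t, bounds) in enumerate(zip(time, time_bnds)):
        if len(bounds) != 2:
            problems.append(f"row {i} must hold exactly two values")
        elif bounds[0] != t or bounds[1] != t + days_per_year:
            problems.append(f"row {i} should be [{t}, {t + days_per_year}]")
    return problems


time = [0, 365, 730]
wrong_layout = [[0, 365, 730], [365, 730, 1095]]  # stored as (bnds, time)
print(check_time_bounds(time, wrong_layout))

fixed = transpose_bounds(wrong_layout)  # now (time, bnds)
print(fixed)
print(check_time_bounds(time, fixed))
```

With array libraries the same fix is usually a single transpose of the existing variable, so the values themselves do not need to be recomputed.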

Thanks!

PCMDI / input4MIPs_CVs

Land-use data upload #123